TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox


The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:


This becomes, in the case of the square root of n, = x2 - n:where f(x)


An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:


There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:


the first two terms of which yield, in the case where f(x) = SQRT(x):


The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:



In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1


The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   long double   num, num2;
   long startTime, endTime, time;
   short i;

   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   while (1);                              // repeat until Quit


PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.

The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.


Community Search:
MacTech Search:

Software Updates via MacUpdate

Xcode 8.1 - Integrated development envir...
Xcode includes everything developers need to create great applications for Mac, iPhone, iPad, and Apple Watch. Xcode provides developers a unified workflow for user interface design, coding, testing... Read more
Apple Numbers 4.0.5 - Apple's sprea...
With Apple Numbers, sophisticated spreadsheets are just the start. The whole sheet is your canvas. Just add dramatic interactive charts, tables, and images that paint a revealing picture of your data... Read more
Paperless 2.3.7 - $49.95
Paperless is a digital documents manager. Remember when everyone talked about how we would soon be a paperless society? Now it seems like we use paper more than ever. Let's face it - we need and we... Read more
DEVONthink Pro 2.9.6 - Knowledge base, i...
DEVONthink Pro is your essential assistant for today's world, where almost everything is digital. From shopping receipts to important research papers, your life often fills your hard drive in the... Read more
Safari Technology Preview 10.1 - The new...
Safari Technology Preview contains the most recent additions and improvements to WebKit and the latest advances in Safari web technologies. And once installed, you will receive notifications of... Read more
VueScan 9.5.60 - Scanner software with a...
VueScan is a scanning program that works with most high-quality flatbed and film scanners to produce scans that have excellent color fidelity and color balance. VueScan is easy to use, and has... Read more
Civilization VI 1.0.0 - Next iteration o...
Sid Meier’s Civilization VI is the next entry in the popular Civilization franchise. Originally created by legendary game designer Sid Meier, Civilization is a strategy game in which you attempt to... Read more
Adobe Flash Player - Plug-in...
Adobe Flash Player is a cross-platform, browser-based application runtime that provides uncompromised viewing of expressive applications, content, and videos across browsers and operating systems.... Read more
RestoreMeNot 2.0.4 - Disable window rest...
RestoreMeNot provides a simple way to disable the window restoration for individual applications so that you can fine-tune this behavior to suit your needs. Please note that RestoreMeNot is designed... Read more
Persecond 1.0.7 - Timelapse video made e...
Persecond is the easy, fun way to create a beautiful timelapse video. Import an image sequence from any camera, trim the length of your video, adjust the speed and playback direction, and you’re done... Read more

Latest Forum Discussions

See All

The Forgotten Room (Games)
The Forgotten Room 1.0.1 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0.1 (iTunes) Description: Play as paranormal investigator John “Buster of Ghosts” Murr as he explores yet another mysteriously creepy house. This... | Read more »
5 Halloween mobile games for wimps
If you're anything like me, horror games are a great way to have nightly nightmares for the next decade or three. They're off limits, but perhaps you want to get in on the Halloween celebrations in some way. Fortunately not all Halloween themed... | Read more »
The 5 scariest mobile games
It's the most wonderful time of the year for people who enjoy scaring themselves silly with haunted houses, movies, video games, and what have you. Mobile might not be the first platform you'd turn to for quality scares, but rest assured there are... | Read more »
Lifeline: Flatline (Games)
Lifeline: Flatline 1.0.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0.0 (iTunes) Description: The Lifeline series takes a terrifying turn in this interactive horror experience. Every decision you make could help... | Read more »
Game of Dice is now available on Faceboo...
After celebrating its anniversary in style with a brand new update, there’s even more excitement in store for Game of Dice has after just being launched on Facebook Gameroom. A relatively new platform, Facebook Gameroom has been designed for PC... | Read more »
4 addictive clicker games like Best Fien...
Clickers are passive games that take advantage of basic human psychology to suck you in, and they're totally unashamed of that. As long as you're aware that this game has been created to take hold of your brain and leave you perfectly content to... | Read more »
Smile Inc. Guide: How not to die on the...
As if Mondays weren't bad enough, at Smile Inc. you have to deal with giant killer donuts, massive hungry staplers, and blasting zones. It's not exactly a happy, thriving work environment. In fact, you'll be lucky to survive the nine to five.... | Read more »
Oh...Sir! The Insult Simulator (Games)
Oh...Sir! The Insult Simulator 1.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0 (iTunes) Description: | Read more »
WitchSpring2 (Games)
WitchSpring2 1.27 Device: iOS Universal Category: Games Price: $3.99, Version: 1.27 (iTunes) Description: This is the story of Luna, the Moonlight Witch as she sets out into the world. This is a sequel to Witch Spring. Witch Spring 2... | Read more »
4 popular apps getting a Halloween makeo...
'Tis the season for all things spooky. So much, so, in fact, that even apps are getting into the spirt of things, dressing up in costume and spreading jack o' lanterns all about the place. These updates bring frightening new character skins, scary... | Read more »

Price Scanner via

Apple reveals new next-generation 15″ and 13″...
Apple today revealed their next-generation 15″ and 13″ MacBook Pros. The new models are thinner and lighter than before with a new aluminum design featuring an enhanced keyboard with retina, multi-... Read more
Worldwide Smartphone Shipments Up 1.0% Year o...
According to preliminary results from the International Data Corporation (IDC) Worldwide Quarterly Mobile Phone Tracker, vendors shipped a total of 362.9 million smartphones worldwide in the third... Read more
TuneBand Arm Band For iPhone 7 and 7 Plus Rel...
Grantwood Technology has added the TuneBand for iPhone 7 and 7 Plus to its smartphone armband series. The TuneBand provides a lightweight and comfortable way to wear the iPhone while running,... Read more
1.4GHz Mac mini on sale for $449, save $50
Adorama has the 1.4GHz Mac mini on sale for $50 off MSRP including free shipping plus NY & NJ sales tax only: - 1.4GHz Mac mini (Apple sku# MGEM2LL/A): $449 $50 off MSRP To purchase a mini at... Read more
21-inch 1.6GHz iMac on sale for $999, save $1...
B&H has the 21″ 1.6GHz Apple iMac on sale for $999 including free shipping plus NY sales tax only. Their price is $100 off MSRP. Read more
Macs’ Superior Enterprise Deployment Cost Eco...
IBM’s debunking of conventional wisdom and popular mythology about the relative cost of using Apple Mac computers as opposed to PCs running Microsoft Windows at the sixth annual Jamf Nation User... Read more
12-inch WiFi Apple iPad Pros on sale for $50-...
B&H Photo has 12″ WiFi Apple iPad Pros on sale for $50-$70 off MSRP, each including free shipping. B&H charges sales tax in NY only: - 12″ Space Gray 32GB WiFi iPad Pro: $749 $50 off MSRP -... Read more
Apple refurbished 12-inch 128GB iPad Pros ava...
Apple has Certified Refurbished 12″ Apple iPad Pros available for up to $160 off the cost of new iPads. An Apple one-year warranty is included with each model, and shipping is free: - 32GB 12″ iPad... Read more
Apple refurbished iPad minis and iPad Air 2s...
Apple recently dropped prices on several Certified Refurbished iPad mini 4s and 2s as well as iPad Air 2s. An Apple one-year warranty is included with each model, and shipping is free: - 16GB iPad... Read more
MacHTTP-js Preview Full-featured Web Server f...
MacHTTP.Org has released MacHTTP-js Preview for macOS, a full-featured Web server for 21st Century desktops and servers. MacHTTP-js is a modern take on the classic stand-alone, desktop computer Web... Read more

Jobs Board

*Apple* Retail - Multiple Positions - Apple,...
Job Description: Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
Software Engineering Intern: UI Applications...
Job Summary Apple is currently seeking enthusiastic interns who can work full-time for a minimum of 12-weeks between Fall 2015 and Summer 2016. Our software Read more
Security Data Analyst - *Apple* Information...
…data sources need to be collected to allow Information Security to better protect Apple employees and customers from a wide range of threats.Act as the subject Read more
*Apple* Retail - Multiple Positions - Apple,...
Job Description: Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
*Apple* Solutions Consultant - Apple (United...
# Apple Solutions Consultant Job Number: 52812872 Houston, Texas, United States Posted: Oct. 18, 2016 Weekly Hours: 40.00 **Job Summary** As an Apple Solutions Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.