TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox


The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:


This becomes, in the case of the square root of n, = x2 - n:where f(x)


An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:


There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:


the first two terms of which yield, in the case where f(x) = SQRT(x):


The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:



In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1


The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   long double   num, num2;
   long startTime, endTime, time;
   short i;

   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   while (1);                              // repeat until Quit


PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.

The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.


Community Search:
MacTech Search:

Software Updates via MacUpdate

Nisus Writer Pro 2.1.2 - Multilingual wo...
Nisus Writer Pro is a powerful multilingual word processor, similar to its entry level products, but brings new features such as table of contents, indexing, bookmarks, widow and orphan control,... Read more
calibre 2.40.0 - Complete e-book library...
Calibre is a complete e-book library manager. Organize your collection, convert your books to multiple formats, and sync with all of your devices. Let Calibre be your multi-tasking digital librarian... Read more
Vivaldi - An advanced browser...
Vivaldi is a browser for our friends. In 1994, two programmers started working on a web browser. Our idea was to make a really fast browser, capable of running on limited hardware, keeping in mind... Read more
OmniPlan 3.0 - Robust project management...
With OmniPlan, you can create logical, manageable project plans with Gantt charts, schedules, summaries, milestones, and critical paths. Break down the tasks needed to make your project a success,... Read more
Yummy FTP 1.11 - FTP/SFTP/FTPS client fo...
Yummy FTP is an FTP + SFTP + FTPS file transfer client which focuses on speed, reliability and productivity. Whether you need to transfer a few files or a few thousand, schedule automatic backups, or... Read more
Tweetbot 2.1 - Popular Twitter client. (...
Tweetbot is a full-featured OS X Twitter client with a lot of personality. Whether it's the meticulously-crafted interface, sounds and animation, or features like multiple timelines and column views... Read more
MacPilot 8.0 - Enable over 1,200 hidden...
MacPilot gives you the power of UNIX and the simplicity of Macintosh, which means a phenomenal amount of untapped power in your hands! Use MacPilot to unlock over 1,200 features, and access them all... Read more
Typinator 6.7 - Speedy and reliable text...
Typinator turbo-charges your typing productivity. Type a little. Typinator does the rest. We've all faced projects that require repetitive typing tasks. With Typinator, you can store commonly used... Read more
Adobe Lightroom 6.2 - Import, develop, a...
Adobe Lightroom is available as part of Adobe Creative Cloud for as little as $9.99/month bundled with Photoshop CC as part of the photography package. Lightroom 6 is also available for purchase as a... Read more
ForeverSave 2.1.4 - Universal auto-save...
ForeverSave auto-saves all documents you're working on while simultaneously doing backup versioning in the background. Lost data can be quickly restored at any time. Losing data, caused by... Read more

Sago Mini Babies (Education)
Sago Mini Babies 1.0 Device: iOS Universal Category: Education Price: $2.99, Version: 1.0 (iTunes) Description: Introducing the Sago Mini babies. Boys and girls love caring for these adorable characters. Feed Robin her favorite mush... | Read more »
PAUSE - Relaxation at your fingertip (H...
PAUSE - Relaxation at your fingertip 1.1 Device: iOS iPhone Category: Healthcare & Fitness Price: $1.99, Version: 1.1 (iTunes) Description: | Read more »
Super Sharp (Games)
Super Sharp 1.1 Device: iOS Universal Category: Games Price: $1.99, Version: 1.1 (iTunes) Description: Your finger has never been so sharp! Cut with skill to complete the 120 ingenious physics levels of Super Sharp and become a cut... | Read more »
Assembly - Graphic design for everyone...
Assembly - Graphic design for everyone 1.0 Device: iOS Universal Category: Photography Price: $2.99, Version: 1.0 (iTunes) Description: Assembly is the easiest most powerful design tool on the App Store. Create anything you can... | Read more »
Dub Dash (Games)
Dub Dash 1.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0 (iTunes) Description: ARE YOU READY FOR THE ULTIMATE CHALLENGE? UNIQUE SYMBIOSIS OF MUSIC AND GRAPHICS | Read more »
Leave Me Alone (Games)
Leave Me Alone 1.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0 (iTunes) Description: 33% off launch sale!!! Somewhere between the 1980s and 1990s there exists a world that never was. A world of skatepunks,... | Read more »
YAMGUN (Games)
YAMGUN 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: The invasion has begun! Protect the walls of the citadel against waves of enemies! But watch out, you will soon run out of ammo...... | Read more »
Chesh (Games)
Chesh 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: It’s like chess, only not at all. ***40% off for a limited time to celebrate our launch!*** Chesh is a game of skill, strategy, luck,... | Read more »
Dust: An Elysian Tail (Games)
Dust: An Elysian Tail 1.051 Device: iOS Universal Category: Games Price: $5.99, Version: 1.051 (iTunes) Description: *Dust: An Elysian Tail is a visually rich experience and thus requires devices with 1GB RAM**Please note that... | Read more »
Royal Bounty HD (Games)
Royal Bounty HD 1.0 Device: iOS Universal Category: Games Price: $4.99, Version: 1.0 (iTunes) Description: New World Computing Approved "Hi Guys! looks good so far! keep up the good work. I worked on HoMM 3 and 4 creating all of the... | Read more »

Price Scanner via

Apple’s Education discount saves up to $300 o...
Purchase a new Mac or iPad using Apple’s Education Store and take up to $300 off MSRP. All teachers, students, and staff of any educational institution qualify for the discount. Shipping is free, and... Read more
Save up to $350 with Apple refurbished iMacs
Apple has Certified Refurbished iMacs available for up to $350 off the cost of new models. Apple’s one-year warranty is standard, and shipping is free: - 27″ 3.5GHz 5K iMac – $1949 $350 off MSRP - 27... Read more
Mac Pros on sale for up to $300 off MSRP
B&H Photo has Mac Pros on sale for up to $300 off MSRP. Shipping is free, and B&H charges sales tax in NY only: - 3.7GHz 4-core Mac Pro: $2818.99, $181 off MSRP - 3.5GHz 6-core Mac Pro: $3699... Read more
5K iMacs on sale for up to $150 off MSRP, fre...
B&H Photo has the 27″ 3.3GHz 5K iMac on sale for $1899.99 including free shipping plus NY tax only. Their price is $100 off MSRP. They have the 27″ 3.5GHz 5K iMac on sale for $2149, $150 off MSRP... Read more
Twelve South Redesigns BookArc For Today’s Sm...
Twelve South has announced a redesigned version of their very first product, BookArc for MacBook. Tailored specifically for the newest generation of MacBooks, BookArc holds the new, smaller Apple... Read more
Phone 6s Tips & Tricks – Tips Book For iP...
Poole, United Kingdom based Tap Guides Ltd. has announced the release and immediate availability of iPhone 6s Tips & Tricks, an in-depth eBook available in the iBookstore that’s priced just $2.99... Read more
Apple refurbished 2014 13-inch Retina MacBook...
Apple has Certified Refurbished 2014 13″ Retina MacBook Pros available for up to $400 off original MSRP, starting at $979. An Apple one-year warranty is included with each model, and shipping is free... Read more
13-inch 2.5GHz MacBook Pro on sale for $994,...
Best Buy has the 13″ 2.5GHz MacBook Pro available for $994.99 on their online store. Choose free shipping or free instant local store pickup (if available). Their price is $105 off MSRP. Price valid... Read more
Is The iPad Pro Really A Serious Laptop Repla...
Probably not, at least for productive professionals and other power users. Steve Jobs declared that we’d entered the “post-PC Era” with the advent of the original iPad in 2010, a phrase we don’t hear... Read more
Wednesday Deal: 13-inch Retina MacBook Pros f...
Adorama has 13″ Retina MacBook Pros on sale for up to $130 off MSRP. Shipping is free, and Adorama charges sales tax for NY & NJ residents only: - 13″ 2.7GHz/128GB Retina MacBook Pro: $1199.99 $... Read more

Jobs Board

*Apple* Retail Online, Customer Experience R...
**Job Summary** Apple Retail's Online Store sells Apple products to customers around the world. In addition to selling Apple products with unique services such Read more
Frameworks Engineer, *Apple* Watch - Apple...
**Job Summary** Join the team that is shaping the future of software development for Apple Watch! As a software engineer on the Apple Watch Frameworks team you will Read more
Senior Manager, Global Direct Marketing *App...
**Job Summary** Apple Retail is looking for an experienced Direct Marketing Leader to join its Marketing team. This position will take a leadership role in creating Read more
*Apple* Online Store Expansion - Apple (Unit...
**Job Summary** The Online Apple Store is seeking a person to lead its expansion into new countries. Based in Cupertino, CA, this person will develop and maintain an Read more
*Apple* Retail - Multiple Customer Support P...
Job Description:Customer SupportSpecialist - Retail Customer Service and SalesTransform Apple Store visitors into loyal Apple customers. When customers enter the Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.