TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox


The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:


This becomes, in the case of the square root of n, = x2 - n:where f(x)


An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:


There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:


the first two terms of which yield, in the case where f(x) = SQRT(x):


The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:



In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1


The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   long double   num, num2;
   long startTime, endTime, time;
   short i;

   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   while (1);                              // repeat until Quit


PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.

The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.


Community Search:
MacTech Search:

Software Updates via MacUpdate

Microsoft Office 2016 16.11 - Popular pr...
Microsoft Office 2016 - Unmistakably Office, designed for Mac. The new versions of Word, Excel, PowerPoint, Outlook, and OneNote provide the best of both worlds for Mac users - the familiar Office... Read more
Adobe Photoshop CC 2018 19.1.2 - Profess...
Photoshop CC 2018 is available as part of Adobe Creative Cloud for as little as $19.99/month (or $9.99/month if you're a previous Photoshop customer). Adobe Photoshop CC 2018, the industry standard... Read more
Adobe Dreamweaver CC 2018 -...
Dreamweaver CC 2018 is available as part of Adobe Creative Cloud for as little as $19.99/month (or $9.99/month if you're a previous Dreamweaver customer). Adobe Dreamweaver CC 2018 allows you to... Read more
Adobe Flash Player - Plug-in...
Adobe Flash Player is a cross-platform, browser-based application runtime that provides uncompromised viewing of expressive applications, content, and videos across browsers and operating systems.... Read more
Drive Genius 5.2.0 - $79.00
Drive Genius features a comprehensive Malware Scan. Automate your malware protection. Protect your investment from any threat. The Malware Scan is part of the automated DrivePulse utility. DrivePulse... Read more
MegaSeg 6.0.6 - Professional DJ and radi...
MegaSeg is a complete solution for pro audio/video DJ mixing, radio automation, and music scheduling with rock-solid performance and an easy-to-use design. Mix with visual waveforms and Magic... Read more
ffWorks 1.0.7 - Convert multimedia files...
ffWorks (was iFFmpeg), focused on simplicity, brings a fresh approach to the use of FFmpeg, allowing you to create ultra-high-quality movies without the need to write a single line of code on the... Read more
Dash 4.1.5 - Instant search and offline...
Dash is an API documentation browser and code snippet manager. Dash helps you store snippets of code, as well as instantly search and browse documentation for almost any API you might use (for a full... Read more
Evernote 7.0.3 - Create searchable notes...
Evernote allows you to easily capture information in any environment using whatever device or platform you find most convenient, and makes this information accessible and searchable at anytime, from... Read more
jAlbum Pro 15.3 - Organize your digital...
jAlbum Pro has all the features you love in jAlbum, but comes with a commercial license. You can create gorgeous custom photo galleries for the Web without writing a line of code! Beginner-friendly... Read more

Latest Forum Discussions

See All

All the best games on sale for iPhone an...
It might not have been the greatest week for new releases on the App Store, but don't let that get you down, because there are some truly incredible games on sale for iPhone and iPad right now. Seriously, you could buy anything on this list and I... | Read more »
Everything You Need to Know About The Fo...
In just over a week, Epic Games has made a flurry of announcements. First, they revealed that Fortnite—their ultra-popular PUBG competitor—is coming to mobile. This was followed by brief sign-up period for interested beta testers before sending out... | Read more »
The best games that came out for iPhone...
It's not been the best week for games on the App Store. There are a few decent ones here and there, but nothing that's really going to make you throw down what you're doing and run to the nearest WiFi hotspot in order to download it. That's not to... | Read more »
Death Coming (Games)
Death Coming Device: iOS Universal Category: Games Price: $1.99, Version: (iTunes) Description: --- Background Story ---You Died. Pure and simple, but death was not the end. You have become an agent of Death: a... | Read more »
Hints, tips, and tricks for Empires and...
Empires and Puzzles is a slick match-stuff RPG that mixes in a bunch of city-building aspects to keep things fresh. And it's currently the Game of the Day over on the App Store. So, if you're picking it up for the first time today, we thought it'd... | Read more »
What You Need to Know About Sam Barlow’s...
Sam Barlow’s follow up to Her Story is #WarGames, an interactive video series that reimagines the 1983 film WarGames in a more present day context. It’s not exactly a game, but it’s definitely still interesting. Here are the top things you should... | Read more »
Pixel Plex Guide - How to Build Better T...
Pixel Plex is the latest city builder that has come to the App Store, and it takes a pretty different tact than the ones that came before it. Instead of being in charge of your own city by yourself, you have to work together with other players to... | Read more »
Fortnite Will Be Better Than PUBG on Mob...
Before last week, if you asked me which game I prefer between Fortnite Battle Royale and PlayerUnknown’s Battlegrounds (PUBG), I’d choose the latter just about 100% of the time. Now that we know that both games are primed to hit our mobile screens... | Read more »
Siege of Dragonspear (Games)
Siege of Dragonspear 2.5.12 Device: iOS Universal Category: Games Price: $9.99, Version: 2.5.12 (iTunes) Description: Experience the Siege of Dragonspear, an epic Baldur’s Gate tale, filled with with intrigue, magic, and monsters.... | Read more »
7 Wonders Guide - Should You Buy The Lea...
The fantastic mobile version of 7 Wonders just got updated with an expansion that adds Leaders to the game. This new content adds a whole layer of depth to the game, but before you spend $1.99 to buy it blindly, check out this breakdown of exactly... | Read more »

Price Scanner via

B&H drops prices on 15″ MacBook Pros up t...
B&H Photo has dropped prices on new 2017 15″ MacBook Pros, now up to $300 off MSRP and matching Adorama’s price drop yesterday. Shipping is free, and B&H charges sales tax for NY & NJ... Read more
Apple restocks Certified Refurbished 2017 13″...
Apple has restocked Certified Refurbished 2017 13″ 2.3GHz MacBook Pros for $200-$230 off MSRP. A standard Apple one-year warranty is included with each MacBook, models receive new outer cases, and... Read more
13″ Space Gray Touch Bar MacBook Pros on sale...
Adorama has new 2017 13″ Space Gray Touch Bar MacBook Pros on sale for $150 off MSRP. Shipping is free, and Adorama charges sales tax in NY & NJ only: – 13″ 3.1GHz/256GB Space Gray MacBook Pro (... Read more
Best deal of the year on 15″ Apple MacBook Pr...
Adorama has New 2017 15″ MacBook Pros on sale for up to $300 off MSRP. Shipping is free, and Adorama charges sales tax in NJ and NY only: – 15″ 2.8GHz Touch Bar MacBook Pro Space Gray (MPTR2LL/A): $... Read more
Save $100-$150+ on 13″ Touch Bar MacBook Pros...
B&H Photo has 13″ Touch Bar MacBook Pros on sale for $100-$150 off MSRP. Shipping is free, and B&H charges sales tax for NY & NJ residents only: – 13″ 3.1GHz/256GB Space Gray MacBook Pro... Read more
Current deals on 27″ Apple iMacs, models up t...
B&H Photo has 27″ iMacs on sale for up to $150 off MSRP. Shipping is free, and B&H charges sales tax for NY & NJ residents only: – 27″ 3.8GHz iMac (MNED2LL/A): $2149 $150 off MSRP – 27″ 3... Read more
Thursday Deal: 13″ 2.3GHz MacBook Pro for $11...
B&H Photo has the 13″ 2.3GHz/128GB Space Gray MacBook Pro on sale for $100 off MSRP. Shipping is free, and B&H charges sales tax for NY & NJ residents only: – 13-inch 2.3GHz/128GB Space... Read more
How to save $100-$190 on 10″ & 12″ iPad P...
Apple is now offering Certified Refurbished 2017 10″ and 12″ iPad Pros for $100-$190 off MSRP, depending on the model. An Apple one-year warranty is included with each model, and shipping is free: –... Read more
Silver 12″ 1.3GHz MacBook on sale at B&H...
B&H Photo has the 2017 12″ 1.3GHz Silver MacBook on sale for $1399.99 including free shipping plus sales tax for NY & NJ residents only. Their price is $200 off MSRP, and it’s the lowest... Read more
Amazon offers 21″ Apple iMacs for up to $150...
Amazon 21″ iMacs on sale today for $50-$150 off MSRP, depending on the model. Shipping is free: – 21″ 3.4GHz 4K iMac (MNE02LL/A): $1349.99 $150 off MSRP – 21″ 3.0GHz iMac (MNDY2LL/A): $1199 $100 off... Read more

Jobs Board

Data Scientist, *Apple* Ecosystem (AMP Anal...
# Data Scientist, Apple Ecosystem (AMP Analytics) Job Number: 113428728 Santa Clara Valley, California, United States Posted: 24-Jan-2018 Weekly Hours: 40.00 **Job Read more
*Apple* Solutions Consultant - Apple (United...
# Apple Solutions Consultant Job Number: 113523441 Orange, CA, California, United States Posted: 21-Feb-2018 Weekly Hours: 40.00 **Job Summary** Are you passionate Read more
*Apple* Professional Learning Specialist - A...
# Apple Professional Learning Specialist Job Number: 113456892 Englewood, NJ, New Jersey, United States Posted: 02-Feb-2018 Weekly Hours: 40.00 **Job Summary** The Read more
*Apple* Retail Online, Senior Financial Anal...
# Apple Retail Online, Senior Financial Analyst Job Number: 57228835 Santa Clara Valley, California, United States Posted: 12-Feb-2018 Weekly Hours: 40.00 **Job Read more
Site Reliability Engineer, *Apple* Pay - Ap...
# Site Reliability Engineer, Apple Pay Job Number: 113356036 Santa Clara Valley, California, United States Posted: 12-Jan-2018 Weekly Hours: 40.00 **Job Summary** Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.