TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 
AAPL
$442.14
Apple Inc.
+0.79
MSFT
$34.15
Microsoft Corpora
-0.46
GOOG
$882.79
Google Inc.
-6.63

MacTech Search:
Community Search:

Software Updates via MacUpdate

Evernote 5.1.1 - Create searchable notes...
Evernote allows you to easily capture information in any environment using whatever device or platform you find most convenient, and makes this information accessible and searchable at anytime, from... Read more
SketchUp 13.0.3688 - Create 3D design co...
SketchUp is an easy-to-learn 3D modeling program that enables you to explore the world in 3D. With just a few simple tools, you can create 3D models of houses, sheds, decks, home additions,... Read more
GarageSale 6.6b10 - Create outstanding e...
GarageSale is a slick, full-featured client application for the eBay online auction system. Create and manage your auctions with ease With GarageSale, you can create, edit, track, and manage... Read more
Twitter 2.2.1 - Official Twitter client...
Twitter (was Tweetie) is a Twitter client with a variety of features. Important Note: As of January 2011, AteBit's Tweetie application has been acquired and renamed by Twitter. Version 1.2.8 of the... Read more
SteerMouse 4.1.6 - Powerful third-party...
SteerMouse is an advanced driver for USB and Bluetooth mice. It also supports Apple Mighty Mouse very well. SteerMouse can assign various functions to buttons that Apple's software does not allow,... Read more
Google Chrome 27.0.1453.93 - Modern and...
Google Chrome is a Web browser by Google, created to be a modern platform for Web pages and applications. It utilizes very fast loading of Web pages and has a V8 engine, which is a custom built... Read more
Labels & Addresses 1.6.5 - Powerful...
Labels & Addresses is a home and office tool for printing all sorts of labels, envelopes, inventory labels, and price tags. Merge-printing capability makes the program a great tool for holiday... Read more
Delicious Library 3.0.2 - Import, browse...
Delicious Library allows you to import, browse, and share all your books, movies, music, and video games with Delicious Library. Run your very own library from your home or office using our... Read more
KeyCue 6.5 - Displays all menu shortcut...
KeyCue helps you to use your OS X applications more effectively. Just hold down the Command key for a while - KeyCue comes to help and shows a table of all currently available keyboard shortcuts.... Read more
HoudahSpot 3.7.8 - Advanced front-end fo...
HoudahSpot is a flexible file-search tool based on Apple's powerful Spotlight engine. Keep frequently used files within reach Retrieve the files you didn't know you still had Don't waste time... Read more

Evernote Update Keeps You Notified, Adds...
Evernote Update Keeps You Notified, Adds New Reminders Feature Posted by Andrew Stevens on May 23rd, 2013 [ permalink ] | Read more »
Clear Shakes Up A New Update: Email Your...
Clear Shakes Up A New Update: Email Your Lists Posted by Andrew Stevens on May 23rd, 2013 [ permalink ] iPhone App - Designed for the iPhone, compatible with the iPad | Read more »
Regular Show: Best Park in the Universe...
Regular Show: Best Park in the Universe Review By Carter Dotson on May 23rd, 2013 Our Rating: :: SLACKERSUniversal App - Designed for iPhone and iPad This park has some good ideas, but a lot of work needs to go into it to make it... | Read more »
Angry Birds Space Launches You Into Spac...
Angry Birds Space Launches You Into Space For Free Posted by Andrew Stevens on May 23rd, 2013 [ permalink ] iPhone App - Designed for the iPhone, compatible with the iPad | Read more »
Mailbox Shows Some Tablet Love, Gets Opt...
Mailbox Shows Some Tablet Love, Gets Optimized For iPad Posted by Andrew Stevens on May 23rd, 2013 [ permalink ] iPhone App - Designed for the iPhone, compatible with the iPad | Read more »
Ayopa Games Offers Their Titles For Free...
Ayopa Games Offers Their Titles For Free This Memorial Day Weekend Posted by Andrew Stevens on May 23rd, 2013 [ permalink ] Ayopa Games is celebrating this Mem | Read more »
Greedy Grub Review
Greedy Grub Review By Rob Rich on May 23rd, 2013 Our Rating: :: A CUTE CRAWLUniversal App - Designed for iPhone and iPad Greedy Grub is certainly adorable, but it’s not particularly ground-breaking.   | Read more »
Finger Tied Jr Review
Finger Tied Jr Review By Jennifer Allen on May 23rd, 2013 Our Rating: :: FINGER TWISTING FUNiPhone App - Designed for the iPhone, compatible with the iPad Finger Tied brought Twister-style gaming to the iPad, and Jr does much the... | Read more »
Zynga’s Battlestone – Mobile Hack ‘n’ Sl...
Zynga’s Battlestone – Mobile Hack ‘n’ Slash Arcade Action Posted by Rob LeFebvre on May 23rd, 2013 [ permalink ] Universal App - Designed for iPhone and iPad | Read more »
Developer Spotlight: Infinite Dreams
With its latest title, Can Knockdown 3, recently earning a coveted Editor’s Choice award here, I took the time to learn a bit more about Polish game developer, Infinite Dreams. Who is Infinite Dreams? Based in the Southern Polish city of Gliwice,... | Read more »

Price Scanner via MacPrices.net

Economic Conservatives Defend Apple’s Tax Strategy
Given Apple’s longtime reputation as the particular darling of the liberal lefty end of the spectrum, it’s been facinating to see mostly prominant conservatives rallying to the defense of Apple’s... Read more
Is Apple Losing Its “Cool” Cachet With The Popular...
SMH’s Steve Colquhoun notes that while Apple has again been rated as the world’s top brand this week, a leading social researcher warns the company and its products are losing touch with Generation Y... Read more
New Rugged Smartphone From…. Caterpillar?!
Bullitt Mobile Ltd., global licensee of Cat phones for Caterpillar Inc., has introduced the new Cat B15 smartphone in North America. The Cat B15 is designed to be the most progressive, durable and... Read more
Mac mini on sale for $25 off, free shipping, NY ta...
B&H Photo has the 2.5GHz Mac mini available for $574.98 including free shipping and NY sales tax only. Their price is $25 off MSRP. B&H will include free copies of Parallels Desktop and Bento... Read more
Updated iPad Price Trackers
We’ve updated our iPad Price Tracker and our iPad mini Price Tracker with the latest information on prices and availability from Apple and other resellers. Read more
Take $20 off with Apple refurbished iPod nanos
The Apple Store has Apple Certified Refurbished 16GB iPod nanos available for $129 including free shipping and Apple’s standard one-year warranty. That’s $20, or 13%, off the cost of new nanos. All... Read more
Apple TV (refurbished) available for $85, 14% off
The Apple Store has Apple Certified Refurbished 2012 Apple TVs available for $85 including free shipping. That’s $14 off the cost of new models. Apple’s one-year warranty is standard. Read more
27″ iMacs on sale for $100 off MSRP
Amazon has 27-inch iMacs on sale for $100 off MSRP: - 27″ 3.2GHz iMac: $1899.99 - 27″ 2.9GHz iMac: $1699.98 Shipping is free Read more
Platform Wars: Tablets Triumphant, But Don’t Write...
The Register’s Paul Kunert says it’s finally official – the epic battle of legendary Apple CEO Steve Jobs is finally won, now that he has toppled the PC platform from beyond the grave, in the UK, at... Read more
Apple Tops 100 Most Valuable Global Brands 2013 Su...
MarketingWeek’s Lou Cooper reports that this years BrandZ ranking of the top 100 valuable global brands sees Apple maintain its reign as number one, ahead of Google and IBM in second and third and... Read more

Jobs Board

*Apple* Retail - Manager - Apple Inc. (...
Job Summary Keeping an Apple Store thriving requires a diverse set of leadership skills, and as a Manager, you’re a master of them all. In the store’s fast-paced, Read more
*Apple* Account Executive - CompuCom (U...
Apple Account Executive Job Location US-IL-Des Plaines Posted Date 3/27/2013 Req # 2013-4905 Apply/Socialize: * Apply Now! * Email this opportunity to a friend or Read more
*Apple* - Solution Architect - CompuCom...
Apple - Solution Architect Job Location US-TX-Dallas Posted Date 4/18/2013 Req # 2013-4932 Apply/Socialize: * Apply Now! * Email this opportunity to a friend or Read more
Mac/ *Apple* Specialist Needed - Enterp...
Mac/ Apple Specialist Needed - Enterprise iPad Deployment A prominent Robert Half client is seeking out a Mac/ Apple Specialist to assist with an iPad deployment Read more
Mac/ *Apple* Specialist Needed | Enterp...
Mac/ Apple Specialist Needed | Enterprise iPad Deployment A prominent Robert Half client is seeking out a Mac/ Apple Specialist to assist with an iPad deployment Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.