TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Logic Pro X 10.1.1 - Music creation and...
Apple Logic Pro X is the most advanced version of Logic ever. Sophisticated new tools for professional songwriting, editing, and mixing are built around a modern interface that's designed to get... Read more
VLC Media Player 2.2.0 - Popular multime...
VLC Media Player is a highly portable multimedia player for various audio and video formats (MPEG-1, MPEG-2, MPEG-4, DivX, MP3, OGG, ...) as well as DVDs, VCDs, and various streaming protocols. It... Read more
Sound Studio 4.7.8 - Robust audio record...
Sound Studio lets you easily record and professionally edit audio on your Mac. Easily rip vinyls and digitize cassette tapes, or record lectures and voice memos. Prepare for live shows with live... Read more
LibreOffice 4.4.1.2 - Free, open-source...
LibreOffice is an office suite (word processor, spreadsheet, presentations, drawing tool) compatible with other major office suites. The Document Foundation is coordinating development and... Read more
Freeway Pro 7.0.3 - Drag-and-drop Web de...
Freeway Pro lets you build websites with speed and precision... without writing a line of code! With its user-oriented drag-and-drop interface, Freeway Pro helps you piece together the website of... Read more
Cloud 3.3.0 - File sharing from your men...
Cloud is simple file sharing for the Mac. Drag a file from your Mac to the CloudApp icon in the menubar and we take care of the rest. A link to the file will automatically be copied to your clipboard... Read more
Cyberduck 4.6.5 - FTP and SFTP browser....
Cyberduck is a robust FTP/FTP-TLS/SFTP browser for the Mac whose lack of visual clutter and cleverly intuitive features make it easy to use. Support for external editors and system technologies such... Read more
Firefox 36.0 - Fast, safe Web browser. (...
Firefox for Mac offers a fast, safe Web browsing experience. Browse quickly, securely, and effortlessly. With its industry-leading features, Firefox is the choice of Web development professionals and... Read more
Thunderbird 31.5.0 - Email client from M...
As of July 2012, Thunderbird has transitioned to a new governance model, with new features being developed by the broader free software and open source community, and security fixes and improvements... Read more
VOX 2.4 - Music player that supports man...
VoxIt just sounds better! The beauty is in its simplicity, yet behind the minimal exterior lies a powerful music player with a ton of features & support for all audio formats you should ever need... Read more

Get The Whole Story – Lone Wolf Complete...
Get The Whole Story – Lone Wolf Complete is Now Available and On Sale Posted by Jessica Fisher on February 27th, 2015 [ permalink ] Universal App - Designed for iPhone and iPad | Read more »
Who Wore it Best? The Counting Dead vs....
Like it or not, the “clicker” genre, popularized by cute distractions like Candy Box and Cookie Clicker, seems like it’s here to stay. So Who Wore it Best? takes a look at two recent examples: The Counting Dead and AdVenture Capitalist. | Read more »
Card Crawl, the Mini Deck Building Game,...
Card Crawl, the Mini Deck Building Game, is Coming Soon Posted by Jessica Fisher on February 27th, 2015 [ permalink ] Tinytouchtales and Mexer have announced their new game, | Read more »
Witness an all new puzzle mechanic in Bl...
Well, BlastBall MAX is not one of those games and is bucking trends such as timers, elements of randomness, and tacked-on mechanics in favor of pure puzzle gameplay. When you first boot up the game you’ll see a grid made up of squares that are each... | Read more »
This Princess Has a Dragon and She isn’t...
This Princess Has a Dragon and She isn’t Afraid to Useit. | Read more »
Mecha Showdown Review
Mecha Showdown Review By Lee Hamlet on February 27th, 2015 Our Rating: :: IN A SPINUniversal App - Designed for iPhone and iPad Mecha Showdown replaces traditional buttons with a slot machine mechanic in this robot fighting game,... | Read more »
Reliance Games and Dreamworks Unveil Rea...
Reliance Games and Dreamworks Unveil Real Steel Champions Posted by Ellis Spice on February 27th, 2015 [ permalink ] Reliance Games and Dreamworks have announced that a third game in | Read more »
Sum Idea Review
Sum Idea Review By Jennifer Allen on February 27th, 2015 Our Rating: :: TAXING NUMBERSUniversal App - Designed for iPhone and iPad Sum Idea is a fairly charming but taxing puzzle game.   | Read more »
A New Badland Update Brings Daydream Lev...
A New Badland Update Brings Daydream Levels to Co-Op Posted by Ellis Spice on February 27th, 2015 [ permalink ] Universal App - Designed for iPhone and iPad | Read more »
Slashing Demons Review
Slashing Demons Review By Lee Hamlet on February 27th, 2015 Our Rating: :: IT'S A LONG WAY TO THE TOPUniversal App - Designed for iPhone and iPad Slashing Demons lacks the depth or scope to take it beyond the point of being just... | Read more »

Price Scanner via MacPrices.net

Apple CEO Tim Cook to Deliver 2015 George Was...
Apple CEO Tim Cook will deliver the George Washington University’s Commencement address to GWU grads on May 17, at which time he will also be awarded an honorary doctorate of public service from the... Read more
Apple restocks refurbished Mac minis for up t...
The Apple Store has restocked Apple Certified Refurbished 2014 Mac minis, with models available starting at $419. Apple’s one-year warranty is included with each mini, and shipping is free: - 1.4GHz... Read more
Save up to $50 on iPad Air 2s, NY tax only, f...
 B&H Photo has iPad Air 2s on sale for $50 off MSRP including free shipping plus NY sales tax only: - 16GB iPad Air 2 WiFi: $469.99 $30 off - 64GB iPad Air 2 WiFi: $549 $50 off - 128GB iPad Air 2... Read more
16GB iPad Air 2 on sale for $447, save $52
Walmart has the 16GB iPad Air 2 WiFi on sale for $446.99 on their online store for a limited time. Choose free shipping or free local store pickup (if available). Sale price for online orders only,... Read more
iMacs on sale for up to $205 off MSRP
B&H Photo has 21″ and 27″ iMacs on sale for up to $205 off MSRP including free shipping plus NY sales tax only: - 21″ 1.4GHz iMac: $1029 $70 off - 21″ 2.7GHz iMac: $1199 $100 off - 21″ 2.9GHz... Read more
Apple Takes 89 Percent Share of Global Smartp...
According to the latest research from Strategy Analytics, global smartphone operating profit reached US$21 billion in Q4 2014. The Android operating system captured a record-low 11 percent global... Read more
New Travel Health App “My Travel Health” iOS...
Rochester, Minnesota based Travel Health and Wellness LLC has announced that its new iOS app help safeguard the user’s health when traveling abroad — “My Travel Health” is now available on the Apple... Read more
Sale! MacBook Airs for up to $115 off MSRP
B&H Photo has MacBook Airs on sale for up to $100 off MSRP. Shipping is free, and B&H charges NY sales tax only: - 11″ 128GB MacBook Air: $799 100 off MSRP - 11″ 256GB MacBook Air: $999 $100... Read more
15-inch 2.0GHz Retina MacBook Pro (refurbishe...
The Apple Store has Apple Certified Refurbished previous-generation 15″ 2.0GHz Retina MacBook Pros available for $1489 including free shipping plus Apple’s standard one-year warranty. Their price is... Read more
Wither The iPad mini? End Of The Road Imminen...
AppleDailyReport’s Dennis Sellers predicts that the iPad mini is going to be left to wither on the vine, as it were, and then just allowed to fade away — a casualty of the IPhone 6 Plus and other... Read more

Jobs Board

Sr. Technical Services Consultant, *Apple*...
**Job Summary** Apple Professional Services (APS) has an opening for a senior technical position that contributes to Apple 's efforts for strategic and transactional Read more
Event Director, *Apple* Retail Marketing -...
…This senior level position is responsible for leading and imagining the Apple Retail Team's global engagement strategy and team. Delivering an overarching brand Read more
*Apple* Pay - Site Reliability Engineer - Ap...
**Job Summary** Imagine what you could do here. At Apple , great ideas have a way of becoming great products, services, and customer experiences very quickly. Bring Read more
*Apple* Solutions Consultant - Retail Sales...
**Job Summary** The ASC is an Apple employee who serves as an Apple brand ambassador and influencer in a Reseller's store. The ASC's role is to grow Apple Read more
*Apple* Solutions Consultant - Retail Sales...
**Job Summary** As an Apple Solutions Consultant (ASC) you are the link between our customers and our products. Your role is to drive the Apple business in a retail Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.