TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 
AAPL
$100.58
Apple Inc.
+0.01
MSFT
$45.22
Microsoft Corpora
+0.27
GOOG
$583.37
Google Inc.
-1.12

MacTech Search:
Community Search:

Software Updates via MacUpdate

Pages 5.2.2 - Apple's word processo...
Apple Pages is a powerful word processor that gives you everything you need to create documents that look beautiful. And read beautifully. It lets you work seamlessly between Mac and iOS devices. And... Read more
Quicken 2015 2.0.1 - Complete personal f...
The new Quicken 2015 helps you manage all your personal finances in one place, so you can see where you're spending and where you can save. Quicken automatically categorizes your financial... Read more
CleanMyMac 2.2.7 - Delete files that was...
CleanMyMac makes space for the things you love. Sporting a range of ingenious new features, CleanMyMac 2 lets you safely and intelligently scan and clean your entire system, delete large, unused... Read more
MacFamilyTree 7.2.4 - Create and explore...
MacFamilyTree gives genealogy a facelift: it's modern, interactive, incredibly fast, and easy to use. We're convinced that generations of chroniclers would have loved to trade in their genealogy... Read more
Videobox 4.1.1 - Download Flash video th...
Videobox allows you to quickly and easily download Flash video from most all of the popular video sites on the internet. Videobox will convert the video into a native Quicktime format so it's ready... Read more
Web Snapper 3.3.5 - Capture entire Web p...
Web Snapper lets you capture Web pages exactly as they appear in your browser. You can send them to a file as images or vector-based, multi-page PDFs. It captures the whole Web page - eliminating... Read more
Picasa 3.9.138 - Organize, edit, and sha...
Picasa and Picasa Web Albums allows you to organize, edit, and upload your photos to the Web from your computer in quick, simple steps. Arrange your photos into folders and albums and erase their... Read more
Tidy Up 3.0.15.0 - Find duplicate files...
Tidy Up is a complete duplicate finder and disk-tidiness utility. With Tidy Up you can search for duplicate files and packages by the owner application, content, type, creator, extension, time... Read more
Parallels Desktop 10.0 - Run Windows app...
Parallels Desktop is simply the world's bestselling, top-rated, and most trusted solution for running Windows applications on your Mac. With Parallels Desktop for Mac, you can seamlessly run both... Read more
Apple Final Cut Pro X 10.1.3 - Professio...
Apple Final Cut Pro X is a professional video editing solution.Completely redesigned from the ground up, Final Cut Pro adds extraordinary speed, quality, and flexibility to every part of the post-... Read more

Latest Forum Discussions

See All

Pentumble (Games)
Pentumble 1.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0 (iTunes) Description: *** Special Launch Price! 33% Off! *** Pentumble combines the joy of a free-flowing platformer with beautiful graphics and levels... | Read more »
DotEmu Summer Sale Cuts Several Games’ P...
DotEmu Summer Sale Cuts Several Games’ Prices to $0.99 Posted by Ellis Spice on August 21st, 2014 [ permalink ] Universal App - Designed for iPhone and iPad | Read more »
King Announces its Newest “Saga” – Diamo...
King Announces its Newest “Saga” – Diamond Digger Saga Posted by Jessica Fisher on August 21st, 2014 [ permalink ] Diamond Digger Saga, by King Digital Entertainment, promises to be another addictive, adorable game in the “Saga” | Read more »
Spacetime Studios’ Greg Mueller Tells Al...
| Read more »
TourStar Review
TourStar Review By Rob Thomas on August 21st, 2014 Our Rating: :: BORING TOURINGUniversal App - Designed for iPhone and iPad When The Ramones sang “Touring, touring/It’s never boring,” they clearly weren’t referring to TourStar.... | Read more »
Track Your Pregnancy with Glow Nurture
Track Your Pregnancy with Glow Nurture Posted by Jessica Fisher on August 21st, 2014 [ permalink ] iPhone App - Designed for the iPhone, compatible with the iPad | Read more »
Outcast Odyssey – An Interview with the...
At this year’s San Diego Comic-Con, fans got the chance to meet and mingle with several of the artists behind Magic Pixel Games and Namco Bandai’s upcoming card battler, Outcast Odyssey. Considering many of these artists have worked on comics in the... | Read more »
BoxPop Review
BoxPop Review By Jennifer Allen on August 21st, 2014 Our Rating: :: UNTAXING PUZZLESUniversal App - Designed for iPhone and iPad Move from tile to tile to clear the board in this slightly too simple and easy puzzle game.   | Read more »
Breakdown Buddy Review
Breakdown Buddy Review By Jennifer Allen on August 21st, 2014 Our Rating: :: FLAWED CONCEPTiPhone App - Designed for the iPhone, compatible with the iPad There’s a good idea here, but Breakdown Buddy isn’t as useful as it tries to... | Read more »
jamstik, the Guitar You Can Take Everywh...
jamstik, the Guitar You Can Take Everywhere, is Now Available for Purchase Posted by Jessica Fisher on August 21st, 2014 [ permalink ] Zivix has announced that their guitar-in-your-pocket, | Read more »

Price Scanner via MacPrices.net

Pro.Calendar – New Productivity App for iPad...
Austin, Texas based mobile business and productivity app developer LightArrow, Inc. has announced Pro.Calendar, a powerful and intuitive calendar app with eight versatile calendar options including... Read more
SanDisk Ultra II SSD — Supercharge Your Syste...
SanDisk Corporation has announced the new SanDisk Ultra II SSD with enhanced SSD Dashboard. The new drive is designed to deliver a cost-effective and easy upgrade solution for PC owners looking to... Read more
Samsung and Barnes & Noble Introduce New...
Samsung Electronics America and NOOK Media, a subsidiary of Barnes & Noble, Inc. have announced the introduction of the new Samsung Galaxy Tab 4 NOOK, a 7-inch tablet combining Samsung’s leading... Read more
21-inch iMacs on sale for up to $150 off MSRP
B&H Photo has 21″ iMacs on sale for up to $150 off MSRP including free shipping plus NY sales tax only. B&H will also include a free copy of Parallels Desktop software: - 21″ 2.7GHz iMac: $... Read more
27-inch 3.2GHz iMac on sale for $1698, save $...
Abt has the 27″ 3.2GHz iMac on sale for $1698 including free shipping. Their price is $101 off MSRP. Read more
Mac Backup Guru 2.0 Drive Backup/Cloneing Uti...
Mac Backup Guru developer MacDaddy has released Mac Backup Guru 2.0, offering new and enhanced advanced features, such as bootable backups, synchronised volumes and folders, and a Snapshot mode that... Read more
Operate GE’s New Free-Standing KItchen Range...
Think you accidentally left the oven on? Switch it off while on the go. The new free-standing Profile™ Series gas and electric ranges are GE’s second cooking appliances, following their wall oven, to... Read more
Parallels Announces Parallels Desktop 10 for...
The no. 1-selling software for running Windows applications on a Mac becomes an even easier choice for millions of consumers and IT professionals worldwide with the launch of the most powerful... Read more
Apple now offering certified refurbished 2014...
 The Apple Store is now offering Apple Certified Refurbished 2014 MacBook Airs for up to $180 off the cost of new models. An Apple one-year warranty is included with each MacBook, and shipping is... Read more
Best Buy’s College Student Deals: $100 off Ma...
Take an additional $100 off all MacBooks and iMacs, $50 off iPad Airs and iPad minis, at Best Buy Online with their College Students Deals Savings, valid through September 6th. Anyone with a valid .... Read more

Jobs Board

*Apple* Retail - Multiple Positions (US) - A...
Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, you're also the Read more
Position Opening at *Apple* - Apple (United...
**Job Summary** As more and more people discover Apple , they visit our stores seeking ways to incorporate our products into their lives. It's your job, as a Store Read more
Position Opening at *Apple* - Apple (United...
…Summary** As a Specialist, you help create the energy and excitement around Apple products, providing the right solutions and getting products into customers' hands. You Read more
Project Manager, *Apple* Financial Services...
**Job Summary** Apple Financial Services (AFS) offers consumers, businesses and educational institutions ways to finance Apple purchases. We work with national and Read more
*Apple* Solutions Consultant (ASC)- Retail S...
**Job Summary** The ASC is an Apple employee who serves as an Apple brand ambassador and influencer in a Reseller's store. The ASC's role is to grow Apple Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.