TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Microsoft Office 2016 15.31 - Popular pr...
Microsoft Office 2016 - Unmistakably Office, designed for Mac. The new versions of Word, Excel, PowerPoint, Outlook and OneNote provide the best of both worlds for Mac users - the familiar Office... Read more
Spotify 1.0.49.125. - Stream music, crea...
Spotify is a streaming music service that gives you on-demand access to millions of songs. Whether you like driving rock, silky R&B, or grandiose classical music, Spotify's massive catalogue puts... Read more
QuickBooks 16.1.12.1564 R13 - Financial...
QuickBooks helps you manage your business easily and efficiently. Organize your finances all in one place, track money going in and out of your business, and spot areas where you can save. Built for... Read more
Dash 4.0.1 - Instant search and offline...
Dash is an API documentation browser and code snippet manager. Dash helps you store snippets of code, as well as instantly search and browse documentation for almost any API you might use (for a full... Read more
Tinderbox 7.0.0 - Store and organize you...
Tinderbox is a personal content management assistant. It stores your notes, ideas, and plans. It can help you organize and understand them. And Tinderbox helps you share ideas through Web journals... Read more
Apple Remote Desktop Client 3.9 - Client...
Apple Remote Desktop Client is the best way to manage the Mac computers on your network. Distribute software, provide real-time online help to end users, create detailed software and hardware reports... Read more
Sparkle 2.1.1 - $79.99
Sparkle will change your mind if you thought building websites wasn't for you. Sparkle is the intuitive site builder that lets you create sites for your online portfolio, team or band pages, or... Read more
1Password 6.5.5 - Powerful password mana...
1Password is a password manager that uniquely brings you both security and convenience. It is the only program that provides anti-phishing protection and goes beyond password management by adding Web... Read more
WhatRoute 2.0.15 - Geographically trace...
WhatRoute is designed to find the names of all the routers an IP packet passes through on its way from your Mac to a destination host. It also measures the round-trip time from your Mac to the router... Read more
Art Text 3.2.2 - $49.99
Art Text is graphic design software specifically tuned for lettering, typography, text mockups and various artistic text effects. Supplied with a great variety of ready to use styles and materials,... Read more

Tiny Striker: World Football Guide - How...
| Read more »
Good news everyone! Futurama: Worlds of...
Futurama is finding a new home on mobile in TinyCo and Fox Interactive's new game, Futurama: Worlds of Tomorrow. They're really doing it up, bringing on board Futurama creator Matt Groening along with the original cast and writers. TinyCo wants... | Read more »
MUL.MASH.TAB.BA.GAL.GAL (Games)
MUL.MASH.TAB.BA.GAL.GAL 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: ENDLESS UPGRADES. CONSTANT DANGER. ANCIENT WISDOM. BOUNCY BALLS. Launch Sale, 40% OFF for a very limited time!!! MUL.... | Read more »
Dungeon Rushers (Games)
Dungeon Rushers 1.0 Device: iOS Universal Category: Games Price: $4.99, Version: 1.0 (iTunes) Description: Dungeon Rushers is a 2D tactical RPG combining dungeon crawler’s gameplay and turn based fights. Manage your team, loot dusty... | Read more »
Blasty Bubs is a colorful Pinball and Br...
QuickByte Games has another arcade treat in the works -- this time it's a mishmash of brick breaking and Pinball mechanics. It's called Blasty Bubs, and it's a top down brickbreaker that has you slinging balls around a board. [Read more] | Read more »
Corsola and Heracross are the new region...
Generation 2 finally launched in Pokémon GO, unleashing a brand new batch of Pokémon into the wild. Even before the update went live people were speculating on how to catch elusive Pokémon like the legendary "dogs", Unknown, and whether or not... | Read more »
The Warlock of Firetop Mountain (Games)
The Warlock of Firetop Mountain 1.0 Device: iOS Universal Category: Games Price: $4.99, Version: 1.0 (iTunes) Description: An epic adventure through a mysterious mountain filled with monsters, magic and mayhem! “...it looks downright... | Read more »
Fantasy MMORPG MU Origin’s receives a hu...
Developer Webzen are looking to take their highly popular fantasy battler MU Origin to the next level this month, with its most ambitious overhaul yet. The latest update introduces the long sought after Server Arena, new treasure dungeons, and much... | Read more »
RPG Djinn Caster (Games)
RPG Djinn Caster 1.0.0 Device: iOS Universal Category: Games Price: $4.99, Version: 1.0.0 (iTunes) Description: SPECIAL PRICE 38% OFF(USD 7.99 -> USD 4.99)!!!A Fantasy Action RPG of far foreign lands! Summon the Djinns and rise to... | Read more »
Alto's Odyssey gets its first trail...
There's finally video evidence of Alto's Odyssey, the follow up to the 2015 App Store hit, Alto's Adventure. It looks just as soothing and atmospheric as Alto's last outing, but this time players will be journeying to the desert. Whereas Alto's... | Read more »

Price Scanner via MacPrices.net

Apple’s New iPad Ads Don’t Address Pro Users’...
Apple launched a new tranche of iPad Pro TV ads last week addressing actual queries and challenges from the Twitterverse, albeit using actors for the visuals. That’s great. As an iPad fan and heavy... Read more
Free Verbum Catholic Bible Study App For iOS
The Verbum mobile app runs on Logos’ powerful Bible software and is an advanced resource for mobile Catholic study. The Verbum app surrounds the Bible with the Tradition. Verbum comes with 15 free... Read more
27-inch Apple iMacs on sale for up to $200 of...
B&H Photo has 27″ Apple iMacs on sale for up to $200 off MSRP, each including free shipping plus NY sales tax only: - 27″ 3.3GHz iMac 5K: $2099.99 $200 off MSRP - 27″ 3.2GHz/1TB Fusion iMac 5K: $... Read more
15-inch 2.2GHz Retina MacBook Pro on sale for...
Amazon has 2015 15″ 2.2GHz Retina MacBook Pros (MJLQ2LL/A) available for $1849.99 including free shipping. Apple charges $1999 for this model, so Amazon’s price is represents a $150 savings. Read more
Apple refurbished iPad Air 2s available start...
Apple has Certified Refurbished iPad Air 2 WiFis available for starting at $319 including free shipping. A standard Apple one-year warranty is included: - 16GB iPad Air 2 WiFi: $319 $60 off original... Read more
Apple refurbished iPad Pros available for up...
Apple has Certified Refurbished 9″ and 12″ Apple iPad Pros available for up to $160 off the cost of new iPads. An Apple one-year warranty is included with each model, and shipping is free: - 32GB 9″... Read more
Apple restocks refurbished 2015 and 2016 13-i...
Apple has Certified Refurbished 2015 and 2016 13″ MacBook Airs available starting at $759. An Apple one-year warranty is included with each MacBook, and shipping is free: - 2016 13″ 1.6GHz/8GB/128GB... Read more
13-inch 2.5GHz MacBook Pro (Apple refurbished...
Apple has Certified Refurbished 13″ 2.5GHz MacBook Pros (MD101LL/A) available for $829, or $270 off original MSRP. Apple’s one-year warranty is standard, and shipping is free: - 13″ 2.5GHz MacBook... Read more
QuickerTek Announces 5TB Apple AC AirPort Tim...
QuickerTek Inc. has announced their new 5TB hard drive upgrade for Apple’s AC AirPort Time Capsule. By customer request, this upgrade also features six external antennas and offers the highest... Read more
Apple Certified Refurbished iMacs available f...
Apple has Certified Refurbished 2015 21″ & 27″ iMacs available for up to $350 off MSRP. Apple’s one-year warranty is standard, and shipping is free. The following models are available: - 21″ 3.... Read more

Jobs Board

*Apple* Retail - Multiple Positions - Apple,...
Job Description: Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
Manager *Apple* Systems Administration - Pu...
Req ID 3315BR Position Title Manager, Apple Systems Administration Job Description The Manager of Apple Systems Administration oversees the administration and Read more
*Apple* Retail - Multiple Positions - Apple,...
Job Description: Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
Manager *Apple* Systems Administration - Pu...
Req ID 3315BR Position Title Manager, Apple Systems Administration Job Description The Manager of Apple Systems Administration oversees the administration and Read more
*Apple* Technician - nfrastructure (United S...
Let’s Work Together Apple Technician This position is based in Portland, ME Life at nfrastructure At nfrastructure, we understand that our success results from our Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.