TweetFollow Us on Twitter

June 94 - BALANCE OF POWER

BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS

[IMAGE 055-057_Balance_of_Power1.GIF]

When you convert your applications to native PowerPC code, they run lightning fast. To get the most out of RISC processors, however, you need to pay close attention to your code structure and execution. Fast code is no longer measured solely by an instruction timing table. The Power PC 601 processor includes pipelining, multi-issue and speculative execution, branch prediction, and a set associative cache. All these things make it hard to know what code will run fastest on a Power Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good optimizing compiler to help you. In this column I'll pass on some of what I've learned about tuning Power PC code. There are gotchas and coding habits to avoid, and there are techniques for squeezing the most from your speed-critical native code. For a good introduction to RISC pipelining and related concepts that appear in this column, see "Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every machine clock cycle, but RISC processors can do this only in the best of circumstances. At their worst they're as slow as CISC processors. The following loop, for example, averages only one calculation every 2.8 cycles:

float a[], b[], c[], d, e;
for (i=0; i < gArraySize; i++) {
  e = b[i] + c[i] / d;
  a[i] = MySubroutine(b[i], e);
}

By restructuring the code and using other techniques from this column, you can make significant improvements. This next loop generates the same result, yet averages one calculation every 1.9 cycles -- about 50% faster.

reciprocalD = 1 / d;
for (i=0; i < gArraySize; i+=2) {
  float result, localB, localC, localE;
  float result2, localB2, localC2, localE2;

  localB = b[i];
  localC = c[i];
  localB2 = b[i+1];
  localC2 = c[i+1];

  localE = localB + (localC * reciprocalD);
  localE2 = localB2 + (localC2 * reciprocalD);
  InlineSubroutine(&result, localB, localE);
  InlineSubroutine(&result2, localB2, localE2);

  a[i] = result;
  a[i+1] = result2;
}

The rest of this column explains the techniques I just used for that speed gain. They include expanding loops, scoping local variables, using inline routines, and using faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its point of view. You should understand how it looks at your code and what assumptions and optimizations it's allowed to make. The more you empathize with your compiler, the more you'll recognize opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code line by line usually isn't optimal, because the processor stalls to wait for dependent instructions. The compiler tries to move instr uctions that are independent into the stall points. For example, consider this code:

first = input * numerator;
second = first / denominator;
output = second + adjustment;

Each line depends on the previous line's result, and the compiler will be hard pressed to keep the pipeline full of useful work. This simple example could cause 46 stalled cycles on the PowerPC 601, so the compiler will look at other nearby code for independent instructions to move into the stall points.

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance in several ways. Loop expanding is one of the simplest methods. The idea is to perform more than one independent operation in a loop, so that the compiler can reorder more work in the pipeline and thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

float a[], b[], c[], d;
for (i=0; i < multipleOfThree; i++) {
  a[i] = b[i] + c[i] * d;
}

If we know the data always occurs in certain sized increments, we can do more steps in each iteration, as in the following:

for (i=0; i < multipleOfThree; i+=3) {
  a[i] = b[i] + c[i] * d;
  a[i+1] = b[i+1] + c[i+1] * d;
  a[i+2] = b[i+2] + c[i+2] * d;
}

On a CISC processor the second loop wouldn't be much faster, but on the Power PC processor the second loop is twice as fast as the first. This is because the compiler can schedule independent instructions to keep the pipeline constantly moving. (If the data doesn't occur in nice increments, you can still expand the loop; just add a small loop at the end to handle the extra iterations.)Be careful not to expand a loop too much, though. Very large loops won't fit in the cache, causing cache misses for each iteration. In addition, the larger a loop gets, the less work can be done entirely in registers. Expand too much and the compiler will have to use memory  to store intermediate results, outweighing your marginal gains. Besides, you get the biggest gains from the first few expansions.

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the PowerPC chip -- 32 general registers and 32 floating-point registers. By having so many, the processor can often avoid slow memory operations. Your compiler will take advantage of this when it can, but you can help it by carefully scoping your variables and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler examines the scope of each variable when it schedules registers, and your code can provide valuable information about the usage of each variable. Here's an example:

for (i=0; i < gArraySize; i++) {
  a[i] = MyFirstRoutine(b[i], c[i]);
  b[i] = MySecondRoutine(a[i], c[i]);
} 

In this loop, the global variable gArraySize is scoped for the whole program. Because we call a subroutine in the loop, the compiler can't tell if gArraySize will change during each iteration. Since the subroutine might modify gArraySize, the compiler has to be conservative. It will reload gArraySize from memory on every iteration, and it won't optimize the loop any further. This is wastefully slow.

On the other hand, if we use a local  variable, we tell the compiler that gArraySize and c[i] won't be modified and that it's all right to just keep them handy in registers. In addition, we can store data as temporary variables scoped only within the loop. This tells the compiler how we intend to use the data, so that the compiler can use free registers and discard them after the loop. Here's what this would look like:

arraySize = gArraySize;
for (i=0; i < arraySize; i++) {
  float localC;
  localC = c[i];
  a[i] = MyFirstRoutine(b[i], localC);
  b[i] = MySecondRoutine(a[i], localC);
} 

These minor changes give the compiler more information about the data, in this instance accelerating the resulting code by 25%.

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated dereferences and typecasting, chances are the object code has wasteful memory instructions and inefficient register usage. A great compiler might optimize well anyway, but don't count on it. Judicious use of temporary variables (as mentioned above) will help the compiler understand exactly what you're doing -- plus your code will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles on the Macintosh. Code often contains double memory dereferences, which is important when memory can move. But when you can guarantee that memory won't  move, use a local pointer, so that you only dereference a handle once. This saves load instructions and allows fur ther optimizations. Casting data types is usually a free operation -- you're just telling the compiler that you know you're copying seemingly incompatible data. But it's not  free if the data types have different bit sizes, which adds conversion instructions. Again, avoid this by using local variables for the commonly casted data.

I've heard many times that branches are "free" on the PowerPC processor. It's true that often the pipeline can keep moving even though a branch is encountered, because the branch execution unit will try to resolve branches very early in the pipeline or will predict the direction of the branch. Still, the more subroutines you have, the less your compiler will be able to reorder and intelligently schedule instructions. Keep speed-critical code together, so that more of it can be pipelined and the compiler can schedule your registers better. Use inline routines for short operations, as I did in the improved version of the first example loop in this column.

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know about. Some are processor model specific. For example, the PowerPC 601 has 32K of cache, while the 603 has 16K split evenly into an instruction cache and a data cache. But in general you should know about floating-point performance and the virtues of memory alignment.

Floating-point multiplication is wicked fast -- up to nine times  the speed of integer multiplication. Use floating-point multiplication if you can. Floating-point division takes 17 times as long, so when possible multiply by a reciprocal instead of dividing.

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to unaligned data stall while the processor loads different words and then shifts and splices them. For example, be sure to align floating-point data to 64-bit boundaries, or you'll stall for four cycles while the processor loads 32-bit halves with two 64-bit accesses.

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about tweaking its performance at all. For your speed-critical code, though, these tips I've given you can make the difference between "too slow" and "fast enough."

RECOMMENDED READING

  • High-Performance Computing  by Kevin Dowd (O'Reilly & Associates, Inc., 1993).
  • High-Performance Computer Architecture  by Harold S. Stone (Addison-Wesley, 1993).
  • PowerPC 601 RISC Microprocessor User's Manual (Motorola, 1993).

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's been repeatedly thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle. Fixing engine stalls, poor timing, and rough starts proved difficult, but he was recently rewarded with the guttural purr of a well-tuned Harley. *

Code examples were compiled with the PPCC compiler using the speed optimization option, and then run on a Power Macintosh 6100/66 for profiling. A PowerPC 601 microsecond timing library is provided on this issue's CD. *

 
AAPL
$105.22
Apple Inc.
+0.39
MSFT
$46.13
Microsoft Corpora
+1.11
GOOG
$539.78
Google Inc.
-4.20

MacTech Search:
Community Search:

Software Updates via MacUpdate

OS X Server 4.0 - For OS X 10.10 Yosemit...
Designed for OS X and iOS devices, OS X Server makes it easy to share files, schedule meetings, synchronize contacts, develop software, host your own website, publish wikis, configure Mac, iPhone,... Read more
TotalFinder 1.6.12 - Adds tabs, hotkeys,...
TotalFinder is a universally acclaimed navigational companion for your Mac. Enhance your Mac's Finder with features so smart and convenient, you won't believe you ever lived without them. Tab-based... Read more
BusyCal 2.6.3 - Powerful calendar app wi...
BusyCal is an award-winning desktop calendar that combines personal productivity features for individuals with powerful calendar sharing capabilities for families and workgroups. BusyCal's unique... Read more
calibre 2.7 - Complete e-library managem...
Calibre is a complete e-book library manager. Organize your collection, convert your books to multiple formats, and sync with all of your devices. Let Calibre be your multi-tasking digital... Read more
Skitch 2.7.3 - Take screenshots, annotat...
With Skitch, taking, annotating, and sharing screenshots or images is as fun as it is simple.Communicate and collaborate with images using Skitch and its intuitive, engaging drawing and annotating... Read more
Delicious Library 3.3.2 - Import, browse...
Delicious Library allows you to import, browse, and share all your books, movies, music, and video games with Delicious Library. Run your very own library from your home or office using our... Read more
Art Text 2.4.8 - Create high quality hea...
Art Text is an OS X application for creating high quality textual graphics, headings, logos, icons, Web site elements, and buttons. Thanks to multi-layer support, creating complex graphics is no... Read more
Live Interior 3D Pro 2.9.6 - Powerful an...
Live Interior 3D Pro is a powerful yet very intuitive interior designing application. View Video Tutorials It has every feature of Live Interior 3D Standard, plus some exclusive ones: Create multi... Read more
The Hit List 1.1.7 - Advanced reminder a...
The Hit List manages the daily chaos of your modern life. It's easy to learn - it's as easy as making lists. And it's powerful enough to let you plan, then forget, then act when the time is right.... Read more
jAlbum Pro 12.2.4 - Organize your digita...
jAlbum Pro has all the features you love in jAlbum, but comes with a commercial license. With jAlbum, you can create gorgeous custom photo galleries for the Web without writing a line of code!... Read more

Latest Forum Discussions

See All

Rami Ismail Opens Up distribute​() for D...
Rami Ismail Opens Up distribute​() for Developers Posted by Jessica Fisher on October 24th, 2014 [ permalink ] Rami Ismail, Chief Executive of Business and Development at indie game studio | Read more »
Great Hitman GO Goes on Sale and Gets Ne...
Great Hitman GO Goes on Sale and Gets New Update – Say That Three Times Fast Posted by Jessica Fisher on October 24th, 2014 [ permalink ] | Read more »
Rival Stars Basketball Review
Rival Stars Basketball Review By Jennifer Allen on October 24th, 2014 Our Rating: :: RESTRICTIVE BUT FUNUniversal App - Designed for iPhone and iPad Rival Stars Basketball is a fun mixture of basketball and card collecting but its... | Read more »
Rubicon Development Makes Over a Dozen o...
Rubicon Development Makes Over a Dozen of Their Games Free For This Weekend Only Posted by Jessica Fisher on October 24th, 2014 [ permalink ] | Read more »
I Am Dolphin Review
I Am Dolphin Review By Jennifer Allen on October 24th, 2014 Our Rating: :: NEARLY FIN-TASTICUniversal App - Designed for iPhone and iPad Swim around and eat nearly everything that moves in I Am Dolphin, a fun Ecco-ish kind of game... | Read more »
nPlayer looks to be the ultimate choice...
Developed by Newin Inc, nPlayer may seem like your standard video player – but is aiming to be the best in its field by providing high quality video play performance and support for a huge number of video formats and codecs. User reviews include... | Read more »
Fighting Fantasy: Caverns of the Snow Wi...
Fighting Fantasy: Caverns of the Snow Witch Review By Jennifer Allen on October 24th, 2014 Our Rating: :: CLASSY STORYTELLINGUniversal App - Designed for iPhone and iPad Fighting Fantasy: Caverns of the Snow Witch is a sterling... | Read more »
A Few Days Left (Games)
A Few Days Left 1.01 Device: iOS Universal Category: Games Price: $3.99, Version: 1.01 (iTunes) Description: Screenshots are in compliance to App Store's 4+ age rating! Please see App Preview for real game play! **Important: Make... | Read more »
Toca Boo (Education)
Toca Boo 1.0.2 Device: iOS Universal Category: Education Price: $2.99, Version: 1.0.2 (iTunes) Description: BOO! Did I scare you!? My name is Bonnie and my family loves to spook! Do you want to scare them back? Follow me and I'll... | Read more »
Intuon (Games)
Intuon 1.1 Device: iOS Universal Category: Games Price: $.99, Version: 1.1 (iTunes) Description: Join the battle with your intuition in a new hardcore game Intuon! How well do you trust your intuition? Can you find a needle in a... | Read more »

Price Scanner via MacPrices.net

Weekend sale: 13-inch 128GB MacBook Air for $...
Best Buy has the 2014 13-inch 1.4GHz 128GB MacBook Air on sale for $849.99, or $150 off MSRP, on their online store. Choose free home shipping or free local store pickup (if available). Price valid... Read more
Nimbus Note Cross=Platform Notes Utility
Nimbus Note will make sure you never forget or lose your valuable data again. Create and edit notes, save web pages, screenshots and any other type of data – and share it all with your friends and... Read more
NewerTech’s Snuglet Makes MagSafe 2 Power Con...
NewerTech has introduced the Snuglet, a precision-manufactured ring designed to sit inside your MagSafe 2 connector port, providing a more snug fit to prevent your power cable from unintentional... Read more
Apple Planning To Sacrifice Gross Margins To...
Digitimes Research’s Jim Hsiao says its analysts believe Apple is planning to sacrifice its gross margins to save its tablet business, which has recently fallen into decline. They project that Apple’... Read more
Who’s On Now? – First Instant-Connect Search...
It’s nighttime and your car has broken down on the side of the highway. You need a tow truck right away, so you open an app on your iPhone, search for the closest tow truck and send an instant... Read more
13-inch 2.5GHz MacBook Pro on sale for $949,...
Best Buy has the 13″ 2.5GHz MacBook Pro available for $949.99 on their online store. Choose free shipping or free instant local store pickup (if available). Their price is $150 off MSRP. Price is... Read more
Save up to $125 on Retina MacBook Pros
B&H Photo has the new 2014 13″ and 15″ Retina MacBook Pros on sale for up to $125 off MSRP. Shipping is free, and B&H charges NY sales tax only. They’ll also include free copies of Parallels... Read more
Apple refurbished Time Capsules available sta...
The Apple Store has certified refurbished Time Capsules available for up to $60 off MSRP. Apple’s one-year warranty is included with each Time Capsule, and shipping is free: - 2TB Time Capsule: $255... Read more
Textilus New Word, Notes and PDF Processor fo...
Textilus is new word-crunching, notes, and PDF processor designed exclusively for the iPad. I haven’t had time to thoroughly check it out yet, but it looks great and early reviews are positive.... Read more
WD My Passport Pro Bus-Powered Thunderbolt RA...
WD’s My Passport Pro RAID solution is powered by an integrated Thunderbolt cable for true portability and speeds as high as 233 MB/s. HighlightsOverviewSpecifications Transfer, Back Up And Edit In... Read more

Jobs Board

*Apple* Solutions Consultant - Apple Inc. (U...
…important role that the ASC serves is that of providing an excellent Apple Customer Experience. Responsibilities include: * Promoting Apple products and solutions Read more
Senior Event Manager, *Apple* Retail Market...
…This senior level position is responsible for leading and imagining the Apple Retail Team's global event strategy. Delivering an overarching brand story; in-store, Read more
*Apple* Solutions Consultant (ASC) - Apple (...
**Job Summary** The ASC is an Apple employee who serves as an Apple brand ambassador and influencer in a Reseller's store. The ASC's role is to grow Apple Read more
Project Manager / Business Analyst, WW *Appl...
…a senior project manager / business analyst to work within our Worldwide Apple Fulfillment Operations and the Business Process Re-engineering team. This role will work Read more
*Apple* Retail - Multiple Positions (US) - A...
Job Description: Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.