TweetFollow Us on Twitter

June 94 - BALANCE OF POWER

BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS

[IMAGE 055-057_Balance_of_Power1.GIF]

When you convert your applications to native PowerPC code, they run lightning fast. To get the most out of RISC processors, however, you need to pay close attention to your code structure and execution. Fast code is no longer measured solely by an instruction timing table. The Power PC 601 processor includes pipelining, multi-issue and speculative execution, branch prediction, and a set associative cache. All these things make it hard to know what code will run fastest on a Power Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good optimizing compiler to help you. In this column I'll pass on some of what I've learned about tuning Power PC code. There are gotchas and coding habits to avoid, and there are techniques for squeezing the most from your speed-critical native code. For a good introduction to RISC pipelining and related concepts that appear in this column, see "Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every machine clock cycle, but RISC processors can do this only in the best of circumstances. At their worst they're as slow as CISC processors. The following loop, for example, averages only one calculation every 2.8 cycles:

float a[], b[], c[], d, e;
for (i=0; i < gArraySize; i++) {
  e = b[i] + c[i] / d;
  a[i] = MySubroutine(b[i], e);
}

By restructuring the code and using other techniques from this column, you can make significant improvements. This next loop generates the same result, yet averages one calculation every 1.9 cycles -- about 50% faster.

reciprocalD = 1 / d;
for (i=0; i < gArraySize; i+=2) {
  float result, localB, localC, localE;
  float result2, localB2, localC2, localE2;

  localB = b[i];
  localC = c[i];
  localB2 = b[i+1];
  localC2 = c[i+1];

  localE = localB + (localC * reciprocalD);
  localE2 = localB2 + (localC2 * reciprocalD);
  InlineSubroutine(&result, localB, localE);
  InlineSubroutine(&result2, localB2, localE2);

  a[i] = result;
  a[i+1] = result2;
}

The rest of this column explains the techniques I just used for that speed gain. They include expanding loops, scoping local variables, using inline routines, and using faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its point of view. You should understand how it looks at your code and what assumptions and optimizations it's allowed to make. The more you empathize with your compiler, the more you'll recognize opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code line by line usually isn't optimal, because the processor stalls to wait for dependent instructions. The compiler tries to move instr uctions that are independent into the stall points. For example, consider this code:

first = input * numerator;
second = first / denominator;
output = second + adjustment;

Each line depends on the previous line's result, and the compiler will be hard pressed to keep the pipeline full of useful work. This simple example could cause 46 stalled cycles on the PowerPC 601, so the compiler will look at other nearby code for independent instructions to move into the stall points.

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance in several ways. Loop expanding is one of the simplest methods. The idea is to perform more than one independent operation in a loop, so that the compiler can reorder more work in the pipeline and thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

float a[], b[], c[], d;
for (i=0; i < multipleOfThree; i++) {
  a[i] = b[i] + c[i] * d;
}

If we know the data always occurs in certain sized increments, we can do more steps in each iteration, as in the following:

for (i=0; i < multipleOfThree; i+=3) {
  a[i] = b[i] + c[i] * d;
  a[i+1] = b[i+1] + c[i+1] * d;
  a[i+2] = b[i+2] + c[i+2] * d;
}

On a CISC processor the second loop wouldn't be much faster, but on the Power PC processor the second loop is twice as fast as the first. This is because the compiler can schedule independent instructions to keep the pipeline constantly moving. (If the data doesn't occur in nice increments, you can still expand the loop; just add a small loop at the end to handle the extra iterations.)Be careful not to expand a loop too much, though. Very large loops won't fit in the cache, causing cache misses for each iteration. In addition, the larger a loop gets, the less work can be done entirely in registers. Expand too much and the compiler will have to use memory  to store intermediate results, outweighing your marginal gains. Besides, you get the biggest gains from the first few expansions.

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the PowerPC chip -- 32 general registers and 32 floating-point registers. By having so many, the processor can often avoid slow memory operations. Your compiler will take advantage of this when it can, but you can help it by carefully scoping your variables and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler examines the scope of each variable when it schedules registers, and your code can provide valuable information about the usage of each variable. Here's an example:

for (i=0; i < gArraySize; i++) {
  a[i] = MyFirstRoutine(b[i], c[i]);
  b[i] = MySecondRoutine(a[i], c[i]);
} 

In this loop, the global variable gArraySize is scoped for the whole program. Because we call a subroutine in the loop, the compiler can't tell if gArraySize will change during each iteration. Since the subroutine might modify gArraySize, the compiler has to be conservative. It will reload gArraySize from memory on every iteration, and it won't optimize the loop any further. This is wastefully slow.

On the other hand, if we use a local  variable, we tell the compiler that gArraySize and c[i] won't be modified and that it's all right to just keep them handy in registers. In addition, we can store data as temporary variables scoped only within the loop. This tells the compiler how we intend to use the data, so that the compiler can use free registers and discard them after the loop. Here's what this would look like:

arraySize = gArraySize;
for (i=0; i < arraySize; i++) {
  float localC;
  localC = c[i];
  a[i] = MyFirstRoutine(b[i], localC);
  b[i] = MySecondRoutine(a[i], localC);
} 

These minor changes give the compiler more information about the data, in this instance accelerating the resulting code by 25%.

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated dereferences and typecasting, chances are the object code has wasteful memory instructions and inefficient register usage. A great compiler might optimize well anyway, but don't count on it. Judicious use of temporary variables (as mentioned above) will help the compiler understand exactly what you're doing -- plus your code will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles on the Macintosh. Code often contains double memory dereferences, which is important when memory can move. But when you can guarantee that memory won't  move, use a local pointer, so that you only dereference a handle once. This saves load instructions and allows fur ther optimizations. Casting data types is usually a free operation -- you're just telling the compiler that you know you're copying seemingly incompatible data. But it's not  free if the data types have different bit sizes, which adds conversion instructions. Again, avoid this by using local variables for the commonly casted data.

I've heard many times that branches are "free" on the PowerPC processor. It's true that often the pipeline can keep moving even though a branch is encountered, because the branch execution unit will try to resolve branches very early in the pipeline or will predict the direction of the branch. Still, the more subroutines you have, the less your compiler will be able to reorder and intelligently schedule instructions. Keep speed-critical code together, so that more of it can be pipelined and the compiler can schedule your registers better. Use inline routines for short operations, as I did in the improved version of the first example loop in this column.

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know about. Some are processor model specific. For example, the PowerPC 601 has 32K of cache, while the 603 has 16K split evenly into an instruction cache and a data cache. But in general you should know about floating-point performance and the virtues of memory alignment.

Floating-point multiplication is wicked fast -- up to nine times  the speed of integer multiplication. Use floating-point multiplication if you can. Floating-point division takes 17 times as long, so when possible multiply by a reciprocal instead of dividing.

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to unaligned data stall while the processor loads different words and then shifts and splices them. For example, be sure to align floating-point data to 64-bit boundaries, or you'll stall for four cycles while the processor loads 32-bit halves with two 64-bit accesses.

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about tweaking its performance at all. For your speed-critical code, though, these tips I've given you can make the difference between "too slow" and "fast enough."

RECOMMENDED READING

  • High-Performance Computing  by Kevin Dowd (O'Reilly & Associates, Inc., 1993).
  • High-Performance Computer Architecture  by Harold S. Stone (Addison-Wesley, 1993).
  • PowerPC 601 RISC Microprocessor User's Manual (Motorola, 1993).

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's been repeatedly thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle. Fixing engine stalls, poor timing, and rough starts proved difficult, but he was recently rewarded with the guttural purr of a well-tuned Harley. *

Code examples were compiled with the PPCC compiler using the speed optimization option, and then run on a Power Macintosh 6100/66 for profiling. A PowerPC 601 microsecond timing library is provided on this issue's CD. *

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

FileZilla 3.23.0.2 - Fast and reliable F...
FileZilla (ported from Windows) is a fast and reliable FTP client and server with lots of useful features and an intuitive interface. Version 3.23.0.2: Bug Fixes and Minor Changes Speed up icon... Read more
PDFpen 8.3 - $74.95
PDFpen allows users to easily edit PDF's. Add text, images and signatures. Fill out PDF forms. Merge or split PDF documents. Reorder and delete pages. Even correct text and edit graphics! Features... Read more
TunnelBear 3.0.8 - Subscription-based pr...
TunnelBear is a subscription-based virtual private network (VPN) service and companion app, enabling you to browse the internet privately and securely. Features Browse privately - Secure your data... Read more
Safari Technology Preview 10.1 - The new...
Safari Technology Preview contains the most recent additions and improvements to WebKit and the latest advances in Safari web technologies. And once installed, you will receive notifications of... Read more
Ableton Live 9.7.1 - Record music using...
Ableton Live lets you create and record music on your Mac. Use digital instruments, pre-recorded sounds, and sampled loops to arrange, produce, and perform your music like never before. Ableton Live... Read more
BetterTouchTool 1.963 - Customize Multi-...
BetterTouchTool adds many new, fully customizable gestures to the Magic Mouse, Multi-Touch MacBook trackpad, and Magic Trackpad. These gestures are customizable: Magic Mouse: Pinch in / out (zoom... Read more
NeoFinder 7.0 - Catalog your external me...
NeoFinder (formerly CDFinder) rapidly organizes your data, either on external or internal disks, or any other volumes. It catalogs all your data, so you stay in control of your data archive or disk... Read more
Coda 2.6 - One-window Web development su...
Coda is a powerful Web editor that puts everything in one place. An editor. Terminal. CSS. Files. With Coda 2, we went beyond expectations. With loads of new, much-requested features, a few surprises... Read more
PDFpenPro 8.3 - $124.95
PDFpenPro allows users to edit PDF's easily. Add text, images and signatures. Fill out PDF forms. Merge or split PDF documents. Reorder and delete pages. Create fillable forms and tables of content... Read more
File Juicer 4.51 - Extract images, video...
File Juicer is a drag-and-drop can opener and data archaeologist. Its specialty is to find and extract images, video, audio, or text from files which are hard to open in other ways. In computer... Read more

Latest Forum Discussions

See All

PINE GROVE (Games)
PINE GROVE 1.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0 (iTunes) Description: A pine grove where there are no footsteps of people due to continuous missing cases. The case is still unsolved and nothing has... | Read more »
Niantic teases new Pokémon announcement...
After rumors started swirling yesterday, it turns out there is an official Pokémon GO update on its way. We’ll find out what’s in store for us and our growing Pokémon collections tomorrow during the Starbucks event, but Niantic will be revealing... | Read more »
3 reasons why Nicki Minaj: The Empire is...
Nicki Minaj is as business-savvy as she is musically talented and she’s proved that by launching her own game. Designed by Glu, purveyors of other fine celebrity games like cult favorite Kim Kardashian: Hollywood, Nicki Minaj: The Empire launched... | Read more »
Clash of Clans is getting its own animat...
Riding on its unending wave of fame and success, Clash of Clans is getting an animated web series based on its Clash-A-Rama animated shorts.As opposed to the current shorts' 60 second run time, the new and improved Clash-A-Rama will be comprised of... | Read more »
Leaks hint at Pokémon GO and Starbucks C...
Leaked images from a hub for Starbucks employees suggests that a big Pokémon GO event with the coffee giant could begin this very week. The images appeared on Reddit and hint at some exciting new things to come for Niantic's smash hit game. | Read more »
Silent Depth Submarine Simulation (Game...
Silent Depth Submarine Simulation 1.0 Device: iOS Universal Category: Games Price: $7.99, Version: 1.0 (iTunes) Description: | Read more »
Enneas Saga lets you lead your own demon...
Defend the land of Enneas Continent from the forces of evil in the new fantasy MMORPG from Lyto Mobi: Enneas Saga. Can’t wait? No problem. It’s available to download now on Android devices. | Read more »
Great zombie games in the spirit of Dead...
Dead Rising 4 arrives tomorrow, giving enthusiasts a fresh chance to take selfies with zombies and get up to other ridiculous end-of-the-world shenanigans. To really get into the spirit of things, we've gone and gathered the best zombie games that... | Read more »
Amateur Surgeon 4 Guide: Advanced tips a...
Amateur Surgeon 4 is still tackling the competition at the top of the App Store charts, so if you haven't tried it out yet, you should probably do that right away. If you've been at it for a while, though, perhaps you're ready to start expanding... | Read more »
Amateur Surgeon 4 Guide: Become the worl...
It's time to wield your trusty pizza cutter again, as Amateur Surgeon has returned with a whole fresh set of challenges (and some old, familiar ones, too). Starting anew isn't easy, especially when all you have at your disposal is a lighter, the... | Read more »

Price Scanner via MacPrices.net

Back in stock: Apple refurbished Mac minis fr...
Apple has Certified Refurbished Mac minis available starting at $419. Apple’s one-year warranty is included with each mini, and shipping is free: - 1.4GHz Mac mini: $419 $80 off MSRP - 2.6GHz Mac... Read more
Twenty-Five Years Of Apple Laptops – A person...
Among many other things, the often tumultuous 16th year of the new century marked the 25th anniversary of Apple laptop computers, not counting the optimistically named 16-pound Mac Portable of 1989.... Read more
Landlordy iOS App Adds Support For Appliances...
Riga, Latvia based E-protect SIA is releasing major update (version 1.8) to its Landlordy app for managing rental business financials on the go. Landlordy is iPhone and iPad app designed for self-... Read more
Holiday sale, Apple iMacs for up to $200 off...
B&H Photo has 21″ and 27″ Apple iMacs on sale for up to $200 off MSRP including free shipping plus NY sales tax only: - 27″ 3.3GHz iMac 5K: $2099 $200 off MSRP - 27″ 3.2GHz/1TB Fusion iMac 5K: $... Read more
Holiday sale: Mac minis for $50 to $100 off M...
B&H Photo has Mac minis on sale for up to $100 off MSRP free shipping plus NY sales tax only: - 1.4GHz Mac mini: $449 $50 off MSRP - 2.6GHz Mac mini: $629 $70 off MSRP - 2.8GHz Mac mini: $899 $... Read more
Mac Pros on sale for up to $300 off MSRP, ref...
B&H Photo has Mac Pros on sale for up to $300 off MSRP. Shipping is free, and B&H charges sales tax in NY only: - 3.7GHz 4-core Mac Pro: $2799, $200 off MSRP - 3.5GHz 6-core Mac Pro: $3699, $... Read more
12-inch WiFi Apple iPad Pros on sale for up t...
B&H Photo has 12″ WiFi Apple iPad Pros on sale for up to $50 off MSRP, each including free shipping. B&H charges sales tax in NY only: - 12″ Space Gray 32GB WiFi iPad Pro: $749 $50 off MSRP... Read more
9-inch Apple WiFi iPad Pros on sale for $20-$...
B&H Photo has 9.7″ Apple WiFi iPad Pros on sale for $20-$50 off MSRP, each including free shipping. B&H charges sales tax in NY only: - 9″ Space Gray 256GB WiFi iPad Pro: $779.95 $20 off MSRP... Read more
Apple refurbished 2015 15-inch MacBook Pros a...
Apple has Certified Refurbished 2015 15″ Retina MacBook Pros available starting at $1699. An Apple one-year warranty is included with each model, and shipping is free: - 15″ 2.2GHz Retina MacBook Pro... Read more
Back in stock! 13-inch 2.7GHz Retina MacBook...
Apple has Apple Certified Refurbished 2015 13″ 2.7GHz/128GB Retina MacBook Pros (MF839LL/A) available again for $1099 including free shipping. That’s $200 off MSRP, and it’s the lowest price... Read more

Jobs Board

*Apple* Retail - Multiple Positions- Philade...
Job Description: Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
*Apple* Retail - Multiple Positions- San Ant...
Job Description:SalesSpecialist - Retail Customer Service and SalesTransform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
*Apple* Products Tester Needed - Apple (Unit...
…we therefore look forward to put out products to quality test for durability. Apple leads the digital music revolution with its iPods and iTunes online store, Read more
SW Engineer *Apple* TV Frameworks - Apple I...
The Apple TV team is looking for a software...create features that reflect the look and feel of Apple TV. Description: Were looking for someone who is Read more
Hardware Design Validation Engineer - *Apple...
The Apple Watch team is looking for a Hardware Design Validation Engineer. This person will be part of the Apple Watch hardware team with responsibilities for Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.