TweetFollow Us on Twitter

June 94 - BALANCE OF POWER

BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS

[IMAGE 055-057_Balance_of_Power1.GIF]

When you convert your applications to native PowerPC code, they run lightning fast. To get the most out of RISC processors, however, you need to pay close attention to your code structure and execution. Fast code is no longer measured solely by an instruction timing table. The Power PC 601 processor includes pipelining, multi-issue and speculative execution, branch prediction, and a set associative cache. All these things make it hard to know what code will run fastest on a Power Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good optimizing compiler to help you. In this column I'll pass on some of what I've learned about tuning Power PC code. There are gotchas and coding habits to avoid, and there are techniques for squeezing the most from your speed-critical native code. For a good introduction to RISC pipelining and related concepts that appear in this column, see "Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every machine clock cycle, but RISC processors can do this only in the best of circumstances. At their worst they're as slow as CISC processors. The following loop, for example, averages only one calculation every 2.8 cycles:

float a[], b[], c[], d, e;
for (i=0; i < gArraySize; i++) {
  e = b[i] + c[i] / d;
  a[i] = MySubroutine(b[i], e);
}

By restructuring the code and using other techniques from this column, you can make significant improvements. This next loop generates the same result, yet averages one calculation every 1.9 cycles -- about 50% faster.

reciprocalD = 1 / d;
for (i=0; i < gArraySize; i+=2) {
  float result, localB, localC, localE;
  float result2, localB2, localC2, localE2;

  localB = b[i];
  localC = c[i];
  localB2 = b[i+1];
  localC2 = c[i+1];

  localE = localB + (localC * reciprocalD);
  localE2 = localB2 + (localC2 * reciprocalD);
  InlineSubroutine(&result, localB, localE);
  InlineSubroutine(&result2, localB2, localE2);

  a[i] = result;
  a[i+1] = result2;
}

The rest of this column explains the techniques I just used for that speed gain. They include expanding loops, scoping local variables, using inline routines, and using faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its point of view. You should understand how it looks at your code and what assumptions and optimizations it's allowed to make. The more you empathize with your compiler, the more you'll recognize opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code line by line usually isn't optimal, because the processor stalls to wait for dependent instructions. The compiler tries to move instr uctions that are independent into the stall points. For example, consider this code:

first = input * numerator;
second = first / denominator;
output = second + adjustment;

Each line depends on the previous line's result, and the compiler will be hard pressed to keep the pipeline full of useful work. This simple example could cause 46 stalled cycles on the PowerPC 601, so the compiler will look at other nearby code for independent instructions to move into the stall points.

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance in several ways. Loop expanding is one of the simplest methods. The idea is to perform more than one independent operation in a loop, so that the compiler can reorder more work in the pipeline and thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

float a[], b[], c[], d;
for (i=0; i < multipleOfThree; i++) {
  a[i] = b[i] + c[i] * d;
}

If we know the data always occurs in certain sized increments, we can do more steps in each iteration, as in the following:

for (i=0; i < multipleOfThree; i+=3) {
  a[i] = b[i] + c[i] * d;
  a[i+1] = b[i+1] + c[i+1] * d;
  a[i+2] = b[i+2] + c[i+2] * d;
}

On a CISC processor the second loop wouldn't be much faster, but on the Power PC processor the second loop is twice as fast as the first. This is because the compiler can schedule independent instructions to keep the pipeline constantly moving. (If the data doesn't occur in nice increments, you can still expand the loop; just add a small loop at the end to handle the extra iterations.)Be careful not to expand a loop too much, though. Very large loops won't fit in the cache, causing cache misses for each iteration. In addition, the larger a loop gets, the less work can be done entirely in registers. Expand too much and the compiler will have to use memory  to store intermediate results, outweighing your marginal gains. Besides, you get the biggest gains from the first few expansions.

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the PowerPC chip -- 32 general registers and 32 floating-point registers. By having so many, the processor can often avoid slow memory operations. Your compiler will take advantage of this when it can, but you can help it by carefully scoping your variables and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler examines the scope of each variable when it schedules registers, and your code can provide valuable information about the usage of each variable. Here's an example:

for (i=0; i < gArraySize; i++) {
  a[i] = MyFirstRoutine(b[i], c[i]);
  b[i] = MySecondRoutine(a[i], c[i]);
} 

In this loop, the global variable gArraySize is scoped for the whole program. Because we call a subroutine in the loop, the compiler can't tell if gArraySize will change during each iteration. Since the subroutine might modify gArraySize, the compiler has to be conservative. It will reload gArraySize from memory on every iteration, and it won't optimize the loop any further. This is wastefully slow.

On the other hand, if we use a local  variable, we tell the compiler that gArraySize and c[i] won't be modified and that it's all right to just keep them handy in registers. In addition, we can store data as temporary variables scoped only within the loop. This tells the compiler how we intend to use the data, so that the compiler can use free registers and discard them after the loop. Here's what this would look like:

arraySize = gArraySize;
for (i=0; i < arraySize; i++) {
  float localC;
  localC = c[i];
  a[i] = MyFirstRoutine(b[i], localC);
  b[i] = MySecondRoutine(a[i], localC);
} 

These minor changes give the compiler more information about the data, in this instance accelerating the resulting code by 25%.

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated dereferences and typecasting, chances are the object code has wasteful memory instructions and inefficient register usage. A great compiler might optimize well anyway, but don't count on it. Judicious use of temporary variables (as mentioned above) will help the compiler understand exactly what you're doing -- plus your code will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles on the Macintosh. Code often contains double memory dereferences, which is important when memory can move. But when you can guarantee that memory won't  move, use a local pointer, so that you only dereference a handle once. This saves load instructions and allows fur ther optimizations. Casting data types is usually a free operation -- you're just telling the compiler that you know you're copying seemingly incompatible data. But it's not  free if the data types have different bit sizes, which adds conversion instructions. Again, avoid this by using local variables for the commonly casted data.

I've heard many times that branches are "free" on the PowerPC processor. It's true that often the pipeline can keep moving even though a branch is encountered, because the branch execution unit will try to resolve branches very early in the pipeline or will predict the direction of the branch. Still, the more subroutines you have, the less your compiler will be able to reorder and intelligently schedule instructions. Keep speed-critical code together, so that more of it can be pipelined and the compiler can schedule your registers better. Use inline routines for short operations, as I did in the improved version of the first example loop in this column.

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know about. Some are processor model specific. For example, the PowerPC 601 has 32K of cache, while the 603 has 16K split evenly into an instruction cache and a data cache. But in general you should know about floating-point performance and the virtues of memory alignment.

Floating-point multiplication is wicked fast -- up to nine times  the speed of integer multiplication. Use floating-point multiplication if you can. Floating-point division takes 17 times as long, so when possible multiply by a reciprocal instead of dividing.

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to unaligned data stall while the processor loads different words and then shifts and splices them. For example, be sure to align floating-point data to 64-bit boundaries, or you'll stall for four cycles while the processor loads 32-bit halves with two 64-bit accesses.

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about tweaking its performance at all. For your speed-critical code, though, these tips I've given you can make the difference between "too slow" and "fast enough."

RECOMMENDED READING

  • High-Performance Computing  by Kevin Dowd (O'Reilly & Associates, Inc., 1993).
  • High-Performance Computer Architecture  by Harold S. Stone (Addison-Wesley, 1993).
  • PowerPC 601 RISC Microprocessor User's Manual (Motorola, 1993).

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's been repeatedly thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle. Fixing engine stalls, poor timing, and rough starts proved difficult, but he was recently rewarded with the guttural purr of a well-tuned Harley. *

Code examples were compiled with the PPCC compiler using the speed optimization option, and then run on a Power Macintosh 6100/66 for profiling. A PowerPC 601 microsecond timing library is provided on this issue's CD. *

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

beaTunes 4.6.14 - Organize your music co...
beaTunes is a full-featured music player and organizational tool for music collections. How well organized is your music library? Are your artists always spelled the same way? Any R.E.M. vs REM?... Read more
iDefrag 5.1.8 - Disk defragmentation and...
iDefrag helps defragment and optimize your disk for improved performance. Features include: Supports HFS and HFS+ (Mac OS Extended). Supports case sensitive and journaled filesystems. Supports... Read more
Day One 2.1.8 - Maintain a daily journal...
Day One is the easiest and best-looking way to use a journal / diary / text-logging application for the Mac. Day One is well designed and extremely focused to encourage you to write more through... Read more
Spotify 1.0.53.758. - Stream music, crea...
Spotify is a streaming music service that gives you on-demand access to millions of songs. Whether you like driving rock, silky R&B, or grandiose classical music, Spotify's massive catalogue puts... Read more
VirtualBox 5.1.20 - x86 virtualization s...
VirtualBox is a family of powerful x86 virtualization products for enterprise as well as home use. Not only is VirtualBox an extremely feature rich, high performance product for enterprise customers... Read more
Arq 5.7.9 - Online backup to Google Driv...
Arq is super-easy online backup for Mac and Windows computers. Back up to your own cloud account (Amazon Cloud Drive, Google Drive, Dropbox, OneDrive, Google Cloud Storage, any S3-compatible server... Read more
Vienna 3.1.10 :d05d7a5d: - RSS and Atom...
Vienna is a freeware and Open-Source RSS/Atom newsreader with article storage and management via a SQLite database, written in Objective-C and Cocoa, for the OS X operating system. It provides... Read more
WhiteCap 6.7 - Visual plug-in for iTunes...
WhiteCap is a sleek and sophisticated music visualizer and screensaver that features futuristic, wireframe mesh visuals with dynamic backgrounds and colors. WhiteCap contains thousands of visual... Read more
Dropbox 24.4.16 - Cloud backup and synch...
Dropbox is an application that creates a special Finder folder that automatically syncs online and between your computers. It allows you to both backup files and keep them up-to-date between systems... Read more
Amazon Chime 4.2.5645 - Amazon-based com...
Amazon Chime is a communications service that transforms online meetings with a secure, easy-to-use application that you can trust. Amazon Chime works seamlessly across your devices so that you can... Read more

Latest Forum Discussions

See All

ChordFlow (Music)
ChordFlow 1.0.0 Device: iOS Universal Category: Music Price: $6.99, Version: 1.0.0 (iTunes) Description: ChordFlow is a chord sequencer with a unique 4-track polyphonic arpeggiator, extensive chord library, MIDI out and Ableton Link... | Read more »
The Walking Dead: A New Frontier is out...
The newest season of Telltale Games'The Walking Dead is well underway. After the release of the third episode, "Above the Law" about a month ago, episode four, "Thicker Than Water" is hot and ready for more zombies and gut-wrenching emotional... | Read more »
Best games we played this week
Another week, another new wave of mobile games do dive into. We've dug through the list of apps that came out this week to tell you which apps are worth your sweet time. And while there weren't too many games this week, there were some big ones.... | Read more »
Vignettes (Games)
Vignettes 1.0.1 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0.1 (iTunes) Description: Vignettes is a casual but unique exploration game without text or characters, where objects shapeshift as you spin them around... | Read more »
Get Me Outta Here is an 80s retro shoote...
Are you ready to fight some aliens? Because Crescent Moon Games has released the retro shooter Get Me Outta Here on iOS devices today. [Read more] | Read more »
Get a bunch of Apple productivity apps f...
If you're an Apple Mac owner, you're probably aware of the host of Apple productivity apps the company includes in all new Mac purchases. Apps like iMovie, Keynote, and of course, GarageBand. While you used to be able to also buy these apps... | Read more »
Terra Mystica (Games)
Terra Mystica 1.03 Device: iOS Universal Category: Games Price: $9.99, Version: 1.03 (iTunes) Description: Short Summary:≈≈≈≈≈≈≈≈≈≈≈≈≈ | Read more »
Ms. Spell (Games)
Ms. Spell 1.0 Device: iOS Universal Category: Games Price: $.99, Version: 1.0 (iTunes) Description: Cast spells and battle monsters in this turn based game, that has you delving into ever the changing Dreadwood to retrieve the lost... | Read more »
Invert - A Minimal Puzzle Game (Games)
Invert - A Minimal Puzzle Game 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: Invert is a minimalist puzzle game for fans of brain teasers, logic puzzles, and Rubik's Cube fiddlers. | Read more »
Evergrow: Paper Forest (Games)
Evergrow: Paper Forest 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: Solve puzzles in the handcrafted forests of Evergrow through the eyes of an imaginative child and his parents. Discover... | Read more »

Price Scanner via MacPrices.net

15-inch Touch Bar MacBook Pros, Apple refurbi...
Apple is offering Certified Refurbished 2016 15″ Touch Bar MacBook Pros for $360 to $420 off original MSRP. An Apple one-year warranty is included with each model, and shipping is free: - 15″ 2.6GHz... Read more
13-inch MacBook Airs on sale for up to $150 o...
Overstock.com has 13″ MacBook Airs on sale for up to $150 off MSRP including free shipping: - 13″ 1.6GHz/128GB MacBook Air (sku MMGF2LL/A): $869.99 $130 off MSRP - 13″ 1.6GHz/256GB MacBook Air (sku... Read more
15-inch Touch Bar MacBook Pros on sale for $1...
B&H Photo has the new 2016 15″ Apple Touch Bar MacBook Pros in stock today and on sale for up to $200 off MSRP. Shipping is free, and B&H charges NY sales tax only: - 15″ 2.7GHz Touch Bar... Read more
15-inch 2.7GHz Touch Bar MacBook Pros on sale...
Amazon has 2016 15″ 2.7GHz Apple Touch Bar MacBook Pros in stock today and on sale for $150-$200 off MSRP. Shipping is free: - 15″ 2.7GHz Touch Bar MacBook Pro Space Gray (sku MLH42LL/A): $2599 $200... Read more
Apple now offering Certified Refurbished 13-i...
Apple is now offering Certified Refurbished 2016 13″ Touch Bar MacBook Pros for $270-$300 off original MSRP. An Apple one-year warranty is included with each model, and shipping is free: - 13″ 2.9GHz... Read more
MyGiHealth Digestive Symptom Tracker Version...
My Total Health, Inc. has announced the release of MyGiHealth 2.1, an important update to their digestive symptom tracker developed exclusively for iPhone, iPad and iPod touch devices. MyGiHealth is... Read more
Galaxy S8 Materials Costs Highest by Far Comp...
The new Samsung Galaxy S8 equipped with 64 gigabytes (GB) of NAND flash memory carries a bill of materials (BOM) cost that comes out to US$301.60, much higher than for previous versions of the... Read more
iCarMode 4.0 Car Dashboard App For iOS Integr...
Indie developer Diego Resnik has announced the release of iCarMode 4.0, an update to his productivity app developed for iOS devices. iCarMode has positioned itself as a true car dashboard app,... Read more
How to save $150+ on Apple’s 13-inch 2.0GHz n...
Apple Authorized Reseller B&H Photo has non-Touch Bar 13″ 2.0GHz MacBook Pros on sale for $150 off MSRP for a limited time. Shipping is free, and B&H charges NY sales tax only: - 13″ 2.0GHz... Read more
15-inch 2.2GHz Retina MacBook Pro, Apple refu...
Apple has Certified Refurbished 2015 15″ 2.2GHz Retina MacBook Pros available for $1699. That’s $300 off MSRP, and it’s the lowest price available for a 15″ MacBook Pro. An Apple one-year warranty is... Read more

Jobs Board

Music Marketing Lead, iTunes & *Apple*...
# Music Marketing Lead, iTunes & Apple Music Job Number: 56868140 Culver City, California, United States Posted: Apr. 17, 2017 Weekly Hours: 40.00 **Job Summary** Read more
*Apple* Media Products - Commerce Engineerin...
Apple Media Products - Commerce Engineering Manager Job Number: 57037480 Santa Clara Valley, California, United States Posted: Apr. 18, 2017 Weekly Hours: 40.00 Job Read more
*Apple* Mac Computer Technician - GeekHampto...
…complex computer issues over the phone and in person? GeekHampton, Long Island's Apple Premium Service Provider, is looking for you! Come work with our crew Read more
*Apple* Mobile Master - Best Buy (United Sta...
**493714BR** **Job Title:** Apple Mobile Master **Location Number:** 001024-Weatherford-Store **Job Description:** **What does a Best Buy Apple Mobile Master Read more
Best Buy *Apple* Computing Master - Best Bu...
**496963BR** **Job Title:** Best Buy Apple Computing Master **Location Number:** 001061-Marina-Store **Job Description:** **What does a Best Buy Apple Computing Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.