TweetFollow Us on Twitter

June 94 - BALANCE OF POWER

BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS

[IMAGE 055-057_Balance_of_Power1.GIF]

When you convert your applications to native PowerPC code, they run lightning fast. To get the most out of RISC processors, however, you need to pay close attention to your code structure and execution. Fast code is no longer measured solely by an instruction timing table. The Power PC 601 processor includes pipelining, multi-issue and speculative execution, branch prediction, and a set associative cache. All these things make it hard to know what code will run fastest on a Power Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good optimizing compiler to help you. In this column I'll pass on some of what I've learned about tuning Power PC code. There are gotchas and coding habits to avoid, and there are techniques for squeezing the most from your speed-critical native code. For a good introduction to RISC pipelining and related concepts that appear in this column, see "Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every machine clock cycle, but RISC processors can do this only in the best of circumstances. At their worst they're as slow as CISC processors. The following loop, for example, averages only one calculation every 2.8 cycles:

float a[], b[], c[], d, e;
for (i=0; i < gArraySize; i++) {
  e = b[i] + c[i] / d;
  a[i] = MySubroutine(b[i], e);
}

By restructuring the code and using other techniques from this column, you can make significant improvements. This next loop generates the same result, yet averages one calculation every 1.9 cycles -- about 50% faster.

reciprocalD = 1 / d;
for (i=0; i < gArraySize; i+=2) {
  float result, localB, localC, localE;
  float result2, localB2, localC2, localE2;

  localB = b[i];
  localC = c[i];
  localB2 = b[i+1];
  localC2 = c[i+1];

  localE = localB + (localC * reciprocalD);
  localE2 = localB2 + (localC2 * reciprocalD);
  InlineSubroutine(&result, localB, localE);
  InlineSubroutine(&result2, localB2, localE2);

  a[i] = result;
  a[i+1] = result2;
}

The rest of this column explains the techniques I just used for that speed gain. They include expanding loops, scoping local variables, using inline routines, and using faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its point of view. You should understand how it looks at your code and what assumptions and optimizations it's allowed to make. The more you empathize with your compiler, the more you'll recognize opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code line by line usually isn't optimal, because the processor stalls to wait for dependent instructions. The compiler tries to move instr uctions that are independent into the stall points. For example, consider this code:

first = input * numerator;
second = first / denominator;
output = second + adjustment;

Each line depends on the previous line's result, and the compiler will be hard pressed to keep the pipeline full of useful work. This simple example could cause 46 stalled cycles on the PowerPC 601, so the compiler will look at other nearby code for independent instructions to move into the stall points.

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance in several ways. Loop expanding is one of the simplest methods. The idea is to perform more than one independent operation in a loop, so that the compiler can reorder more work in the pipeline and thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

float a[], b[], c[], d;
for (i=0; i < multipleOfThree; i++) {
  a[i] = b[i] + c[i] * d;
}

If we know the data always occurs in certain sized increments, we can do more steps in each iteration, as in the following:

for (i=0; i < multipleOfThree; i+=3) {
  a[i] = b[i] + c[i] * d;
  a[i+1] = b[i+1] + c[i+1] * d;
  a[i+2] = b[i+2] + c[i+2] * d;
}

On a CISC processor the second loop wouldn't be much faster, but on the Power PC processor the second loop is twice as fast as the first. This is because the compiler can schedule independent instructions to keep the pipeline constantly moving. (If the data doesn't occur in nice increments, you can still expand the loop; just add a small loop at the end to handle the extra iterations.)Be careful not to expand a loop too much, though. Very large loops won't fit in the cache, causing cache misses for each iteration. In addition, the larger a loop gets, the less work can be done entirely in registers. Expand too much and the compiler will have to use memory  to store intermediate results, outweighing your marginal gains. Besides, you get the biggest gains from the first few expansions.

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the PowerPC chip -- 32 general registers and 32 floating-point registers. By having so many, the processor can often avoid slow memory operations. Your compiler will take advantage of this when it can, but you can help it by carefully scoping your variables and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler examines the scope of each variable when it schedules registers, and your code can provide valuable information about the usage of each variable. Here's an example:

for (i=0; i < gArraySize; i++) {
  a[i] = MyFirstRoutine(b[i], c[i]);
  b[i] = MySecondRoutine(a[i], c[i]);
} 

In this loop, the global variable gArraySize is scoped for the whole program. Because we call a subroutine in the loop, the compiler can't tell if gArraySize will change during each iteration. Since the subroutine might modify gArraySize, the compiler has to be conservative. It will reload gArraySize from memory on every iteration, and it won't optimize the loop any further. This is wastefully slow.

On the other hand, if we use a local  variable, we tell the compiler that gArraySize and c[i] won't be modified and that it's all right to just keep them handy in registers. In addition, we can store data as temporary variables scoped only within the loop. This tells the compiler how we intend to use the data, so that the compiler can use free registers and discard them after the loop. Here's what this would look like:

arraySize = gArraySize;
for (i=0; i < arraySize; i++) {
  float localC;
  localC = c[i];
  a[i] = MyFirstRoutine(b[i], localC);
  b[i] = MySecondRoutine(a[i], localC);
} 

These minor changes give the compiler more information about the data, in this instance accelerating the resulting code by 25%.

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated dereferences and typecasting, chances are the object code has wasteful memory instructions and inefficient register usage. A great compiler might optimize well anyway, but don't count on it. Judicious use of temporary variables (as mentioned above) will help the compiler understand exactly what you're doing -- plus your code will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles on the Macintosh. Code often contains double memory dereferences, which is important when memory can move. But when you can guarantee that memory won't  move, use a local pointer, so that you only dereference a handle once. This saves load instructions and allows fur ther optimizations. Casting data types is usually a free operation -- you're just telling the compiler that you know you're copying seemingly incompatible data. But it's not  free if the data types have different bit sizes, which adds conversion instructions. Again, avoid this by using local variables for the commonly casted data.

I've heard many times that branches are "free" on the PowerPC processor. It's true that often the pipeline can keep moving even though a branch is encountered, because the branch execution unit will try to resolve branches very early in the pipeline or will predict the direction of the branch. Still, the more subroutines you have, the less your compiler will be able to reorder and intelligently schedule instructions. Keep speed-critical code together, so that more of it can be pipelined and the compiler can schedule your registers better. Use inline routines for short operations, as I did in the improved version of the first example loop in this column.

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know about. Some are processor model specific. For example, the PowerPC 601 has 32K of cache, while the 603 has 16K split evenly into an instruction cache and a data cache. But in general you should know about floating-point performance and the virtues of memory alignment.

Floating-point multiplication is wicked fast -- up to nine times  the speed of integer multiplication. Use floating-point multiplication if you can. Floating-point division takes 17 times as long, so when possible multiply by a reciprocal instead of dividing.

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to unaligned data stall while the processor loads different words and then shifts and splices them. For example, be sure to align floating-point data to 64-bit boundaries, or you'll stall for four cycles while the processor loads 32-bit halves with two 64-bit accesses.

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about tweaking its performance at all. For your speed-critical code, though, these tips I've given you can make the difference between "too slow" and "fast enough."

RECOMMENDED READING

  • High-Performance Computing  by Kevin Dowd (O'Reilly & Associates, Inc., 1993).
  • High-Performance Computer Architecture  by Harold S. Stone (Addison-Wesley, 1993).
  • PowerPC 601 RISC Microprocessor User's Manual (Motorola, 1993).

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's been repeatedly thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle. Fixing engine stalls, poor timing, and rough starts proved difficult, but he was recently rewarded with the guttural purr of a well-tuned Harley. *

Code examples were compiled with the PPCC compiler using the speed optimization option, and then run on a Power Macintosh 6100/66 for profiling. A PowerPC 601 microsecond timing library is provided on this issue's CD. *

 
AAPL
$423.31
Apple Inc.
-8.46
MSFT
$34.66
Microsoft Corpora
-0.32
GOOG
$902.17
Google Inc.
+1.55

MacTech Search:
Community Search:

Software Updates via MacUpdate

Apple Java 2013-004 - For OS X 10.7 and...
Apple Java for OS X 2013-004 supersedes all previous versions of Java for OS X. This release updates the Apple-provided system Java SE 6 to version 1.6.0_51 and is for OS X versions 10.7 or later.... Read more
Google Chrome 27.0.1453.116 - Modern and...
Google Chrome is a Web browser by Google, created to be a modern platform for Web pages and applications. It utilizes very fast loading of Web pages and has a V8 engine, which is a custom built... Read more
EarthDesk 6.2 - Striking animated image...
EarthDesk replaces your static desktop picture with a rendered image of Earth showing correct sun, moon and city illumination. With an Internet connection, EarthDesk displays near real-time global... Read more
Apple Configurator 1.3 - Configure and d...
Apple Configurator makes it easy for anyone to mass configure and deploy iPhone, iPad, and iPod touch in a school, business, or institution. Three simple workflows let you prepare new iOS devices... Read more
Apple Java for Mac OS X 10.6 Update 16 -...
Apple Java for Mac OS X 10.6 Update 16 delivers improved security, reliability, and compatibility by updating Java SE 6 to 1.6.0_51.Version Update 16: See http://support.apple.com/kb/HT5744 for more... Read more
Neat 4.0.3 - Digital filing system for r...
Neat (formerly NeatWorks) is a powerful scanning and digital filing system that enables you to scan and organize receipts, business cards, and documents. Unlike other scanning software, NeatWorks... Read more
Adobe Muse CC 5.0 - Design and publish H...
Adobe Muse enables designers to create websites as easily as creating a layout for print. Design and publish original HTML pages using the latest Web standards, and without writing code. Now in beta... Read more
Adobe Creative Cloud 1.0 - Everything ne...
Adobe Creative Cloud costs $49.99/month (or less if you're a previous Creative Suite customer). Creative Suite 6 is still available for purchase (without a monthly plan) if you prefer. Introducing... Read more
Adobe Flash Professional CC 13.0.0.759 -...
Flash Professional CC is available as part of Adobe Creative Cloud for as little as $19.99/month (or $9.99/month if you're a previous Flash Professional customer). Flash Professional CS6 is still... Read more
Adobe InCopy CC 9.0 - Create streamlined...
InCopy CC is available as part of Adobe Creative Cloud for as little as $19.99/month (or $9.99/month if you're a previous InCopy customer). InCopy CS6 is still available for purchase (without a... Read more

Latest Forum Discussions

See All

Calendars+ by Readdle Goes Free For A Ve...
Calendars+ by Readdle Goes Free For A Very Limited Time Posted by Andrew Stevens on June 19th, 2013 [ permalink ] Universal App - Designed for iPhone and iPad | Read more »
Modern Combat 4: Zero Hour Has A Meltdow...
Modern Combat 4: Zero Hour Has A Meltdown, Gets New Maps, Multiplayer Modes, and More Posted by Andrew Stevens on June 19th, 2013 [ permalink ] | Read more »
XCOM: Enemy Unknown – Commander’s Log: H...
Part of the series 148Apps Goes Deep on XCOM: Enemy Unknown I’m still haunted by visions of a parallel world (classified as Xbox 360) as it wasn’t long ago that I was in charge of the XCOM project and led a squadron of soldiers against an alien... | Read more »
Rovio Stars: The Angry Birds’ New Publis...
Rovio Entertainment, creators of Angry Birds, has a new publishing initiative called Rovio Stars that will see its first titles Icebreaker and Tiny Thief released soon. Kalle Kaivola, Senior Vice President of Product & Publishing at Rovio... | Read more »
Favorite Four: Soccer Games
As a soccer fan, I’m getting twitchy. The Confederations Cup might be helping a little, but I miss the English Premier League week in, week out. This is where I sink time into FIFA 13 on my console in order to counteract the problem. What about... | Read more »
Knights of Pen & Paper Adds More Dun...
Knights of Pen & Paper Adds More Dungeons and Loot In Free Update Posted by Andrew Stevens on June 19th, 2013 [ permalink ] | Read more »
Froot ‘n’ Nutz Review
Froot ‘n’ Nutz Review By Blake Grundman on June 19th, 2013 Our Rating: :: VISUALLY DICEYUniversal App - Designed for iPhone and iPad While Froot ‘n’ Nutz may not look very modern, it is very likable.   | Read more »
148Apps Goes Deep on XCOM: Enemy Unknown
XCOM: Enemy Unknown will be released tonight for iPad and iPhone. And we’re very excited. While XCOM isn’t the first console game to be ported over to iOS, it is one of the most ambitious. XCOM: Enemy Unknown while first released for XBox 360 and... | Read more »
A Cautionary Tail – An Interactive Book...
A Cautionary Tail – An Interactive Book That Teaches Self-Acceptance Posted by Andrew Stevens on June 19th, 2013 [ permalink ] | Read more »
XCOM: Enemy Unknown – Cheats, Tips, and...
The X-Com series, particularly the earlier games, are notoriously unforgiving. Although while XCOM: Enemy Unknown has been modernized, and is therefore more player friendly, it’s no slouch either. In fact, even on the Normal difficulty there’s a... | Read more »

Price Scanner via MacPrices.net

Smaller Tablets Forecast To Get Even More Popular...
The DisplaySearch Blog’s Richard Shim notes that tablet PCs with screen sizes smaller than 9 inches are currently forecast to account for 66% of tablet PC shipments for the year but that share is... Read more
Updated iPad Price Trackers
We’ve updated our iPad Price Tracker and our iPad mini Price Tracker with the latest information on prices and availability from Apple and other resellers. Read more
Apple refurbished iPod nanos available for $99
The Apple Store has Apple Certified Refurbished 16GB iPod nanos available for $99 including free shipping and Apple’s standard one-year warranty. That’s $50 off the cost of new nanos. All colors are... Read more
iFixIt Tears Down mid-2013 11.6-inch MacBook Air
iFixIt Chief Information Architect Miroslav Djuric says: The epic week of disassembly continues: Today, the MacBook Air 11″ found its way onto our teardown table and was soon just another Apple in... Read more
Mature Consumers Know When They Need a PC
Tech.Pinions’ Ben Bajarin sensibly observes that one of the fundamental characteristics of a mature market is mature consumers – mature in the sense that they know what they want and more importantly... Read more
Windows 8 Continues Ascension in User Popularity R...
Softpedia’s Bogdan Popa notes that Windows 8 is now the fourth most popular operating system in the world, and according to some new statistics, it continues to gain new users every day. Popa cites... Read more
Apple iOS and OS X Updates Put Bluetooth Smart Rea...
From its Worldwide Developers Conference last week, Apple announced unprecedented integration of Bluetooth technology into its operating systems – a move that sets the bar for Bluetooth integration... Read more
Buy a 13″ MacBook Pro, get AppleCare for as little...
Adorama has 13″ MacBook Pros bundled with 3-year AppleCare Protection Plans for as little as $40 extra (AppleCare has an MSRP of $249 for 13-inch MacBook Pros). Shipping is free, and Adorama charges... Read more
Updated MacBook Price Trackers
We’ve updated our MacBook Price Trackers with the latest information on prices, bundles, and availability on MacBook Airs, MacBook Pros, and the MacBook Pros with Retina Displays from Apple’s... Read more
Save $140 on the 15″ 2.3GHz MacBook Pro
B&H Photo has the 15″ 2.3GHz MacBook Pro on sale for $1659 including free shipping. Their price is $140 off MSRP. B&H will include free copies of Parallels Desktop, Bento Database, and LoJack... Read more

Jobs Board

*Apple* At-Home Team Manager - Apple (U...
Changing the world is all in a day's work at Apple . If you love innovation, here's your chance to make a career of it. You'll work hard. But the job comes with more than Read more
*Apple* Retail - Manager - Apple (Unite...
Job SummaryKeeping an Apple Store thriving requires a diverse set of leadership skills, and as a Manager, youre a master of them all. In the stores fast-paced, dynamic Read more
*Apple* - Solution Architect - CompuCom...
Job Location: US-TX-Dallas Posted Date: 4/18/2013 Overview: The Apple Solution Architect (SA) will be responsible for supporting pre-sales and post-sales solutions in Read more
*Apple* Support Technician; Mid-level -...
A Kforce client in Washington, DC area is seeking an Apple Support Technician. This contractor will have the following types of responsibilities including, but not Read more
Systems Engineer - *Apple* TV - Apple...
Job Summary The Apple TV team is looking for an experienced engineer with a passion for delivering first in class home entertainment solutions. The individual must be Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.