TweetFollow Us on Twitter

September 94 - BALANCE OF POWER

BALANCE OF POWER

Tuning PowerPCMemory Usage

DAVE EVANS

[IMAGE 017-019_Balance_of_Pwr_h1.GIF]

If you care about the performance of code you write for the Power Macintosh, memory usage should be your foremost concern. With the PowerPCTM601 processor today, and even more important with future processors, memory usage of your code will have the greatest effect on its performance. Poorly written code will execute at a fraction of its potential, and often very simple changes will greatly improve the execution speed of your critical code.

Processors are improving much faster than the memory subsystems that support them. As the PowerPC chips move from 80 MHz to 100 MHz and beyond, their thirst for data to process and instructions to execute will increasingly tax memory. Memory caches attempt to mitigate that thirst, and all PowerPC processors come equipped with built-in caches. But your code can work well with a cache or it can work very poorly with a cache. I'll show you why and discuss what you can do to optimize your memory usage.

CACHE BASICS
As you know, a cache is simply very fast RAM that the processor can access quickly and that it uses to store recently referenced data and code. On the PowerPC processors, any data stored in the cache can be accessed without stalling the processor's pipeline. Accesses to data not in the cache will take about 20 times as long reading from main memory, or even 1 million times as long if the access causes a page fault with virtual memory. Getting and keeping your performance-critical code and data in the cache are therefore key to your execution speed.

A cache is divided into small blocks calledcache lines . On the PowerPC 601, for example, the cache has 1024 cache line blocks, each holding 32 bytes. In addition, the 601 will fetch two blocks when it can, making the cache line size effectively 64 bytes.

The first PowerPC processors have set associative caches of different sizes. The 601 has an eight-way set associative, unified cache that's 32K in size, and the 603 has a two-way set associative, split cache with 8K for data and 8K for instruction code. The termset associative refers to the way the cache relates to main memory, which is important to your performance. In some simple caching schemes, each cache line maps directly to specific areas of main memory; any access to one of these areas loads bytes into that cache line. But on the PowerPC processor, sets of cache lines are combined and then mapped to memory. There are eight cache lines in each set on the 601, and two in each set on the 603. An access to one of the areas mapped to a set will load bytes into the last-used cache line of the set, keeping the most frequently used cache lines from being purged. This more complicated scheme typically yields much better performance than the directly mapped cache.

CACHE THRASHING
The cache will most affect your performance when you're accessing large amounts of data. A typical example of this is walking through arrays to perform some operation. The best strategy is to minimize cache collisions during your accesses, and the best tactic for this is to access your data as sequentially as possible. If you walk through memory sequentially, you'll load the cache every 64 bytes, but all 64 bytes will be available for fast processing. Here's an example:

unsigned longdata[64][1024];
for (row = 2; row < 64; row++)
    for (column = 0; column < 1024; column++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

This example performs additions on each element of a large matrix and accesses that matrix sequentially in memory. It walks across each row, adding elements and storing the result. But just inverting the loops can significantly change the way memory is accessed:

unsigned longdata[64][1024];
for (column = 0; column < 1024; column++)
    for (row = 2; row < 64; row++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

Reversing the loops leads to less than optimal performance since we perform each addition for all the columns before moving to the next element of a row. Instead of sequential access, this access pattern jumps across memory in even steps of 4K. Unfortunately, on the PowerPC processor these accesses map to the same set of cache lines, and every operation causes the cache to reload from main memory. This second example takes twice as long to execute the same calculation on a Power Macintosh 6100/60.

By paying attention to how your code accesses memory, you can avoid serious cache thrashing like that done by the second example. Things to look out for are loops that iterate for a power of 2 steps (128, 256, and so on)and code whose memory accesses are not close together.

An approach called blocking may help your loops. Often your code isn't as simple as above, and your memory accesses aren't regular during the loop. If you're walking two different arrays with different increments through memory, it may be impossible to serialize your accesses. Blocking performs the calculations in blocks of rows and columns. Instead of iterating across all the columns and then proceeding to the next row, you divide the dimensional space into blocks and calculate one whole block at a time. In this next example, we calculate the multiplication of two matrices.

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column++) {
        long    sum = 0;
        for (i = 0; i < 128; i++)
            sum += foo[row][i] * bar[i][column];
        result[row][column] = sum;
    }

As this algorithm walks through memory, it accesses result and foo sequentially, but bar is accessed in 256-byte steps. Accessing bar by jumping through memory causes cache misses, and sequential elements of bar are flushed from the cache before they're needed.

By performing this operation in small blocks, we can better use the cache. The key is to use all the elements of foo and bar that are in a cache line before moving on. One way to do this is to expand the loop and perform four operations in a single iteration:

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column += 4) {
        long    sum1 = 0, sum2 = 0;
        long    sum3 = 0, sum4 = 0;
        for (i = 0; i < 128; i++) {
            sum1 += foo[row][i] * bar[i][column];
            sum2 += foo[row][i] * bar[i][column+1];
            sum3 += foo[row][i] * bar[i][column+2];
            sum4 += foo[row][i] * bar[i][column+3];
    }
    result[row][column] = sum1;
    result[row][column+1] = sum2;
    result[row][column+2] = sum3;
    result[row][column+3] = sum4;
    }

This expanded loop calculates a block of four cells in each iteration. This executes faster because elements of bar are read from the cache and don't always cause cache misses as in the earlier example. Notice that in the expanded inner loop, a cache line of the bar matrix will be loaded the first time that it's referenced; then the following three references to bar will occur without stalling. Using the bar elements while they're still in the cache gives us a significant improvement.

CODE STYLE
Good compilers can pay attention to your memory accesses and will optimize how you access memory. For example, load and store operations can be reordered by the compiler to occur when the data is most likely to be available. The first time data is accessed it tends to cause a cache line to load, and subsequent accesses to nearby data must also wait for the cache load to complete. The compiler may be able to help by inserting a few instructions between the loads. This way the cache line will be fully loaded when the subsequent accesses are needed.

For more information on loop expansion and instruction reordering, see the Balance of Power column in develop Issue 18.*

You can help your compiler by using local variables when you can. These tell the compiler exactly how the data will be used, enabling it to easily reorder the loads and stores for this data.

You should also carefully note memory dereferences, especially double dereferences. Although it may be obvious to you, the compiler often can't tell whether two pointers address the same object in memory. The compiler may be prevented from reordering instructions because it can't tell whether two operations are really dependent on each other, just because they contain dereferences. Here's an example:

paramBlock->size = myStructure->size;
paramBlock->offset = myStructure->offset;

Although it appears obvious, the compiler usually can't tell if paramBlock references the same memory as myStructure. In the resulting binary, the compiler will be conservative and not reorder these operations for best execution. Replacing the dereference of myStructure with local variables for size and offset will allow the compiler to fully optimize this example.

INSTRUCTION THRASHING
Your code binary itself can cause the cache to thrash as it loads to be executed. This is very hard to detect and optimize. The basic problem is that your subroutines may map to the same areas of the cache, and frequent calls among them will stall to reload the cache. Some code profilers for RISC workstations have attempted to detect this problem, but for the Macintosh I can't suggest much help. Just changing the link order of your code and then executing profiles may have an effect; some link orders will thrash more than others.

DATA STRUCTURES
The layout of your data structures can greatly affect your cache usage and your memory usage in general. For example, memory accesses that cross 64-bit memory boundaries take twice as long to process, as this forces two bus transactions. On the PowerPC 601 processor, any misaligned data access within a memory boundary takes the standard amount of time, which (because of typical Macintosh data structures) is a valuable feature of the chip; future PowerPC processors, however, may take longer to access misaligned data. If you can align your data structures, do so now. A good tactic is to keep 64-bit data at the top of your structure, followed by your 32-bit data, and so on to prevent accidental misalignment of elements. Pad the end of the structure to an even 64-bit increment if you will have arrays of structures or will allocate them on the stack. And if certain parts of your structure are accessed much more often than other parts, keep these together so that they stay in the cache, and make sure they're aligned.

DON'T FORGET
The memory usage of your speed-critical code will greatly affect its performance today, and current problems will just get worse when PowerPC processors go above 100 MHz. Profile your code to find the most critical bottlenecks; then pay close attention to how that code addresses memory. You'll be rewarded with an excellent return on your investment.

DAVE EVANS occasionally uses the combinatorics skills he learned at the Massachusetts Institute of Technology, but more often he's been practicing his combination punches at a Thai kickboxing gym. Designing fast algorithms for Apple's OS Platforms Group is definitely rewarding, but developing a fast left hook really gets him pumped up. *

Thanks to Tom Adams, Mike Cappella, Rob Johnston, and Mike Neil for reviewing this column. *

 
AAPL
$517.96
Apple Inc.
-3.72
MSFT
$39.75
Microsoft Corpora
+0.57
GOOG
$536.44
Google Inc.
+3.92

MacTech Search:
Community Search:

Software Updates via MacUpdate

Starcraft II: Wings of Liberty 1.1.1.180...
Download the patch by launching the Starcraft II game and downloading it through the Battle.net connection within the app. Starcraft II: Wings of Liberty is a strategy game played in real-time. You... Read more
Sibelius 7.5.0 - Music notation solution...
Sibelius is the world's best-selling music notation software for Mac. It is as intuitive to use as a pen, yet so powerful that it does most things in less than the blink of an eye. The demo includes... Read more
Typinator 5.9 - Speedy and reliable text...
Typinator turbo-charges your typing productivity. Type a little. Typinator does the rest. We've all faced projects that require repetitive typing tasks. With Typinator, you can store commonly used... Read more
MYStuff Pro 2.0.16 - Create inventories...
MYStuff Pro is the most flexible way to create detail-rich inventories for your home or small business. Add items to MYStuff by dragging and dropping existing information, uploading new images, or... Read more
TurboTax 2013.r17.002 - Manage your 2013...
TurboTax guides you through your tax return step by step, does all the calculations, and checks your return for errors and overlooked deductions. It lets you file your return electronically to get... Read more
TrailRunner 3.8.769 - Route planning for...
Note: While the software is classified as freeware, it is actually donationware. Please consider making a donation to help support development. TrailRunner is the perfect companion for runners,... Read more
Flavours 1.1.10 - Create and apply theme...
Flavours is a Mac application that allow users to create, apply and share beautifully designed themes. Classy Give your Mac a gorgeous new look by applying delicious themes! Easy Unleash your... Read more
Spotify 0.9.8.296. - Stream music, creat...
Spotify is a streaming music service that gives you on-demand access to millions of songs. Whether you like driving rock, silky R&B, or grandiose classical music, Spotify's massive catalogue... Read more
SlingPlayer Plugin 3.3.20.475 - Browser...
SlingPlayer is the screen interface software that works hand-in-hand with the hardware inside the Slingbox to make your TV viewing experience just like that at home. It features an array of... Read more
S.M.A.R.T. for USB and FireWire Drives 0...
S.M.A.R.T. for USB and FireWire Drives is a kernel driver for OS X external usb or firewire drives. It extends the standard driver behaviour by providing access to drive smart data. The interface to... Read more

Latest Forum Discussions

See All

148Apps Live on Twitch: Pivvot’s Looper...
On our latest Twitch stream, we’ll be playing a pair of minimalist arcade games, one that just got a big content update in Pivvot, and another that was inspired by it in 15 Coins. Whitaker Trebella, creator of Pivvot, will discuss the new modes... | Read more »
Word Cubes Review
Word Cubes Review By Jordan Minor on April 15th, 2014 Our Rating: :: SQUARESVILLEUniversal App - Designed for iPhone and iPad Word Cubes is fine, but it is barely any different from any other word game.   | Read more »
PAX East 2014 – Desert Fox: The Battle o...
PAX East 2014 – Desert Fox: The Battle of El Alamein is Coming to iOS Soon Posted by Rob Rich on April 15th, 2014 [ permalink ] Shenandoah Studio has become one of the go-to developers for war games on iOS, with | Read more »
Tank of Tanks Review
Tank of Tanks Review By Carter Dotson on April 15th, 2014 Our Rating: :: TANKS A LOT!iPad Only App - Designed for the iPad This multiplayer game played on a single iPad is simple, chaotic fun.   | Read more »
PAX East 2014 – Dungeon of the Endless J...
PAX East 2014 – Dungeon of the Endless Just Might Have a Shot at an iPad Release Posted by Rob Rich on April 15th, 2014 [ permalink ] I think it’s fair to say that | Read more »
SideSwype Review
SideSwype Review By Carter Dotson on April 15th, 2014 Our Rating: :: ON YOUR SIDEUniversal App - Designed for iPhone and iPad SideSwype is a puzzler that takes inspiration from Threes, but becomes its own incredibly fun game.   | Read more »
PAX East 2014 – Bigfoot Hunter Invites P...
PAX East 2014 – Bigfoot Hunter Invites Players on a Wild and Wooly Photo Safari Posted by Rob Rich on April 15th, 2014 [ permalink ] Yeti. Sasquatch. Wendigo. | Read more »
Dungeon Quest Review
Dungeon Quest Review By Cata Modorcea on April 15th, 2014 Our Rating: :: NO STORY, BUT GOOD FUNUniversal App - Designed for iPhone and iPad Dungeon Quest does a lot of things right, but ultimately forgets about one of the core... | Read more »
Tempo AI and Speek Join Forces to “Kill...
Tempo AI and Speek Join Forces to “Kill the Conference Call PIN” Posted by Rob Rich on April 15th, 2014 [ permalink ] Today Tempo AI, makers of Tempo Smart Calendar, and | Read more »
Fighting Fantasy: Starship Traveller Rev...
Fighting Fantasy: Starship Traveller Review By Jennifer Allen on April 15th, 2014 Our Rating: :: A SIGNIFICANT VOYAGEUniversal App - Designed for iPhone and iPad Continuing the release of Fighting Fantasy titles, Starship Traveller... | Read more »

Price Scanner via MacPrices.net

Save $50 on Mac mini Server
B&H Photo has the 2012 Mac mini Server on sale for $949 including free shipping plus NY sales tax only. Their price is $50 off MSRP. Read more
PhatWare’s “Ultimate Writing App For iOS” Ren...
PhatWare Corp. has announced it has renamed its new WritePro word processing app for iPhone and iPad: WritePad Pro. The decision to change the app’s name to leverages the strong brand awareness and... Read more
Full Resolution Photo Editor Tint Mint 1.0 Re...
California based independent developer, Jeffrey Sun, creator of the iOS app Modern Editor, has released Tint Mint, a new photography app for editing enthusiasts. The app costs a dollar, and it packs... Read more
16GB iPad mini (Apple refurbished) available...
The Apple Store has refurbished 1st generation 16GB iPad minis available for $249 including free shipping. Both black and white models are available. Read more
Save $120 on the 27-inch 3.2GHz Haswell iMac
B&H Photo has the 27″ 3.2GHz iMac on sale for $1679.99 including free shipping plus NY sales tax only. Their price is about $120 off MSRP. Read more
Using a Mac Doesn’t Eliminate The Heartbleed...
Low End Mac’s Dan Knight notes that any time you visit a website with an https: prefix or see that secure lock icon on your browser, some type of security software is busy trying to protect your data... Read more
AirPrint Basics Tutorial Posted
A new Apple Knowledge Base article helps get you started using AirPrint, the Apple protocol that enables instant printing from iPad, iPhone, iPod touch, and Mac without the need to install drivers or... Read more
Speed Tips For Running LibreOffice On The Mac
LibreOffice is my favorite of several free, open-source application suites, and the one I have configured on my Mac as my default app for Word documents that one frequently has to deal with. It also... Read more
Snag a 15-inch Retina MacBook Pro for $115 of...
B&H Photo has 2013 15″ Retina MacBook Pros on sale for up to $115 off MSRP for a limited time. Shipping is free, and B&H charges NY sales tax only: - 15″ 2.3GHz Retina MacBook Pro: $2489.99... Read more
MacBook Airs on sale for $50 to $100 off MSRP
Several resellers are offering $50-$100 discounts on 11″ and 13″ MacBook Airs today, including Amazon, Best Buy, B&H, and others. See the breakdown of deals on our MacBook Air Price Tracker,... Read more

Jobs Board

Position Opening at *Apple* - Apple (United...
**Job Summary** Every day, business customers come to the Apple Store to discover what powerful, easy-to-use Apple products can do for them. As a Business Leader, Read more
Position Opening at *Apple* - Apple (United...
…challenges of developing individuals, building teams, and affecting growth across Apple Stores. You demonstrate successful leadership ability - focusing on excellence Read more
Position Opening at *Apple* - Apple (United...
…Summary** As a Specialist, you help create the energy and excitement around Apple products, providing the right solutions and getting products into customers' hands. You Read more
Position Opening at *Apple* - Apple (United...
**Job Summary** The Apple Store is a retail environment like no other - uniquely focused on delivering amazing customer experiences. As an Expert, you introduce people Read more
Position Opening at *Apple* - Apple (United...
**Job Summary** Being a Business Manager at an Apple Store means you're the catalyst for businesses to discover and leverage the power, ease, and flexibility of Apple Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.