
September 94 - BALANCE OF POWER


Tuning PowerPC Memory Usage



If you care about the performance of code you write for the Power Macintosh, memory usage should be your foremost concern. With the PowerPC 601 processor today, and even more so with future processors, the memory usage of your code will have the greatest effect on its performance. Code that uses memory poorly will execute at a fraction of its potential speed, and often very simple changes will greatly improve the execution speed of your critical code.

Processors are improving much faster than the memory subsystems that support them. As the PowerPC chips move from 80 MHz to 100 MHz and beyond, their thirst for data to process and instructions to execute will increasingly tax memory. Memory caches attempt to mitigate that thirst, and all PowerPC processors come equipped with built-in caches. But your code can work well with a cache or it can work very poorly with a cache. I'll show you why and discuss what you can do to optimize your memory usage.

As you know, a cache is simply very fast RAM that the processor can access quickly and that it uses to store recently referenced data and code. On the PowerPC processors, any data stored in the cache can be accessed without stalling the processor's pipeline. Accesses to data not in the cache will take about 20 times as long reading from main memory, or even 1 million times as long if the access causes a page fault with virtual memory. Getting and keeping your performance-critical code and data in the cache are therefore key to your execution speed.
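To put a rough number on that, here's a minimal sketch that estimates the average cost of a memory access from the fraction of accesses that hit the cache. It uses only the approximate 20-to-1 ratio quoted above; the function name and the constants are illustrative, not measurements, and page faults are ignored.

```c
#include <assert.h>

/* Relative cost of a cache hit (the baseline) and of a miss that goes
   to main memory. The factor of 20 is the rough figure quoted in the
   text, not a measured value. */
#define HIT_COST   1.0
#define MISS_COST 20.0

/* Average cost per access, in units of one cache hit, given the
   fraction of accesses (0.0 to 1.0) that hit the cache. */
double average_access_cost(double hit_rate)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * HIT_COST + (1.0 - hit_rate) * MISS_COST;
}
```

Under these assumptions, a 90 percent hit rate works out to 0.9 + 0.1 x 20 = 2.9, so even one miss in ten nearly triples the average cost of an access.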

A cache is divided into small blocks called cache lines. On the PowerPC 601, for example, the cache has 1024 cache line blocks, each holding 32 bytes. In addition, the 601 will fetch two blocks when it can, making the cache line size effectively 64 bytes.

The first PowerPC processors have set associative caches of different sizes. The 601 has an eight-way set associative, unified cache that's 32K in size, and the 603 has a two-way set associative, split cache with 8K for data and 8K for instruction code. The term set associative refers to the way the cache relates to main memory, which is important to your performance. In some simple caching schemes, each cache line maps directly to specific areas of main memory; any access to one of these areas loads bytes into that cache line. But on the PowerPC processor, sets of cache lines are combined and then mapped to memory. There are eight cache lines in each set on the 601, and two in each set on the 603. An access to one of the areas mapped to a set will load bytes into the least recently used cache line of the set, keeping the more recently used cache lines from being purged. This more complicated scheme typically yields much better performance than the directly mapped cache.
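The mapping from an address to a set can be sketched in a few lines of C. This assumes the conventional scheme in which the low-order address bits select the set, which the description above implies but doesn't spell out; the macro and function names are made up for illustration. The sizes are the 601's: 32K of cache with eight lines per set.

```c
#include <assert.h>

/* Assumed 601 geometry: 32K unified cache, eight lines per set. */
#define CACHE_BYTES   (32UL * 1024UL)
#define LINES_PER_SET 8UL

/* Bytes of memory covered by one pass across all the sets (4096). */
#define WAY_BYTES     (CACHE_BYTES / LINES_PER_SET)

/* Index of the set that a byte address maps to, for a given line
   size. Assumes simple modulo mapping on the low-order address bits. */
unsigned long set_index(unsigned long address, unsigned long line_bytes)
{
    assert(line_bytes > 0 && WAY_BYTES % line_bytes == 0);
    return (address % WAY_BYTES) / line_bytes;
}
```

Whether you count the line size as 32 or 64 bytes, any two addresses that are a multiple of 4K apart land in the same set under this scheme: set_index(0, 32) and set_index(4096, 32) are equal. Only eight such addresses can be cached at once.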

The cache will most affect your performance when you're accessing large amounts of data. A typical example of this is walking through arrays to perform some operation. The best strategy is to minimize cache collisions during your accesses, and the best tactic for this is to access your data as sequentially as possible. If you walk through memory sequentially, you'll load the cache every 64 bytes, but all 64 bytes will be available for fast processing. Here's an example:

unsigned long data[64][1024];
for (row = 2; row < 64; row++)
    for (column = 0; column < 1024; column++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

This example performs additions on each element of a large matrix and accesses that matrix sequentially in memory. It walks across each row, adding elements and storing the result. But just inverting the loops can significantly change the way memory is accessed:

unsigned long data[64][1024];
for (column = 0; column < 1024; column++)
    for (row = 2; row < 64; row++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

Reversing the loops leads to less than optimal performance, since we now perform the addition for every row of a column before moving to the next column. Instead of sequential access, this access pattern jumps across memory in even steps of 4K (one row of 1024 four-byte elements). Unfortunately, on the PowerPC processor these accesses map to the same set of cache lines, and every operation causes the cache to reload from main memory. On a Power Macintosh 6100/60, this second example takes twice as long to execute the same calculation.
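To see why the two loop orders behave so differently, you can count cache misses with a toy model. The sketch below is not a model of the real 601; it assumes 64 sets of eight 64-byte lines, simple modulo mapping of addresses to sets, strict least-recently-used replacement, and an array that starts at address 0. All the names are hypothetical. It replays the three memory references of each addition and counts how many of them miss.

```c
#include <assert.h>
#include <string.h>

/* Assumed geometry: 64 sets of 8 lines, 64 bytes per line (32K total). */
#define NUM_SETS   64
#define NUM_WAYS   8
#define LINE_BYTES 64

static unsigned long tags[NUM_SETS][NUM_WAYS];
static unsigned long stamps[NUM_SETS][NUM_WAYS];   /* 0 means empty */
static unsigned long now, misses;

/* Record one access to a byte address. Counts a miss if the line isn't
   cached, replacing the least recently used line of its set. */
static void touch(unsigned long address)
{
    unsigned long line = address / LINE_BYTES;
    unsigned long set = line % NUM_SETS;
    unsigned long tag = line / NUM_SETS;
    int way, victim = 0;

    now++;
    for (way = 0; way < NUM_WAYS; way++) {
        if (stamps[set][way] != 0 && tags[set][way] == tag) {
            stamps[set][way] = now;        /* hit: mark as just used */
            return;
        }
        if (stamps[set][way] < stamps[set][victim])
            victim = way;                  /* oldest or empty line so far */
    }
    misses++;
    tags[set][victim] = tag;
    stamps[set][victim] = now;
}

/* Byte offset of data[row][column] in a [64][1024] array of 4-byte values. */
static unsigned long element(int row, int column)
{
    return ((unsigned long)row * 1024UL + (unsigned long)column) * 4UL;
}

/* The three references made by one addition in the examples. */
static void touch_sum(int row, int column)
{
    assert(row >= 2);
    touch(element(row - 1, column));
    touch(element(row - 2, column));
    touch(element(row, column));
}

/* Count misses for the whole calculation. Pass 0 to walk row by row
   (the first example) or nonzero to walk column by column (the second). */
unsigned long count_misses(int column_major)
{
    int row, column;

    memset(tags, 0, sizeof tags);
    memset(stamps, 0, sizeof stamps);
    now = 0;
    misses = 0;

    if (column_major) {
        for (column = 0; column < 1024; column++)
            for (row = 2; row < 64; row++)
                touch_sum(row, column);
    } else {
        for (row = 2; row < 64; row++)
            for (column = 0; column < 1024; column++)
                touch_sum(row, column);
    }
    return misses;
}
```

In this model, count_misses(0) reports 4096 misses, one for each 64-byte line of the array, while count_misses(1) reports 65536, because the 64 rows of a column compete for the eight lines of a single set. Real hardware overlaps some of that cost with other work, so the measured slowdown is smaller than the miss counts alone suggest.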

By paying attention to how your code accesses memory, you can avoid serious cache thrashing like that done by the second example. Things to look out for are loops that step through memory in power-of-2 increments (128, 256, and so on) and code whose memory accesses are not close together.

An approach called blocking may help your loops. Often your code isn't as simple as above, and your memory accesses aren't regular during the loop. If you're walking two different arrays with different increments through memory, it may be impossible to serialize your accesses. Blocking performs the calculations in blocks of rows and columns. Instead of iterating across all the columns and then proceeding to the next row, you divide the dimensional space into blocks and calculate one whole block at a time. In this next example, we calculate the multiplication of two matrices.

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column++) {
        long    sum = 0;
        for (i = 0; i < 128; i++)
            sum += foo[row][i] * bar[i][column];
        result[row][column] = sum;
    }

As this algorithm walks through memory, it accesses result and foo sequentially, but bar is accessed in 256-byte steps. Accessing bar by jumping through memory causes cache misses, and sequential elements of bar are flushed from the cache before they're needed.

By performing this operation in small blocks, we can better use the cache. The key is to use all the elements of foo and bar that are in a cache line before moving on. One way to do this is to expand the loop and perform four operations in a single iteration:

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column += 4) {
        long    sum1 = 0, sum2 = 0;
        long    sum3 = 0, sum4 = 0;
        for (i = 0; i < 128; i++) {
            sum1 += foo[row][i] * bar[i][column];
            sum2 += foo[row][i] * bar[i][column+1];
            sum3 += foo[row][i] * bar[i][column+2];
            sum4 += foo[row][i] * bar[i][column+3];
        }
        result[row][column] = sum1;
        result[row][column+1] = sum2;
        result[row][column+2] = sum3;
        result[row][column+3] = sum4;
    }

This expanded loop calculates a block of four cells in each iteration. This executes faster because elements of bar are read from the cache and don't always cause cache misses as in the earlier example. Notice that in the expanded inner loop, a cache line of the bar matrix will be loaded the first time that it's referenced; then the following three references to bar will occur without stalling. Using the bar elements while they're still in the cache gives us a significant improvement.
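If you want to try the two versions side by side, here's one way they might be packaged as complete functions. The function names and the dimension macros are my own; the loops themselves follow the two listings above. Because the expanded version steps four columns at a time, it relies on the column count being a multiple of four.

```c
#include <assert.h>

#define ROWS  64
#define INNER 128
#define COLS  64

/* Straightforward version: one result cell per pass of the inner loop. */
void multiply_simple(long result[ROWS][COLS],
                     long foo[ROWS][INNER], long bar[INNER][COLS])
{
    int row, column, i;

    for (row = 0; row < ROWS; row++)
        for (column = 0; column < COLS; column++) {
            long sum = 0;
            for (i = 0; i < INNER; i++)
                sum += foo[row][i] * bar[i][column];
            result[row][column] = sum;
        }
}

/* Expanded version: four adjacent result cells per pass, so four
   neighboring elements of each bar row are used together. */
void multiply_expanded(long result[ROWS][COLS],
                       long foo[ROWS][INNER], long bar[INNER][COLS])
{
    int row, column, i;

    assert(COLS % 4 == 0);    /* the column loop steps by four */
    for (row = 0; row < ROWS; row++)
        for (column = 0; column < COLS; column += 4) {
            long sum1 = 0, sum2 = 0;
            long sum3 = 0, sum4 = 0;
            for (i = 0; i < INNER; i++) {
                sum1 += foo[row][i] * bar[i][column];
                sum2 += foo[row][i] * bar[i][column + 1];
                sum3 += foo[row][i] * bar[i][column + 2];
                sum4 += foo[row][i] * bar[i][column + 3];
            }
            result[row][column] = sum1;
            result[row][column + 1] = sum2;
            result[row][column + 2] = sum3;
            result[row][column + 3] = sum4;
        }
}
```

Both functions compute the same result matrix, so you can fill foo and bar with test values, call each one, and compare the outputs before timing them.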

Good compilers can pay attention to your memory accesses and will optimize how you access memory. For example, load and store operations can be reordered by the compiler to occur when the data is most likely to be available. The first time data is accessed it tends to cause a cache line to load, and subsequent accesses to nearby data must also wait for the cache load to complete. The compiler may be able to help by inserting a few instructions between the loads. This way the cache line will be fully loaded when the subsequent accesses are needed.

For more information on loop expansion and instruction reordering, see the Balance of Power column in develop Issue 18.

You can help your compiler by using local variables when you can. These tell the compiler exactly how the data will be used, enabling it to easily reorder the loads and stores for this data.

You should also carefully note memory dereferences, especially double dereferences. Although it may be obvious to you, the compiler often can't tell whether two pointers address the same object in memory. The compiler may be prevented from reordering instructions because it can't tell whether two operations are really dependent on each other, just because they contain dereferences. Here's an example:

paramBlock->size = myStructure->size;
paramBlock->offset = myStructure->offset;

Although it may look obvious that the two are distinct, the compiler usually can't tell whether paramBlock references the same memory as myStructure. In the resulting binary, the compiler will be conservative and won't reorder these operations for best execution. Loading the size and offset fields of myStructure into local variables first will allow the compiler to fully optimize this example.
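A minimal sketch of that rewrite follows. The two structure types and the function name are hypothetical, since the fragment above doesn't define them; only the field names come from the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure types, defined here only for illustration. */
typedef struct { long size; long offset; } MyStructure;
typedef struct { long size; long offset; } ParamBlock;

void fill_param_block(ParamBlock *paramBlock, const MyStructure *myStructure)
{
    long size, offset;

    assert(paramBlock != NULL && myStructure != NULL);

    /* Both loads are written before either store, so no store through
       paramBlock sits between them and the compiler is free to
       schedule the loads together. */
    size = myStructure->size;
    offset = myStructure->offset;

    paramBlock->size = size;
    paramBlock->offset = offset;
}
```

Note that this version is only equivalent to the original when the two pointers really do refer to separate objects. By writing it this way, you're making that promise on the compiler's behalf.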

Your code binary itself can cause the cache to thrash as it loads to be executed. This is very hard to detect and optimize. The basic problem is that your subroutines may map to the same areas of the cache, and frequent calls among them will stall to reload the cache. Some code profilers for RISC workstations have attempted to detect this problem, but for the Macintosh I can't suggest much help. Just changing the link order of your code and then executing profiles may have an effect; some link orders will thrash more than others.

The layout of your data structures can greatly affect your cache usage and your memory usage in general. For example, memory accesses that cross 64-bit memory boundaries take twice as long to process, as this forces two bus transactions. On the PowerPC 601 processor, any misaligned data access within a memory boundary takes the standard amount of time, which (because of typical Macintosh data structures) is a valuable feature of the chip; future PowerPC processors, however, may take longer to access misaligned data. If you can align your data structures, do so now. A good tactic is to keep 64-bit data at the top of your structure, followed by your 32-bit data, and so on to prevent accidental misalignment of elements. Pad the end of the structure to an even 64-bit increment if you will have arrays of structures or will allocate them on the stack. And if certain parts of your structure are accessed much more often than other parts, keep these together so that they stay in the cache, and make sure they're aligned.
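Here's a small sketch of that ordering tactic, using a made-up record. The field names are hypothetical, and the exact offsets depend on your compiler's alignment settings, so the check function uses offsetof and sizeof rather than assuming particular numbers. It also assumes a 64-bit double, a 32-bit int, and a 16-bit short, which holds for common compilers but isn't guaranteed by the language.

```c
#include <assert.h>
#include <stddef.h>

/* Fields in arbitrary order: the compiler may have to insert padding
   before value and id to keep them aligned. */
typedef struct {
    char   flag;
    double value;    /* 64-bit */
    short  count;
    int    id;       /* assumed 32-bit */
} ScatteredRecord;

/* Largest fields first, with explicit padding at the end so that the
   structure occupies a whole number of 64-bit units. */
typedef struct {
    double value;    /* 64-bit data at the top */
    int    id;       /* then 32-bit */
    short  count;    /* then 16-bit */
    char   flag;
    char   pad;      /* rounds the total up to 16 bytes */
} OrderedRecord;

/* Check that each field sits at a multiple of its own size and that
   arrays of OrderedRecord keep every element 64-bit aligned. */
void check_ordered_record(void)
{
    assert(offsetof(OrderedRecord, value) % sizeof(double) == 0);
    assert(offsetof(OrderedRecord, id) % sizeof(int) == 0);
    assert(offsetof(OrderedRecord, count) % sizeof(short) == 0);
    assert(sizeof(OrderedRecord) % sizeof(double) == 0);
}
```

A check like check_ordered_record is cheap to leave in a debug build, and it will catch a later edit to the structure that accidentally breaks the alignment.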

The memory usage of your speed-critical code will greatly affect its performance today, and current problems will just get worse when PowerPC processors go above 100 MHz. Profile your code to find the most critical bottlenecks; then pay close attention to how that code addresses memory. You'll be rewarded with an excellent return on your investment.

DAVE EVANS occasionally uses the combinatorics skills he learned at the Massachusetts Institute of Technology, but more often he's been practicing his combination punches at a Thai kickboxing gym. Designing fast algorithms for Apple's OS Platforms Group is definitely rewarding, but developing a fast left hook really gets him pumped up.

Thanks to Tom Adams, Mike Cappella, Rob Johnston, and Mike Neil for reviewing this column.


All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.