TweetFollow Us on Twitter

September 94 - BALANCE OF POWER

BALANCE OF POWER

Tuning PowerPCMemory Usage

DAVE EVANS

[IMAGE 017-019_Balance_of_Pwr_h1.GIF]

If you care about the performance of code you write for the Power Macintosh, memory usage should be your foremost concern. With the PowerPCTM601 processor today, and even more important with future processors, memory usage of your code will have the greatest effect on its performance. Poorly written code will execute at a fraction of its potential, and often very simple changes will greatly improve the execution speed of your critical code.

Processors are improving much faster than the memory subsystems that support them. As the PowerPC chips move from 80 MHz to 100 MHz and beyond, their thirst for data to process and instructions to execute will increasingly tax memory. Memory caches attempt to mitigate that thirst, and all PowerPC processors come equipped with built-in caches. But your code can work well with a cache or it can work very poorly with a cache. I'll show you why and discuss what you can do to optimize your memory usage.

CACHE BASICS
As you know, a cache is simply very fast RAM that the processor can access quickly and that it uses to store recently referenced data and code. On the PowerPC processors, any data stored in the cache can be accessed without stalling the processor's pipeline. Accesses to data not in the cache will take about 20 times as long reading from main memory, or even 1 million times as long if the access causes a page fault with virtual memory. Getting and keeping your performance-critical code and data in the cache are therefore key to your execution speed.

A cache is divided into small blocks calledcache lines . On the PowerPC 601, for example, the cache has 1024 cache line blocks, each holding 32 bytes. In addition, the 601 will fetch two blocks when it can, making the cache line size effectively 64 bytes.

The first PowerPC processors have set associative caches of different sizes. The 601 has an eight-way set associative, unified cache that's 32K in size, and the 603 has a two-way set associative, split cache with 8K for data and 8K for instruction code. The termset associative refers to the way the cache relates to main memory, which is important to your performance. In some simple caching schemes, each cache line maps directly to specific areas of main memory; any access to one of these areas loads bytes into that cache line. But on the PowerPC processor, sets of cache lines are combined and then mapped to memory. There are eight cache lines in each set on the 601, and two in each set on the 603. An access to one of the areas mapped to a set will load bytes into the last-used cache line of the set, keeping the most frequently used cache lines from being purged. This more complicated scheme typically yields much better performance than the directly mapped cache.

CACHE THRASHING
The cache will most affect your performance when you're accessing large amounts of data. A typical example of this is walking through arrays to perform some operation. The best strategy is to minimize cache collisions during your accesses, and the best tactic for this is to access your data as sequentially as possible. If you walk through memory sequentially, you'll load the cache every 64 bytes, but all 64 bytes will be available for fast processing. Here's an example:

unsigned longdata[64][1024];
for (row = 2; row < 64; row++)
    for (column = 0; column < 1024; column++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

This example performs additions on each element of a large matrix and accesses that matrix sequentially in memory. It walks across each row, adding elements and storing the result. But just inverting the loops can significantly change the way memory is accessed:

unsigned longdata[64][1024];
for (column = 0; column < 1024; column++)
    for (row = 2; row < 64; row++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

Reversing the loops leads to less than optimal performance since we perform each addition for all the columns before moving to the next element of a row. Instead of sequential access, this access pattern jumps across memory in even steps of 4K. Unfortunately, on the PowerPC processor these accesses map to the same set of cache lines, and every operation causes the cache to reload from main memory. This second example takes twice as long to execute the same calculation on a Power Macintosh 6100/60.

By paying attention to how your code accesses memory, you can avoid serious cache thrashing like that done by the second example. Things to look out for are loops that iterate for a power of 2 steps (128, 256, and so on)and code whose memory accesses are not close together.

An approach called blocking may help your loops. Often your code isn't as simple as above, and your memory accesses aren't regular during the loop. If you're walking two different arrays with different increments through memory, it may be impossible to serialize your accesses. Blocking performs the calculations in blocks of rows and columns. Instead of iterating across all the columns and then proceeding to the next row, you divide the dimensional space into blocks and calculate one whole block at a time. In this next example, we calculate the multiplication of two matrices.

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column++) {
        long    sum = 0;
        for (i = 0; i < 128; i++)
            sum += foo[row][i] * bar[i][column];
        result[row][column] = sum;
    }

As this algorithm walks through memory, it accesses result and foo sequentially, but bar is accessed in 256-byte steps. Accessing bar by jumping through memory causes cache misses, and sequential elements of bar are flushed from the cache before they're needed.

By performing this operation in small blocks, we can better use the cache. The key is to use all the elements of foo and bar that are in a cache line before moving on. One way to do this is to expand the loop and perform four operations in a single iteration:

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column += 4) {
        long    sum1 = 0, sum2 = 0;
        long    sum3 = 0, sum4 = 0;
        for (i = 0; i < 128; i++) {
            sum1 += foo[row][i] * bar[i][column];
            sum2 += foo[row][i] * bar[i][column+1];
            sum3 += foo[row][i] * bar[i][column+2];
            sum4 += foo[row][i] * bar[i][column+3];
    }
    result[row][column] = sum1;
    result[row][column+1] = sum2;
    result[row][column+2] = sum3;
    result[row][column+3] = sum4;
    }

This expanded loop calculates a block of four cells in each iteration. This executes faster because elements of bar are read from the cache and don't always cause cache misses as in the earlier example. Notice that in the expanded inner loop, a cache line of the bar matrix will be loaded the first time that it's referenced; then the following three references to bar will occur without stalling. Using the bar elements while they're still in the cache gives us a significant improvement.

CODE STYLE
Good compilers can pay attention to your memory accesses and will optimize how you access memory. For example, load and store operations can be reordered by the compiler to occur when the data is most likely to be available. The first time data is accessed it tends to cause a cache line to load, and subsequent accesses to nearby data must also wait for the cache load to complete. The compiler may be able to help by inserting a few instructions between the loads. This way the cache line will be fully loaded when the subsequent accesses are needed.

For more information on loop expansion and instruction reordering, see the Balance of Power column in develop Issue 18.*

You can help your compiler by using local variables when you can. These tell the compiler exactly how the data will be used, enabling it to easily reorder the loads and stores for this data.

You should also carefully note memory dereferences, especially double dereferences. Although it may be obvious to you, the compiler often can't tell whether two pointers address the same object in memory. The compiler may be prevented from reordering instructions because it can't tell whether two operations are really dependent on each other, just because they contain dereferences. Here's an example:

paramBlock->size = myStructure->size;
paramBlock->offset = myStructure->offset;

Although it appears obvious, the compiler usually can't tell if paramBlock references the same memory as myStructure. In the resulting binary, the compiler will be conservative and not reorder these operations for best execution. Replacing the dereference of myStructure with local variables for size and offset will allow the compiler to fully optimize this example.

INSTRUCTION THRASHING
Your code binary itself can cause the cache to thrash as it loads to be executed. This is very hard to detect and optimize. The basic problem is that your subroutines may map to the same areas of the cache, and frequent calls among them will stall to reload the cache. Some code profilers for RISC workstations have attempted to detect this problem, but for the Macintosh I can't suggest much help. Just changing the link order of your code and then executing profiles may have an effect; some link orders will thrash more than others.

DATA STRUCTURES
The layout of your data structures can greatly affect your cache usage and your memory usage in general. For example, memory accesses that cross 64-bit memory boundaries take twice as long to process, as this forces two bus transactions. On the PowerPC 601 processor, any misaligned data access within a memory boundary takes the standard amount of time, which (because of typical Macintosh data structures) is a valuable feature of the chip; future PowerPC processors, however, may take longer to access misaligned data. If you can align your data structures, do so now. A good tactic is to keep 64-bit data at the top of your structure, followed by your 32-bit data, and so on to prevent accidental misalignment of elements. Pad the end of the structure to an even 64-bit increment if you will have arrays of structures or will allocate them on the stack. And if certain parts of your structure are accessed much more often than other parts, keep these together so that they stay in the cache, and make sure they're aligned.

DON'T FORGET
The memory usage of your speed-critical code will greatly affect its performance today, and current problems will just get worse when PowerPC processors go above 100 MHz. Profile your code to find the most critical bottlenecks; then pay close attention to how that code addresses memory. You'll be rewarded with an excellent return on your investment.

DAVE EVANS occasionally uses the combinatorics skills he learned at the Massachusetts Institute of Technology, but more often he's been practicing his combination punches at a Thai kickboxing gym. Designing fast algorithms for Apple's OS Platforms Group is definitely rewarding, but developing a fast left hook really gets him pumped up. *

Thanks to Tom Adams, Mike Cappella, Rob Johnston, and Mike Neil for reviewing this column. *

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

VueScan 9.5.75 - Scanner software with a...
VueScan is a scanning program that works with most high-quality flatbed and film scanners to produce scans that have excellent color fidelity and color balance. VueScan is easy to use, and has... Read more
Opera 44.0.2510.1449 - High-performance...
Opera is a fast and secure browser trusted by millions of users. With the intuitive interface, Speed Dial and visual bookmarks for organizing favorite sites, news feature with fresh, relevant content... Read more
Opera 44.0.2510.1449 - High-performance...
Opera is a fast and secure browser trusted by millions of users. With the intuitive interface, Speed Dial and visual bookmarks for organizing favorite sites, news feature with fresh, relevant content... Read more
Skim 1.4.29 - PDF reader and note-taker...
Skim is a PDF reader and note-taker for OS X. It is designed to help you read and annotate scientific papers in PDF, but is also great for viewing any PDF file. Skim includes many features and has a... Read more
FontExplorer X Pro 6.0.2 - Font manageme...
FontExplorer X Pro is optimized for professional use; it's the solution that gives you the power you need to manage all your fonts. Now you can more easily manage, activate and organize your... Read more
1Password 6.7.1 - Powerful password mana...
1Password is a password manager that uniquely brings you both security and convenience. It is the only program that provides anti-phishing protection and goes beyond password management by adding Web... Read more
Vivaldi 1.9.818.44 - An advanced browser...
Vivaldi is a browser for our friends. In 1994, two programmers started working on a web browser. Our idea was to make a really fast browser, capable of running on limited hardware, keeping in mind... Read more
Vivaldi 1.9.818.44 - An advanced browser...
Vivaldi is a browser for our friends. In 1994, two programmers started working on a web browser. Our idea was to make a really fast browser, capable of running on limited hardware, keeping in mind... Read more
Skim 1.4.29 - PDF reader and note-taker...
Skim is a PDF reader and note-taker for OS X. It is designed to help you read and annotate scientific papers in PDF, but is also great for viewing any PDF file. Skim includes many features and has a... Read more
1Password 6.7.1 - Powerful password mana...
1Password is a password manager that uniquely brings you both security and convenience. It is the only program that provides anti-phishing protection and goes beyond password management by adding Web... Read more

Latest Forum Discussions

See All

Fire Emblem Heroes event announces new m...
As reported yesterday, Nintendo was gearing up a live press event for their popular mobile game,Fire Emblem Heroes. While the stream revealed a lot of new things, the event was entirely in Japanese. Luckily we have a rundown of what was announced... | Read more »
Best games we played this week
Another week, another slate of new mobile games. Although there weren't as many big name releases as last week, there were plenty of unique video game titles that came out that's sure to keep you interested over the weekend. Everything from classic... | Read more »
Olli by Tinrocket (Photography)
Olli by Tinrocket 1.0 Device: iOS iPhone Category: Photography Price: $2.99, Version: 1.0 (iTunes) Description: Get drawn in with Olli by TinrocketOlli instantly turns your everyday moments into hand-drawn art and animations. • Watch... | Read more »
Penarium (Games)
Penarium 1.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0 (iTunes) Description: | Read more »
Fire Emblem Heroes is way more profitabl...
Profits for Nintendo's mobile game Fire Emblem Heroes are apparently impressive enough to beat out other Nintendo titles likeSuper Mario Run, despite having 10 times fewer downloads. [Read more] | Read more »
Classic series Robot Unicorn Attack 3 no...
The classic Adult Swim browser game, Robot Unicorn Attack, branched off into a series of popular mobile games. Now, the latest entry into the series, Robot Unicorn Attack 3, is available for iOS and Android mobile devices. [Read more] | Read more »
Sudoku Sweeper (Games)
Sudoku Sweeper 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: A minimalist mashup of Minesweeper and Sudoku. Logic puzzle perfection. Every row, column and zone contains a bomb and one of... | Read more »
Under Leaves (Games)
Under Leaves 1.0.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0.0 (iTunes) Description: Journey into the forest, the jungle or the depths of the deep blue sea. Find chestnuts for the pigs, a caterpillar for the... | Read more »
Ninja Pizza Girl (Games)
Ninja Pizza Girl 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: In the not-so-distant future, rampart traffic congestion has resulted in only one way to deliver pizzas across town in thirty... | Read more »
SCRAP (Games)
SCRAP 1.0 Device: iOS Universal Category: Games Price: $2.99, Version: 1.0 (iTunes) Description: That day, for no apparent reason, SCRAP decided to wake up and run. He had to, because his activation was a mistake the "Factory" could... | Read more »

Price Scanner via MacPrices.net

13-inch 2.7GHz Retina MacBook Pro, Apple refu...
Apple has Certified Refurbished 13″ 2.7GHz/128GB Retina MacBook Pros available for $200 off MSRP. An Apple one-year warranty is included with each model, and shipping is free: - 13″ 2.7GHz/128GB... Read more
13-inch Gray 2.9GHz/512GB Touch Bar MacBook P...
Amazon has the 13″ Space Gray 2.9GHz/512GB Touch Bar MacBook Pro (model MNQF2LL/A) in stock today and on sale for $150 off MSRP. Shipping is free: - 13″ 2.9GHz/512GB Touch Bar MacBook Pro Space Gray... Read more
15-inch 2.7GHz Space Gray Touch Bar MacBook P...
B&H Photo has the 15″ 2.7GHz Space Gray Touch Bar MacBook Pro in stock today and on sale for $2599…$200 off MSRP. Shipping is free, and B&H charges NY & NJ sales tax only: - 15″ 2.7GHz... Read more
13-inch 2.9GHz/256GB Space Gray Touch Bar Mac...
B&H Photo has the 13″ 2.9GHz/256GB Space Gray Touch Bar MacBook Pro in stock today and on sale for $150 off MSRP including free shipping plus NY & NJ sales tax only: - 13″ 2.9GHz/256GB Touch... Read more
21-inch iMacs on sale for up to $151 off MSRP
B&H Photo has 21″ iMacs on sale for up to $151 off MSRP, each including free shipping plus NY sales tax only: - 21″ 3.1GHz iMac 4K: $1348 $151 off MSRP - 21″ 2.8GHz iMac: $1199.99 $100 off MSRP... Read more
Weekend deal: Up to $420 off new MacBook Pros...
Apple has Certified Refurbished 2016 15″ and 13″ MacBook Pros available for $230 to $420 off original MSRP. An Apple one-year warranty is included with each model, and shipping is free: - 15″ 2.6GHz... Read more
Price drop: 15-inch 2.2GHz Retina MacBook Pro...
Amazon has dropped their price on 15″ 2.2GHz Retina MacBook Pros (MJLQ2LL/A) to $1709.99 including free shipping. Their price is $290 off MSRP for this model. Note that stock may sell out quickly at... Read more
2.8GHz Mac mini on sale for $899, save $100
B&H Photo has the 2.8GHz Mac mini (model number MGEQ2LL/A) on sale for $899 including free shipping plus NY & NJ sales tax only. Their price is $100 off MSRP. Read more
Check Apple prices on any device with the iTr...
MacPrices is proud to offer readers a free iOS app (iPhones, iPads, & iPod touch) and Android app (Google Play and Amazon App Store) called iTracx, which allows you to glance at today’s lowest... Read more
New System Clock for macOS by B-Eng Now Avail...
Fehraltorf, Switzerland based B-Eng has announced the release and immediate availability of System Clock, the company’s new system monitor and information app developed exclusively for macOS. System... Read more

Jobs Board

Product Manager, *Apple* Platforms - Viacom...
…Product Manager to drive the execution of its iOS and AppleTV experiences. The Apple Platform Product Manager will be a leader in our Agile/Scrum environment and Read more
*Apple* Mobile Master - Best Buy (United Sta...
**493714BR** **Job Title:** Apple Mobile Master **Location Number:** 001024-Weatherford-Store **Job Description:** **What does a Best Buy Apple Mobile Master Read more
*Apple* OS X Server Administrator (Active Se...
** Apple OS X Server Administrator \(Active Secret Clearance\)** **Description** Come be a part of a top notch team, apply today\!\! Tuva TUVA provides turnkey Read more
*Apple* Mac Computer Technician - GeekHampto...
…complex computer issues over the phone and in person? GeekHampton, Long Island's Apple Premium Service Provider, is looking for you! Come work with our crew Read more
Best Buy *Apple* Computing Master - Best Bu...
**501846BR** **Job Title:** Best Buy Apple Computing Master **Location Number:** 001126-South Bay Center-Store **Job Description:** **What does a Best Buy Apple Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.