TweetFollow Us on Twitter

September 94 - BALANCE OF POWER

BALANCE OF POWER

Tuning PowerPCMemory Usage

DAVE EVANS

[IMAGE 017-019_Balance_of_Pwr_h1.GIF]

If you care about the performance of code you write for the Power Macintosh, memory usage should be your foremost concern. With the PowerPCTM601 processor today, and even more important with future processors, memory usage of your code will have the greatest effect on its performance. Poorly written code will execute at a fraction of its potential, and often very simple changes will greatly improve the execution speed of your critical code.

Processors are improving much faster than the memory subsystems that support them. As the PowerPC chips move from 80 MHz to 100 MHz and beyond, their thirst for data to process and instructions to execute will increasingly tax memory. Memory caches attempt to mitigate that thirst, and all PowerPC processors come equipped with built-in caches. But your code can work well with a cache or it can work very poorly with a cache. I'll show you why and discuss what you can do to optimize your memory usage.

CACHE BASICS
As you know, a cache is simply very fast RAM that the processor can access quickly and that it uses to store recently referenced data and code. On the PowerPC processors, any data stored in the cache can be accessed without stalling the processor's pipeline. Accesses to data not in the cache will take about 20 times as long reading from main memory, or even 1 million times as long if the access causes a page fault with virtual memory. Getting and keeping your performance-critical code and data in the cache are therefore key to your execution speed.

A cache is divided into small blocks calledcache lines . On the PowerPC 601, for example, the cache has 1024 cache line blocks, each holding 32 bytes. In addition, the 601 will fetch two blocks when it can, making the cache line size effectively 64 bytes.

The first PowerPC processors have set associative caches of different sizes. The 601 has an eight-way set associative, unified cache that's 32K in size, and the 603 has a two-way set associative, split cache with 8K for data and 8K for instruction code. The termset associative refers to the way the cache relates to main memory, which is important to your performance. In some simple caching schemes, each cache line maps directly to specific areas of main memory; any access to one of these areas loads bytes into that cache line. But on the PowerPC processor, sets of cache lines are combined and then mapped to memory. There are eight cache lines in each set on the 601, and two in each set on the 603. An access to one of the areas mapped to a set will load bytes into the last-used cache line of the set, keeping the most frequently used cache lines from being purged. This more complicated scheme typically yields much better performance than the directly mapped cache.

CACHE THRASHING
The cache will most affect your performance when you're accessing large amounts of data. A typical example of this is walking through arrays to perform some operation. The best strategy is to minimize cache collisions during your accesses, and the best tactic for this is to access your data as sequentially as possible. If you walk through memory sequentially, you'll load the cache every 64 bytes, but all 64 bytes will be available for fast processing. Here's an example:

unsigned longdata[64][1024];
for (row = 2; row < 64; row++)
    for (column = 0; column < 1024; column++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

This example performs additions on each element of a large matrix and accesses that matrix sequentially in memory. It walks across each row, adding elements and storing the result. But just inverting the loops can significantly change the way memory is accessed:

unsigned longdata[64][1024];
for (column = 0; column < 1024; column++)
    for (row = 2; row < 64; row++)
        data[row][column] =
            data[row-1][column] + data[row-2][column];

Reversing the loops leads to less than optimal performance since we perform each addition for all the columns before moving to the next element of a row. Instead of sequential access, this access pattern jumps across memory in even steps of 4K. Unfortunately, on the PowerPC processor these accesses map to the same set of cache lines, and every operation causes the cache to reload from main memory. This second example takes twice as long to execute the same calculation on a Power Macintosh 6100/60.

By paying attention to how your code accesses memory, you can avoid serious cache thrashing like that done by the second example. Things to look out for are loops that iterate for a power of 2 steps (128, 256, and so on)and code whose memory accesses are not close together.

An approach called blocking may help your loops. Often your code isn't as simple as above, and your memory accesses aren't regular during the loop. If you're walking two different arrays with different increments through memory, it may be impossible to serialize your accesses. Blocking performs the calculations in blocks of rows and columns. Instead of iterating across all the columns and then proceeding to the next row, you divide the dimensional space into blocks and calculate one whole block at a time. In this next example, we calculate the multiplication of two matrices.

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column++) {
        long    sum = 0;
        for (i = 0; i < 128; i++)
            sum += foo[row][i] * bar[i][column];
        result[row][column] = sum;
    }

As this algorithm walks through memory, it accesses result and foo sequentially, but bar is accessed in 256-byte steps. Accessing bar by jumping through memory causes cache misses, and sequential elements of bar are flushed from the cache before they're needed.

By performing this operation in small blocks, we can better use the cache. The key is to use all the elements of foo and bar that are in a cache line before moving on. One way to do this is to expand the loop and perform four operations in a single iteration:

long result[64][64], foo[64][128], bar[128][64];
for (row = 0; row < 64; row++)
    for (column = 0; column < 64; column += 4) {
        long    sum1 = 0, sum2 = 0;
        long    sum3 = 0, sum4 = 0;
        for (i = 0; i < 128; i++) {
            sum1 += foo[row][i] * bar[i][column];
            sum2 += foo[row][i] * bar[i][column+1];
            sum3 += foo[row][i] * bar[i][column+2];
            sum4 += foo[row][i] * bar[i][column+3];
    }
    result[row][column] = sum1;
    result[row][column+1] = sum2;
    result[row][column+2] = sum3;
    result[row][column+3] = sum4;
    }

This expanded loop calculates a block of four cells in each iteration. This executes faster because elements of bar are read from the cache and don't always cause cache misses as in the earlier example. Notice that in the expanded inner loop, a cache line of the bar matrix will be loaded the first time that it's referenced; then the following three references to bar will occur without stalling. Using the bar elements while they're still in the cache gives us a significant improvement.

CODE STYLE
Good compilers can pay attention to your memory accesses and will optimize how you access memory. For example, load and store operations can be reordered by the compiler to occur when the data is most likely to be available. The first time data is accessed it tends to cause a cache line to load, and subsequent accesses to nearby data must also wait for the cache load to complete. The compiler may be able to help by inserting a few instructions between the loads. This way the cache line will be fully loaded when the subsequent accesses are needed.

For more information on loop expansion and instruction reordering, see the Balance of Power column in develop Issue 18.*

You can help your compiler by using local variables when you can. These tell the compiler exactly how the data will be used, enabling it to easily reorder the loads and stores for this data.

You should also carefully note memory dereferences, especially double dereferences. Although it may be obvious to you, the compiler often can't tell whether two pointers address the same object in memory. The compiler may be prevented from reordering instructions because it can't tell whether two operations are really dependent on each other, just because they contain dereferences. Here's an example:

paramBlock->size = myStructure->size;
paramBlock->offset = myStructure->offset;

Although it appears obvious, the compiler usually can't tell if paramBlock references the same memory as myStructure. In the resulting binary, the compiler will be conservative and not reorder these operations for best execution. Replacing the dereference of myStructure with local variables for size and offset will allow the compiler to fully optimize this example.

INSTRUCTION THRASHING
Your code binary itself can cause the cache to thrash as it loads to be executed. This is very hard to detect and optimize. The basic problem is that your subroutines may map to the same areas of the cache, and frequent calls among them will stall to reload the cache. Some code profilers for RISC workstations have attempted to detect this problem, but for the Macintosh I can't suggest much help. Just changing the link order of your code and then executing profiles may have an effect; some link orders will thrash more than others.

DATA STRUCTURES
The layout of your data structures can greatly affect your cache usage and your memory usage in general. For example, memory accesses that cross 64-bit memory boundaries take twice as long to process, as this forces two bus transactions. On the PowerPC 601 processor, any misaligned data access within a memory boundary takes the standard amount of time, which (because of typical Macintosh data structures) is a valuable feature of the chip; future PowerPC processors, however, may take longer to access misaligned data. If you can align your data structures, do so now. A good tactic is to keep 64-bit data at the top of your structure, followed by your 32-bit data, and so on to prevent accidental misalignment of elements. Pad the end of the structure to an even 64-bit increment if you will have arrays of structures or will allocate them on the stack. And if certain parts of your structure are accessed much more often than other parts, keep these together so that they stay in the cache, and make sure they're aligned.

DON'T FORGET
The memory usage of your speed-critical code will greatly affect its performance today, and current problems will just get worse when PowerPC processors go above 100 MHz. Profile your code to find the most critical bottlenecks; then pay close attention to how that code addresses memory. You'll be rewarded with an excellent return on your investment.

DAVE EVANS occasionally uses the combinatorics skills he learned at the Massachusetts Institute of Technology, but more often he's been practicing his combination punches at a Thai kickboxing gym. Designing fast algorithms for Apple's OS Platforms Group is definitely rewarding, but developing a fast left hook really gets him pumped up. *

Thanks to Tom Adams, Mike Cappella, Rob Johnston, and Mike Neil for reviewing this column. *

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Pinegrow Web Designer 2.94 - Mockup and...
Pinegrow Web Designer is desktop app that lets you mockup and design webpages faster with multi-page editing, CSS and LESS styling, and smart components for Bootstrap, Foundation, Angular JS, and... Read more
ExpanDrive 5.4.1 - Access cloud storage...
ExpanDrive builds cloud storage in every application, acts just like a USB drive plugged into your Mac. With ExpanDrive, you can securely access any remote file server directly from the Finder or... Read more
MacUpdate Desktop 6.1.3 - Search and ins...
MacUpdate Desktop 6 brings seamless 1-click app installs and version updates to your Mac. With a free MacUpdate account and MacUpdate Desktop 6, Mac users can now install almost any Mac app on... Read more
VOX 2.8.6 - Music player that supports m...
VOX just sounds better! The beauty is in its simplicity, yet behind the minimal exterior lies a powerful music player with a ton of features and support for all audio formats you should ever need.... Read more
Espionage 3.6.6 - Simple, state-of-the-a...
Espionage offers state-of-the-art encryption and plausible deniability for your confidential data. Sometimes, encrypting your data isn't enough to protect it. That's why Espionage 3 goes beyond data... Read more
VOX 2.8.6 - Music player that supports m...
VOX just sounds better! The beauty is in its simplicity, yet behind the minimal exterior lies a powerful music player with a ton of features and support for all audio formats you should ever need.... Read more
MacUpdate Desktop 6.1.3 - Search and ins...
MacUpdate Desktop 6 brings seamless 1-click app installs and version updates to your Mac. With a free MacUpdate account and MacUpdate Desktop 6, Mac users can now install almost any Mac app on... Read more
ExpanDrive 5.4.1 - Access cloud storage...
ExpanDrive builds cloud storage in every application, acts just like a USB drive plugged into your Mac. With ExpanDrive, you can securely access any remote file server directly from the Finder or... Read more
Espionage 3.6.6 - Simple, state-of-the-a...
Espionage offers state-of-the-art encryption and plausible deniability for your confidential data. Sometimes, encrypting your data isn't enough to protect it. That's why Espionage 3 goes beyond data... Read more
Pinegrow Web Designer 2.94 - Mockup and...
Pinegrow Web Designer is desktop app that lets you mockup and design webpages faster with multi-page editing, CSS and LESS styling, and smart components for Bootstrap, Foundation, Angular JS, and... Read more

Infinite Tanks (Games)
Infinite Tanks 1.0.0 Device: iOS Universal Category: Games Price: $4.99, Version: 1.0.0 (iTunes) Description: | Read more »
Agatha Christie - The ABC Murders (FULL)...
Agatha Christie - The ABC Murders (FULL) 1.0 Device: iOS Universal Category: Games Price: $6.99, Version: 1.0 (iTunes) Description: Agatha Christie: The ABC Murders Your weapon is your knowledge. Your wits will be put to the ultimate... | Read more »
HeadlessD (Games)
HeadlessD 1.0 Device: iOS Universal Category: Games Price: $.99, Version: 1.0 (iTunes) Description: HeadlessD is hand-painted dungeon crawler with intuitive touch controls and NO in-app purchases. | Read more »
Leaf for Twitter (Social Networking)
Leaf for Twitter 1.0.1 Device: iOS iPhone Category: Social Networking Price: $4.99, Version: 1.0.1 (iTunes) Description: | Read more »
Banner Saga 2 (Games)
Banner Saga 2 1.0 Device: iOS Universal Category: Games Price: $4.99, Version: 1.0 (iTunes) Description: The epic award winning story-based role-playing game continues its emotional journey across a breaking world. Lead your Viking... | Read more »
Concrete Jungle (Games)
Concrete Jungle 1.16 Device: iOS Universal Category: Games Price: $4.99, Version: 1.16 (iTunes) Description: A follow up to the puzzle hit 'MegaCity'! Concrete Jungle is a new take on the city building genre that swaps micro-... | Read more »
5 great apps for the budget traveller
Travelling abroad, or even within your home country, has never been easier thanks to our handy smartphone companions. There are hundreds of apps on the market that promise to make your world journeys hassle-free, but we've selected five of the... | Read more »
Zip—Zap (Games)
Zip—Zap 1.01 Device: iOS Universal Category: Games Price: $1.99, Version: 1.01 (iTunes) Description: Touch to contract.Release to let go.Bring the clumsy mechanical beings home. · · · over 100 levelsno adsno in-app-purchases Zip—... | Read more »
Paperback: The Game (Games)
Paperback: The Game 1.0 Device: iOS Universal Category: Games Price: $3.99, Version: 1.0 (iTunes) Description: You are an author trying to finish kitschy paperback novels. Complete Westerns, Science Fiction, Romance or even a Crime... | Read more »
How to Rule With a Firm Hand in My Majes...
My Majesty is a kingdom management sim not unlike August’s magisterial hit, Reigns. It’s essentially a reskin of developer Tigrido’s previous management sim, Dictator. As supreme ruler of the land, you must consult with a number of subjects to... | Read more »

Price Scanner via MacPrices.net

Apple refurbished Mac minis available startin...
Apple has Certified Refurbished Mac minis available starting at $419. Apple’s one-year warranty is included with each mini, and shipping is free: - 1.4GHz Mac mini: $419 $80 off MSRP - 2.6GHz Mac... Read more
13-inch 2.5GHz MacBook Pro available for $928...
Overstock has the 13″ 2.5GHz MacBook Pro available for $927.99 including free shipping. Their price is $171 off MSRP. Read more
Buying McLaren Would Give Apple Instant Car C...
Apple “iCar” rumors have waxed and waned over the years, piquing interest and speculation as to whether Apple is seriously interested in getting into the automotobile business, either in a joint... Read more
Aetna to Transform Members’ Consumer Health E...
Health care benefits company Aetna, which has an estimated 46.3 million clients, today announced a new initiative to revolutionize members consumer health experience by combining the power of iOS... Read more
USB-IF Announces USB Audio Device Class 3.0 S...
USB Implementers Forum (USB-IF), the support organization for the advancement and adoption of USB technology, today announced the USB Audio Device Class 3.0 specification to establish USB Audio over... Read more
Clearance 12-inch 1.2GHz Retina MacBooks, App...
Apple has Certified Refurbished 2015 12″ 1.2GHz Retina MacBooks available for $1189, or $410 off original MSRP. Apple will include a standard one-year warranty with each MacBook, and shipping is free... Read more
Logitech SmartDock and Skype For Business Com...
Logitech has announced Logitech SmartDock, an AV meeting room solution designed in collaboration with Microsoft. Logitech SmartDock works with Skype for Business and qualified devices, including... Read more
27-inch iMacs on sale for up to $220 off MSRP
B&H Photo has 27″ Apple iMacs on sale for up to $200 off MSRP including free shipping plus NY sales tax only: - 27″ 3.3GHz iMac 5K: $2099 $200 off MSRP - 27″ 3.2GHz/1TB Fusion iMac 5K: $1899.99 $... Read more
Apple Macs and iPads available for up to $300...
Purchase a new Mac or iPad using Apple’s Education Store and take up to $300 off MSRP. All teachers, students, and staff of any educational institution qualify for the discount. Shipping is free, and... Read more
Save up to $600 with Apple refurbished Mac Pr...
Apple has Certified Refurbished Mac Pros available for up to $600 off the cost of new models. An Apple one-year warranty is included with each Mac Pro, and shipping is free. The following... Read more

Jobs Board

*Apple* Retail - Multiple Positions- Chicago...
Job Description: Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
*Apple* Retail - Multiple Positions- Raleigh...
Job Description:SalesSpecialist - Retail Customer Service and SalesTransform Apple Store visitors into loyal Apple customers. When customers enter the store, Read more
User Support Specialist *Apple* Product Spe...
…Description:Ciber, Inc. is seeking a User Support Specialist - Apple Product Support in Nashville, TN!Responsibilities:Support, implementation, and upgrade of Read more
Restaurant Manager (Neighborhood Captain) - A...
…in every aspect of daily operation. WHY YOU'LL LIKE IT: You'll be the Big Apple . You'll solve problems. You'll get to show your ability to handle the stress and Read more
US- *Apple* Store Leader Program - Apple (Un...
…Summary Learn and grow as you explore the art of leadership at the Apple Store. You'll master our retail business inside and out through training, hands-on Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.