December 95 - Balance Of Power: Advanced Performance Profiling

Dave Evans

There's little that compares to diving headfirst toward the ground at 120 miles per hour. I may have been going even faster when I last went skydiving. Tucking my arms in tightly, with my head back and legs even, I heard a deafening roar from the wind as I sped toward terminal velocity. "Terminal" would have been a good word for the situation if it weren't for the advances that have been made in parachute technology.

Parachutes have come a long way since their debut, when they were billowy round disks of silk sewn with simple cords stretching to a harness. They were greatly improved when the square parachute was invented thirty years ago. The square parachutes look like an airplane's wing, and they create lift in much the same way. Until recently, however, square parachutes weren't improved upon much. Perhaps their superiority over round parachutes left everyone satiated. That lack of progress was unfortunate; if recent improvements -- like many-celled parachutes and automatic activation devices -- had been pursued many years ago, skydiving would be even safer today.

The moral from this is to question satisfaction, and that will be our mantra for this column. In particular, I want you to question the performance gains you've seen by moving to native PowerPC code. In this column we'll look at improved tools for examining PowerPC code performance, and you'll see how such questioning can really enlighten you.

ILLUSIONS

The PowerPC processors can issue multiple instructions at once. You therefore may think they'll tear through your code, executing many instructions per cycle. While this is sometimes true, a number of hurdles keep the PowerPC processors from completing even one instruction per cycle. These hurdles include instruction cache misses, data cache misses, and processor pipeline stalls.

What may surprise you is how often the processor sits idle because of these hurdles. I did some tests and found that while opening new windows in one popular application, a Power Macintosh 8500's processor completed an average of only one instruction for every two cycles. This is not very efficient, considering its PowerPC 604 processor can complete up to four instructions per cycle.

Much of that inefficiency comes from instruction and data cache misses. As PowerPC processors reach faster clock rates, these cache misses will have an increasing impact. By minimizing cache misses, we can realize a significant performance improvement.

Simply recompiling your 680x0 code as native PowerPC code typically doesn't produce efficient code. Many designs and data structures that suit the 680x0 architecture work very poorly when ported to PowerPC. When you go native, examine your code carefully: tuning for a cached RISC architecture is very different from tuning for the 680x0 family. Here are some important things to consider:

  • Redesign your data structures. Use long word-sized (32-bit) elements, keep commonly used elements together, and align everything on double long word (8-byte) boundaries.

  • Keep results in local variables instead of recomputing them or calling subroutines to retrieve global variables.

BETTER PROFILING

Until recently you couldn't measure cache misses unless you had a logic analyzer or other expensive hardware. The PowerPC 604 processor, however, includes an extremely useful performance measurement feature: two special registers (plus a register to control them) that can count most events that occur in the processor. Each of these registers can count about 20 events, and there are five basic events that both registers can count.

Here are just a few examples of what you can count with these registers: integer instructions that have completed; mispredicted branch instructions; data cache misses; and floating-point instructions that have been issued.

To use the performance profiling that the PowerPC 604 processor provides, you'll need one of the newer Macintosh models that include this processor, such as the Power Macintosh 9500 or 8500. Such a machine costs less than a logic analyzer, yet it lets you get detailed performance profiles.

Although these registers will show your software's performance only on a 604-based Power Macintosh, your software's cache usage and efficiency should be similar on other PowerPC processors. Use the 604's special abilities to profile your code and you'll benefit on all Power Macintosh models.

For more accurate performance measurements, you may want to use the DR Emulator control panel, which is provided on this issue's CD. With this control panel you can turn off the dynamic recompilation feature of the new emulator; this feature, which is described in the Balance of Power column in Issue 23, can affect the performance of your tests over time.

    Also provided on the CD is the POWER Emulator control panel. This control panel lets you turn off the Mac OS support for RS/6000 POWER instructions and thus check for these instructions in your code (with that support turned off, they'll cause a crash).

THE 4PM PERFORMANCE TOOL

To use the new 604 performance registers, you don't need to program in PowerPC assembly language. On this issue's CD we've included a prototype application called 4PM. This tool, which was developed by engineer Tom Adams in Apple's Performance Evaluation Group, uses the PowerPC 604-specific registers to provide various types of performance data.

4PM is very simple to use. It presents three key menus: Control, Config, and Tests, as shown in Figure 1. You use these menus to select the type of performance measurement and an application you'd like to run the tests on. The application you're testing is launched by 4PM, and you can gather data either continuously or, using a "hot key," exactly when you want.


Figure 1. 4PM menus

Once a test completes, 4PM fills a window with the results -- a tabular summary with a different test run on each line. The Save command in the File menu will write the results to a file of type 'TEXT'.

The Control menu. Use the Launch command in this menu to select an application and run it, gathering the test data specified with the Config or Tests menu. The default configuration will measure cycles and instructions completed between when the application launches and when it quits. The Launch Again command simply relaunches the last application you tested.

Check Use Hotkey if you'd like to control exactly when data is gathered. With this option, you start and stop collecting data by holding down the Command key while pressing the Power key. (This is the same key combination that normally drops you into MacsBug, which you won't be able to do during the tests.)

The Repeats command is just a shortcut that's handy if you're repeating a test multiple times. If you specify a repeat value with this command, your test application will be relaunched that many times after you quit it.

The Intervals command allows you to collect data points at regular intervals; a dialog box offers the choices 10 milliseconds, 100 milliseconds, 1 second, or Other. Normally just a total is collected, but by specifying an interval time you'll instead receive a spreadsheet of timings. This will show what your code's performance was as the test progressed.

The Config menu. The commands in the Config menu allow you to tailor the test data by specifying exactly which events each register will count. The Count Select command lets you specify the machine states to collect data in; set this to "User Only" since you'll be tuning application code.

The Tests menu. The commands in the Tests menu generate typical reports. Use the Calibrate command to count the five basic events that are common to both 604 performance registers, including cycles and instructions completed; with this test selected, the Launch command runs your application five times, counting a different one of these events on each run. You can use one of the remaining tests to collect more specific measurements. The caches, load/store, execution units, and special instructions tests each generate a report for the corresponding aspect of 604 performance. The Describe command displays a window describing which events are counted in the selected test. Use the New command to create your own tests. These new tests are saved automatically; you can use the Delete command to remove any that you've added.

ASSEMBLY USAGE

If you want finer-grained results, read and write the 604 performance registers directly. This requires writing in PowerPC assembly language, but it gives you complete control over what data you collect for your time-critical code.

You'll be accessing three new special-purpose registers: MMCR0, PMC1, and PMC2. MMCR0 controls which events are recorded and exactly when recording happens. The performance monitor counter registers, PMC1 and PMC2, hold the counts you'll read as results. I'll give a brief summary of how to use these registers, but you'll need to read Chapter 9 of the PowerPC 604 RISC Microprocessor User's Manual for details.

MMCR0 is a 32-bit register that specifies all the options for performance measurement. Most of these options aren't important for application profiling, so at first you should leave the high 19 bits of MMCR0 (bits 0 through 18; PowerPC numbers bits from 0 at the most significant end) set to 0. The low 13 bits, however, specify which events you want counted in PMC1 and PMC2: bits 19 through 25 select the PMC1 event, and bits 26 through 31 select the PMC2 event. See Chapter 9 of the 604 user's manual to learn which specific bits to set.

Here's an example of how to measure data cache misses per instruction:

.eq PMC1_InstructionsCompleted   2 << 6
.eq PMC2_DataCacheMisses         6
.eq MMCR0_StopAllRecording      $80000000

   lis        r0, MMCR0_StopAllRecording >> 16  ; r0 = $80000000
   mtspr      MMCR0, r0     ; stop all recording
   li         r0, 0
   mtspr      PMC1, r0      ; zero PMC1
   mtspr      PMC2, r0      ; zero PMC2
   li         r0, PMC1_InstructionsCompleted + PMC2_DataCacheMisses
   mtspr      MMCR0, r0     ; start recording

Notice that we first load MMCR0 with only its most significant bit set, which turns off all recording. (We use lis rather than li because li accepts only a 16-bit immediate; lis places $8000 in the upper half of r0.) Stopping recording freezes PMC1 and PMC2 at their current values, so we can zero them before recording starts. When you're done measuring, follow with this code:

   lis        r0, MMCR0_StopAllRecording >> 16  ; r0 = $80000000
   mtspr      MMCR0, r0     ; stop all recording
   mfspr      r3, PMC1      ; r3 = instructions completed
   mfspr      r4, PMC2      ; r4 = data cache misses

Notice again that we turn off recording before reading the results. Otherwise the very act of reading the registers would affect the results: the mtspr and mfspr instructions take multiple cycles to complete, so they would slightly slow the code being measured.

Don't record over very long periods of time, because the PMC1 and PMC2 registers can overflow. To measure over long periods, periodically read the registers, add their values to 64-bit totals in memory, and clear the registers to prevent overflow.

Don't ship any products that rely on these performance registers. They're supported only in the current 604 processor, and they're not part of the PowerPC architecture specification.

COMPLACENCY

The moral is the same as for my tale of the square parachutes: question satisfaction. Don't become complacent about the performance of your new native PowerPC applications. The profiling tools described here should help you more accurately measure and identify bottlenecks in your PowerPC code. Use that information to tune -- especially paying attention to memory usage -- and you'll be surprised how much faster your product will run. Macintosh users consistently hunger for faster computers and more responsive software; spend some serious time tuning, and they'll thank you for it.

DAVE EVANS likes to go skydiving when he can get away from his job gluing together the Mac OS software at Apple. He has gone a few times now, but he'll always cherish the memory of his first jump. Friends on the ground that day claim to have clearly heard his scream, although he was nearly a mile above them when he left the plane. On his second leap, if he hadn't opened the chute while upside down and then watched it deploy through his legs, he might have noticed more of the surrounding countryside.

Thanks to Tom Adams, Geoff Chatterton, Mike Crawford, and Dave Lyons for reviewing this column.

 
