Taking Advantage of The Intel Core Duo Processor-Based iMac

Volume Number: 22 (2006)
Issue Number: 7
Column Tag: Performance Optimization

How to make your applications run faster

by Ganesh Rao and Ron Wayne Green

Introduction

This is the first of a three-part series that addresses the most effective techniques for optimizing applications for Intel(R) Core(TM) Duo processor-based Macs. Part one introduces the key aspects of the Core Duo processor and exposes the architectural features for which tuning is most important. It then describes at length a data-driven performance methodology that uses the software development tools available on a Mac to highlight tuning and optimization opportunities for a variety of applications. Intel Core Duo processors feature two execution cores, and each core is capable of vector processing of data, referred to as Intel(R) Digital Media Boost, which extends the Single Instruction Multiple Data (SIMD) technology. The second part of this series outlines how to take advantage of SIMD by enabling vectorization in the Intel Compiler. The final part takes optimization to the next level by using both execution cores in addition to SIMD. We will cover auto-parallelization, where simple loops can be rendered parallel. Finally, we will cover OpenMP, a set of powerful user-specified directives embedded in source code that tell the compiler how to thread the application. You will love how easily you can thread applications while at the same time maintaining fine-grained control of threads.

In this article, advanced and innovative software optimization techniques supported by industry-leading compilers are addressed. These optimization techniques are used in the field every day to get better performance. Key topics will be illustrated with C++ and Fortran code snippets.

Intel Core Duo processor

There is a rumor going around that Apple Macs now use an Intel processor, and a very happy Intel processor at that! All humor aside, we know that the MacTech community is gaining a very sophisticated understanding of the details of the Intel Core Duo processor. In this section we call out the processor features that, based on our experience, are most likely to increase the performance of your application. The Intel Core Duo processor includes two execution cores in a single processor (see Figure 1). Each of the execution cores supports Single Instruction Multiple Data (SIMD), which performs multiple computations in parallel with a single instruction. See Figure 2 for a diagrammatic representation of SIMD.



Figure 1: Intel(R) Core(TM) Duo processor architecture



Figure 2: SIMD performs the same operation on multiple data

Applications that are most likely to benefit from SIMD are those that can be characterized as 'loopy'. SIMD is most commonly useful in programs that spend a significant amount of time processing integers and/or floating-point numbers in a loop. An example of this is a matrix-multiply operation. Intel Streaming SIMD Extensions (SSE) and the AIM Alliance AltiVec* instructions are example implementations of SIMD. In part 2 of this 3-part series, we will share our best practices for taking advantage of the SIMD processing capability in your processor.

SIMD extracts the best performance from a single core. Taking this to the next level, it is obvious that one needs to keep both cores busy to get maximal performance from an application. The most effective way of taking advantage of both execution cores is to thread your application. We will share some of our best known methods to thread applications in the third part of the series. We will wrap up our three-part discussion by highlighting innovative compiler technologies.

Drawing the baseline

The start of any performance optimization activity should be the clear definition of the performance baseline. The unit of the baseline could be either transactions per second or, more simply, the run time of the application. Our experience is that we are setting ourselves up for failure if we do not have a clear, reproducible understanding of the baseline. Having a reproducible baseline also means clearly defining your benchmark application with a workload that is representative of anticipated usage. It may be worthwhile at this stage to consider whether you can peel out the part of the application you wish to examine and wrap a main() function around it. This technique allows you to observe the behavior of the section of the application of most interest. You can then use the 'time' utility to measure the time spent by the program. In most production applications, it is difficult to completely separate out the kernel whose performance we wish to observe and improve. In these cases, it may be easier to insert timers in your code as shown below:

Example:

/* Sample Timing */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(void)
{
   clock_t start, finish;
   double  duration;
   start = clock();
   // CODE TO BE MEASURED HERE
   //
   finish = clock();
   // clock() counts processor time used by the program, in ticks
   duration = (double)(finish - start)/CLOCKS_PER_SEC;
   printf("\n%2.3f seconds\n", duration);
   return 0;
}

While it is perfectly fine to use the clock() call from time.h for applications and sections of code that run for a sufficient duration, the resolution of the clock is not fine enough for measuring a small, fast-running section of code.

An alternative is to use the rdtsc (Read Time Stamp Counter) instruction, which returns the number of CPU clocks elapsed since the last reboot. This allows significantly higher resolution than clock(). Intel compilers implement a convenient intrinsic [1] that makes it easy to read the counter.

#include <stdio.h>
#include <stdint.h>
int main(void)
{
  uint64_t start = 0;
  uint64_t stop = 0;
  uint64_t elapsed;

#if defined(__INTEL_COMPILER)
  // Start the counter
  start = _rdtsc();
#endif

  // Code to be measured here

  ...

#if defined(__INTEL_COMPILER)
  // Stop the counter
  stop = _rdtsc();
#endif

  // Calculate the run time in processor cycles
  elapsed = stop - start;
  printf("Processor cycles = %llu\n", (unsigned long long)elapsed);
  return 0;
}

As of this writing, in some cases, rdtsc may report a wrong time-stamp counter value [2]. The technique described above also does not work well if your thread migrates between the two cores, since each core has its own counter.

The other, preferred alternative is to use the OS-supported mach_absolute_time API.

#include <stdio.h>
#include <stdint.h>
#include <CoreServices/CoreServices.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
int main(void)
{
    uint64_t        start;
    uint64_t        stop;
    uint64_t        elapsed;
    // Start the clock.
    start = mach_absolute_time();
    //Code to be measured here
  
    ...
  
    // Stop the clock.
    stop = mach_absolute_time();
    // Calculate the run time in mach absolute time units
    elapsed = stop - start;
    printf("Elapsed mach time units = %llu\n", (unsigned long long)elapsed);
    return 0;
}

In the measurements we did, mach_absolute_time and rdtsc provided answers that were close, but with small deviations. We need to clarify that while it may be comforting to think that we are measuring at the accuracy of clock ticks, the measurements come bundled with a lot of variance. Specifically, you cannot measure the latency of a single instruction, or even a bundle of instructions, using either rdtsc or mach_absolute_time. In many cases, it is to the benefit of the programmer to set up benchmarks that have a sufficient run time between the start and stop timers. A sufficient run time may be, at a minimum, on the order of tens or hundreds of seconds.

Hotspots in the code

Once we have a baseline, a powerful alternative to hand-peeling code and inserting timers is to run a profiler to identify the hotspots in your code. Shark [3] is a powerful tool to help you achieve this. We are not going to go into too much detail about using Shark in this article, since it is covered extensively elsewhere. Additionally, Shark can do much more than what we are calling out here. At a high level, Shark allows you to get a time profile, which is based on sampling your code at fixed time intervals. Depending on your application, you may see profiles that are relatively flat, meaning there are no particular areas in your code that are exercised more than others. Or you could see clear peaks, which would mean that your program exercises a smaller portion of your code more extensively. Shark can group the time profile by thread, allowing you to see the profile of your code for each of the individual threads.

As a quick guide, start Shark from the hard disk at "/Developer/Applications/Performance Tools/CHUD" [4]. Figure 3 shows the start of a Shark session.



Figure 3: Shark Info window

Don't hit the Shark "start" button yet. First, start the application you need to profile. Then hit the "start" button in Shark. Once started, Shark will automatically stop after 30 seconds, or you can choose to hit "stop". Note that it is a good idea to take Shark snapshots over slightly extended periods to get repeatable results. Also, make sure that you have stopped running other applications so as not to pollute the profile gathered. Depending on your application, you may choose to start after your application has "warmed up" or progressed beyond startup initializations and initial file I/O. If you are experienced with your application and its runtime behavior, it is relatively easy to know the hotspots in your code and where they occur during a typical run. Thus, a sound technique is to monitor your application's log output, determine when the hotspot has started, start Shark, and gather a profile over a sufficient length of time.



Figure 4: Shark Time Profile

Note that at this stage it may still be to your advantage to insert timers in your code with print-statements as we saw in the previous section around the areas of code that are of interest to you.

Using the techniques highlighted above, we can gain insight into the operating characteristics of programs and understand where we can make a difference. We can think about improving the performance of the serial portion of the code, and we can also consider the gains available from threading it. Using Amdahl's law, we can make a back-of-the-envelope estimate of how much the overall application can be sped up by serial improvements in the code, as illustrated below.

Let us say that the hotspot, or the section of the serial code we are optimizing, takes up fraction x of the total program run time. Then speeding up this section by a factor of y should theoretically improve overall performance by a factor of 1/((1-x) + x/y). As a limiting condition, the theoretical maximum speedup possible is 1/(1-x). This limit would be reached if the section of the code we are considering took zero time to run. As an example, if the section we are focused on takes 50% of the total run time (x = .5), and we double its speed (y = 2), we can expect an overall speedup of 1/(.5+(.5/2)) = 1/.75 = 1.33, or a 33% improvement in overall performance. As a theoretical maximum, we can get a 2x performance gain for the whole application where fraction x = .5, when speedup y tends to infinity.

Once we determine where we can make a difference, and how much of a difference we can make, we can then look at ways and means in which to make improvements. Please note that while in this article we are looking at serial improvements, in a future article we will look at estimating and planning for parallel improvements in detail.

One other related note before we end this section: as part of optimization, compilers can completely eliminate chunks of code that they determine will not affect the outcome of the final program, a technique referred to as dead code elimination. While this is a very good thing for real applications, you need to be careful to ensure that the compiler does not throw away the performance kernel you have extracted into a snippet program for examination. Typically, an output statement that prints the result is all that is required to ensure that the compiler does not eliminate the small section of code.

Compilers

This may sound like a cliche, but perhaps the first and the foremost tool at your disposal to make a performance difference should be your compiler. In addition to the GNU (gcc) Compiler, we will be discussing using the Intel(R) C++ compiler in the following sections. Both compilers integrate into Apple's Xcode Integrated Development Environment, and are binary and source compatible. Fortran developers can use the Intel(R) Fortran Compiler for Mac OS or several GNU options including g77, gfortran, or G95. While GNU is invoked with the 'gcc' command line, Intel Compilers are invoked with the 'icc' command line for C/C++ and the 'ifort' command for Fortran. While the examples that follow use the Intel C/C++ compiler, the same options apply to the Intel Fortran compiler (ifort).

Generally speaking, newer versions of the compiler optimize for systems running newer processors. You can verify the version of the compiler by using the -v flag.

$ icc -v
Version 9.1
$ gcc -v
Using built-in specs.
Target: i686-apple-darwin8
Configured with: /private/var/tmp/gcc/gcc-5250.obj~12/src/configure --disable-checking 
-enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ 
--program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 
--build=powerpc-apple-darwin8 --with-arch=pentium-m --with-tune=prescott --program-prefix= 
--host=i686-apple-darwin8 --target=i686-apple-darwin8
Thread model: posix
gcc version 4.0.1 (Apple Computer, Inc. build 5250)
  

Here is a very brief rundown of the general optimization options available with the compilers. O0 (gcc -O0 or icc -O0) means no optimization is turned on. While the O0 option can be helpful for debugging applications, your application will run at significantly sub-optimal speed at this option level.

O1 and O2 are higher levels of optimization. O1 usually makes optimization tradeoffs that result in smaller compile time compared to O2.

O3 is the highest level of optimization and makes aggressive decisions on optimizations that require a judgment call between the size of the generated code, and the expected resulting speed of the application.

We should note here that even with the best optimization options, compilers can still use your help. As an example, let us look at an often overlooked performance hit: denormals [5], denormalized IEEE floating-point representations in your code, can trigger exceptions that result in severe runtime penalties. This is because denormals may require the hardware and the OS to intervene in operations using denormal operands. When your application frequently uses very small numbers, you should consider taking advantage of the flush-to-zero (FTZ) feature. The FTZ feature allows the CPU to take denormal values in its registers and convert those values to zero, a valid IEEE representation. FTZ is the default when using SIMD.

Consider the following example where denormals are deliberately triggered for illustration. Here, we look at the timing between gcc and icc for the following example:

#include <stdio.h>
int main(void)
{
        long int i;
        double coefficient = .9;
        double data = 3e-308;
        for (i=0; i < 99999999; i++)
        {
                data *= coefficient;
        }
        printf("%f\t %lx\n", data, *(unsigned long*)&data);
        return 0;
}

$ g++ -O3 denormal.cpp -o gden
$ time ./gden
0.000000         5
real    0m13.462s
user    0m12.676s
sys     0m0.041s
$ icc denormal.cpp -o iden
denormal.cpp(8) : (col. 9) remark: LOOP WAS VECTORIZED.
$ time ./iden
0.000000         0
real    0m0.178s
user    0m0.138s
sys     0m0.006s

Notice that since the loop was fairly simple, the Intel compiler was able to vectorize it and therefore use SIMD. Because flush-to-zero is the default when using SIMD registers, the runtime improvement can be dramatic. We will dive into SIMD and auto-vectorization in more detail in the next installment of this series of articles.

Next installment

Now that we have had a chance to go through the introductions, in the next installment we will see how to pack a punch in your optimizations without going through the tedious process of hand-assembling instructions or even intrinsics. We will accomplish this by taking advantage of the auto-vectorization feature. And yes, if you have AltiVec code or SSE instructions that you intend to migrate to take advantage of auto-vectorization, then the next installment is a must-read for you!

In the meantime, hopefully you will get the chance to visit with some members of the Intel Software Development Products team at WWDC.


Both authors are members of the Intel Compiler team. Ganesh Rao has been with Intel for over nine years and currently helps optimize applications to take advantage of the latest Intel processors using the Intel Compilers.

Ron Wayne Green has been involved in Fortran and high-performance computing applications development and support for over twenty years, and currently assists with Fortran and high-performance computing issues.

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.