
- Home
- Magazine
- Conference & Seminars
- News
- Archives
- Forums
- Store
- Directory
- Editorial
- Advertising
- User/Login
- Contact



Volume Number: 15 (1999)
Issue Number: 4
Column Tag: CodeWarrior Workshop
by Andrew Downs
Code optimization involves the application of rules and algorithms to program code with the goal of making it faster, smaller, more efficient, and so on. Often these types of optimizations conflict with each other: for instance, faster code usually ends up larger, not smaller. Optimizations can be performed at several levels (e.g. source code, intermediate representations), and by various parties, such as the developer or the compiler/optimizer.
This article discusses and demonstrates the code optimization options available in CodeWarrior Pro. In spite of its ever-growing sophistication, CodeWarrior still makes it easy to select the right level of optimization for your program. Understanding how CodeWarrior applies optimizations will help when it is time to debug your code, especially when using a low-level debugger or when the source code is not available.
In order to understand the process, you need to speak the language. Here are definitions for the terms discussed in this article, plus a few extras. Keep in mind that the words replace, substitute, etc. do not imply physical changes to your original source code, but rather are reflected in the generated machine code, after your source code has been parsed and optimized.
Each term includes a brief description, including why it is important. Code examples illustrating some of these terms will be presented later in the article.
Figure 1 shows the dialog box pane used to specify instruction scheduling and peephole optimization in CodeWarrior.

Figure 1. PPC processor pane.
Global optimizations are specified in the optimization pane, shown in Figure 2. Global optimization can be performed with the intent of making the resulting code faster or smaller, but not both.

Figure 2. Global optimization pane.
The code examples presented in this article use either peephole optimization or the faster execution speed settings. The smaller code size setting is briefly discussed later in the article. Profiled results for smaller code size are provided in the article archive, but are not illustrated in code examples.
The Disassemble menu item under the Project menu in CodeWarrior allows you to view the compiler's output resulting from application of any optimization choices. This is extremely useful in that it gives you a chance to save an assembly listing of your code for use during debugging sessions. That menu item was used to produce the listings presented in this article.
Listed below are the PowerPC instructions used in this article. Following each is the expanded mnemonic, and a short description of what the instruction does (using the specified operands). The instructions are presented in mnemonic alphabetical order, not the order in which they are encountered in the code. For a thorough treatment of Power PC assembly language, refer to the references listed at the end of this article.
add r30,r30,r0addi r30,r30,1b *+12bdnz *-80bl.whileTest__Fvblt *-12bne *-28cmplw r31,r29cmplwir31,$0000li r30,0lwzr3,gLimit(RTOC)mr r31,r29mtctr r0mullw r0,r29,r28stmw r26,-24(SP)stw r31,-4(SP)subir31,r31,1subic.r31,r31,1The archive for this article includes the .c and .dump files. Listing 1 contains the source code used to test the optimization options. For illustration purposes, some of the functions contain redundant or unused variables, unoptimized loops, and so on.
The profiler calls and header include statement are commented out. The sample project file in the archive includes the profiler library, so if you would like to use the profiler, simply uncomment the four lines in the source code, compile, and run.
test.c
Sample code with loops and variable usage.
#include <stdio.h>
#include <stdlib.h>
//#include <profiler.h>
unsigned long cseTest( void );
unsigned long doWhileTest( void );
unsigned long whileTest( void );
unsigned long forTest( void );
unsigned long gLimit = 1000L;
int main( void ) {
//ProfilerInit( collectSummary, bestTimeBase, 10, 3 );
//ProfilerSetStatus( 1 );
unsigned long theResult;
unsigned long j = 0L;
j++;
theResult = whileTest();
theResult = doWhileTest();
theResult = cseTest();
theResult = forTest();
j-;
//ProfilerDump( "\ptest.c.prof");
//ProfilerTerm();
return 0;
}
unsigned long whileTest( void ) {
unsigned long theResult = 0L;
unsigned long i, limit = gLimit;
i = 0;
while ( i < limit )
i++;
return 0;
}
unsigned long doWhileTest( void ) {
unsigned long theResult = 0L;
unsigned long i, limit = 1000L;
unsigned long dest[ 1000 ], index = 0;
i = limit;
do {
dest[ index++ ] = i;
i-;
} while ( i > 0 );
return 0;
}
unsigned long cseTest( void ) {
unsigned long theResult = 0L;
unsigned long i, limit = 1000L;
unsigned long product = 0L;
unsigned long x = 4;
unsigned long y = 3;
theResult = limit;
i = limit;
do {
product += x * y;
} while ( i- > 0 );
return 0;
}
unsigned long forTest( void ) {
unsigned long i, limit = 1000, sum = 0;
for ( i= 0; i < limit; i++ )
sum += i;
return sum;
}
The peephole optimizer applies a set of pattern-matching rules to the code, with the goal of replacing inefficient sequences of instructions with more efficient sequences. A second goal is to perform some register usage optimizations based on variable lifetimes. Both benefits are shown in the following examples.
Note that function prolog and epilog code is only occasionally presented in the listings in this article. The full listings are included in the article archive.
In the first example, the non-optimized version of main() contains the following assembly code, where register r30 is used to hold the value of j:
int main( void ) {
//...
// Initialize j.
// unsigned long j = 0L;
li r30,0
// Increment j.
// j++;
addi r30,r30,1
// theResult = whileTest();
bl .whileTest__Fv
// ...
Enabling the peephole optimizer results in no action being taken on j, since its value is never used:
int main( void ) {
//...
// Initialize j.
// unsigned long j = 0L;
// Increment j.
// j++;
// No instructions generated!
// theResult = whileTest();
bl .whileTest__Fv
// ...
The second example uses the function doWhileTest(). The variable limit is accessed twice, but its value never changes. In addition, there is a subtract/compare instruction sequence. First, the unoptimized version:
unsigned long doWhileTest( void ) {
// ...
// unsigned long i, limit = 1000L;
// Initialize limit in r29.
li r29,1000
// i = limit;
// Copy limit into r31.
mr r31,r29
// i-;
// Subtract 1 from i (in r31).
subi r31,r31,1
// } while ( i > 0 );
// Compare to 0 and branch if not equal.
cmplwi r31,$0000
bne *-28
The peephole optimizer detects the redundant register usage for the variable limit, and the fact that the value of limit never changes. The peephole optimizer also finds the inefficient subtract/compare sequence. It transforms the function into:
unsigned long doWhileTest( void ) {
// ...
// Initialize limit in r31.
// This is the same register used for i, so no copy/move operation is required.
li r31,1000
// i-;
// Subtract with carry and set condition flags.
subic. r31,r31,1
// Branch based on result of subtraction operation.
// Comparison operation is implicit.
bne *-20
The CodeWarrior profiler, which can be used to measure (among other things) the amount of time spent in various functions, showed the following execution times for doWhileTest():
| Optimization Level | Function | Time (ms) | % of Total Time |
| None | doWhileTest() | 0.131 | 39.3 |
| Peephole | doWhileTest() | 0.105 | 34.4 |
These results were obtained using the bestTimeBase setting. At runtime, this translated into the PowerPC timebase setting, which uses the Real-Time Clock (RTC) register on PPC 601-based machines, and the Time Base (TB) register on later implementations, for accurate, low-overhead timing.
The fourth column is the percentage of total program running time spent in this function. This is very useful in determining where to begin optimizing. To use it properly, you should compare the results for all of the functions at a given optimization level. (The profiler summary for all functions and all optimization levels is included in the article archive.) You need to be careful, because the time and % of total time columns do not necessarily mirror each other. You will see in one of the examples (Global Optimization - Level 4) that sometimes an inverse relationship occurs: less time spent in a function may be accompanied by a greater percentage of total time!
In summary, even though its scope is relatively local, the peephole optimizer may be able to provide some performance benefits, depending on patterns it detects in the code. In this case, it decreased execution time slightly (by approximately 20%).
This level of optimization provides global register allocation and dead code elimination.
Here is the non-optimized version of whileTest():
unsigned long whileTest( void ) {
stw r31,-4(SP)
stw r30,-8(SP)
// unsigned long theResult = 0L;
li r0,0
stw r0,-16(SP)
// unsigned long i, limit = gLimit;
lwz r3,gLimit(RTOC)
lwz r30,0(r3)
// i = 0;
li r31,0
// while ( i < limit )
b *+8
i++;
addi r31,r31,1
cmplw r31,r30
blt *-8
The level 1 optimized version eliminates the non-volatile register saves, and uses volatile registers for local variables. Also, the assignment of theResult is eliminated, since its value is never used.
unsigned long whileTest( void ) {
// unsigned long theResult = 0L;
// unsigned long i, limit = gLimit;
lwz r3,gLimit(RTOC)
lwz r0,0(r3)
// i = 0;
li r3,0
// while ( i < limit )
b *+8
// i++;
addi r3,r3,1
cmplw r3,r0
blt *-8
| Optimization Level | Function | Time (ms) | % of Total Time |
| None | whileTest() | 0.025 | 7.5 |
| Global - Faster 1 | whileTest() | 0.025 | 7.5 |
In this case, the result is not faster, but it is more efficient in terms of number of instructions used (code size).
This level of optimization adds common subexpression elimination and copy propagation to those provided by level 1.
Here is the non-optimized version of cseTest():
unsigned long cseTest( void ) {
// Preserve the non-volatile registers (r26-r31).
stmw r26,-24(SP)
// unsigned long i, limit = 1000L;
li r27,1000
// unsigned long product = 0L;
li r30,0
// unsigned long x = 4;
li r29,4
// unsigned long y = 3;
li r28,3
// theResult = limit;
// i = limit;
mr r31,r27
// do {
// product += x * y;
// A multiply is an expensive (slow) operation.
mullw r0,r29,r28
add r30,r30,r0
// } while ( i- > 0 );
cmplwi r31,$0000
subi r31,r31,1
bne *-16
Enabling level 2 optimization changes the multiply instruction to an add (using a constant value of 12). As with level 1, the use of volatile registers (r0, r3-r12) replaces usage of non-volatile registers (r13-r31). As a result, the store multiple instruction goes away.
unsigned long cseTest( void ) {
// No store multiple register instruction.
// unsigned long theResult = 0L;
// ...
// theResult = limit;
li r4,0
// i = limit;
// do {
li r3,1000
// product += x * y;
// Add the constant value 12 to product.
addi r4,r4,12
// } while ( i- > 0 );
cmplwi r3,$0000
subi r3,r3,1
bne *-12
| Optimization Level | Function | Time (ms) | % of Total Time |
| None | cseTest() | 0.131 | 39.3 |
| Global - Faster 2 | cseTest() | 0.047 | 18.7 |
The optimized version is approximately 64% faster than the original!
This level of optimization adds loop transformations, strength reduction, and loop-invariant code motion to those provided by level 2.
Here is the non-optimized version of forTest():
unsigned long forTest( void ) {
stw r31,-4(SP)
stw r30,-8(SP)
stw r29,-12(SP)
// unsigned long i, limit = 1000, sum = 0;
li r29,1000
li r30,0
// for ( i= 0; i < limit; i++ )
li r31,0
b *+12
// sum += i;
add r30,r30,r31
addi r31,r31,1
cmplw r31,r29
blt *-12
Applying the optimizations results in an unrolling of the loop. This code now contains the equivalent of ten loop iterations in the body of the loop. Notice that the iterator (limit) has been reduced to the value of 100 to compensate.
unsigned long forTest( void ) {
// unsigned long i, limit = 1000, sum = 0;
li r3,0
li r4,0
b *+4
// for ( i= 0; i < limit; i++ )
// Iterate over the loop 1000 times (total).
// Use the count register to hold the adjusted value of limit (100).
li r0,100
mtctr r0
b *+4
// sum += i;
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
bdnz *-80
| Optimization Level | Function | Time (ms) | % of Total Time |
| None | forTest() | 0.046 | 13.9 |
| Global - Faster 3 | forTest() | 0.031 | 14.9 |
The optimized version is approximately 33% faster. Notice the increased percentage of total time: the time savings obtained in this loop must have been outweighed by more significant savings elsewhere in the optimized program.
This level of optimization adds repeated loop-invariant code motion and common subexpression elimination. The optimized code becomes a candidate for further optimizations, until no more opportunities are found.
For this example, we will compare the level 3 and 4 versions of the same function. Here is the level 3 optimized version of cseTest():
unsigned long cseTest( void ) {
// unsigned long theResult = 0L;
// unsigned long i, limit = 1000L;
// unsigned long product = 0L;
// unsigned long x = 4;
// unsigned long y = 3;
// theResult = limit;
// i = limit;
// do {
li r4,0
li r3,1000
b *+4
// product += x * y;
// } while ( i- > 0 );
addi r4,r4,12
cmplwi r3,$0000
subi r3,r3,1
bne *-12
Here is the level 4 version. Notice that the value of product is never calculated, probably because it is never used.
unsigned long cseTest( void ) {
// unsigned long theResult = 0L;
// ...
// do {
li r3,1000
b *+4
b *+4
// product += x * y;
// } while ( i- > 0 );
cmplwi r3,$0000
subi r3,r3,1
bne *-8
| Optimization Level | Function | Time (ms) | % of Total Time |
| Global - Faster 3 | cseTest() | 0.046 | 22.4 |
| Global - Faster 4 | cseTest() | 0.028 | 14.8 |
Again, much (39%) faster the second time around!
Although I am not presenting examples using global optimizations for smaller code, the profiler results for those settings are included in the archive. For reference, here are the generated code sizes for faster vs. smaller, for each optimization level:
Faster 0 360 48 (constant) Faster 1 260 Faster 2 220 Faster 3 308 Faster 4 316 Smaller 0 312 Smaller 1 260 Smaller 2 220 Smaller 3 236 Smaller 4 244| Faster/Smaller | Level | Code Size (bytes) | Data Size (bytes) |
| Faster | 0 | 360 | 48 (constant) |
| Faster | 1 | 260 | |
| Faster | 2 | 220 | |
| Faster | 3 | 308 | |
| Faster | 4 | 316 | |
| Smaller | 0 | 312 | |
| Smaller | 1 | 260 | |
| Smaller | 2 | 220 | |
| Smaller | 3 | 236 | |
| Smaller | 4 | 244 |
The smaller option always generated code that was no larger, and usually smaller, than its faster counterpart at each optimization level. Applying levels 3 and 4 always resulted in increased code size (compared to level 2). However, the percentage increase from level 2 to level 4 for the smaller setting (10.9%) is much lower than for the faster setting (43.6%).
Be careful in trying to apply these results to another program. Each program has its own unique characteristics, which may respond favorably to some types of optimization but not to others. Plus, your needs may dictate which type of optimization to perform. For example, code destined for embedded systems may require the smaller option.
Hand-optimized programs may still present the best performance. For example, you can optimize loops yourself, or include assembly functions mixed with high-level code. On the other hand, that approach is time-consuming, costly, and possibly less portable. Plus, not all developers have the necessary skills to hand-optimize. Relying on the development environment to perform optimizations is a good thing.
Writing a straightforward, correct program is key to generating an optimized program. That version becomes the starting point in the optimization process. Every subsequent adjustment or optimization, whether applied by the developer or the development environment, should only make the code faster or tighter.
Debugging efforts should focus on the unoptimized version, since the generated machine code most closely mirrors the original source. Once the program is bug-free (!), the desired type and level of optimization should be applied.
Optimization techniques can improve code size or speed. CodeWarrior provides several categories of optimization, including peephole, global, and instruction scheduling. You can apply any of these in combination via the project preferences dialog. It is worth reviewing your disassembled code to assess the impact the optimizations will have on subsequent debugging sessions. Debugging will be easier if performed on the unoptimized version of your code, since the optimized version of the code may not closely mirror the original source. Once the bugs are out, add the appropriate type of optimization.
Thanks to Tom Thompson of Metrowerks for reviewing this article and providing insight into CodeWarrior's optimization process.
The article archive, containing a CodeWarrior project file, source code, and disassembler and profiler dump files, is available at <www.mactech.com/>.
Andrew Downs is a Technical Lead for Template Software in New Orleans, LA, designing and building enterprise apps. He also teaches Java programming at Tulane University College. Andrew wrote the Macintosh freeware program Recent Additions, and the Java application UDPing. You can reach him at andrew@downs.net.




