
Efficient 68040
Volume Number:9
Issue Number:2
Column Tag:Efficient coding

Related Info: Memory Manager

Efficient 68040 Programming

Optimizing your code to run faster on the 68040

By Mike Scanlin, MacTech Magazine Regular Contributing Author

The current trend towards more and more 68040s is clear to anyone who follows the Macintosh. Some sources say that most, if not all, of the Mac product line will be moved to the 68040 sometime in 1993. With QuickTime and Color QuickDraw already requiring at least a 68020, perhaps the day when system software and applications require a 68040 isn’t that far away. In preparation for that day, here are some tips on how to write efficient code for the 68040.


One of the goals of the 040 designers was to increase the performance of the large installed base of 680x0 code that was already out there. They gathered 30MB of object code from several different platforms and profiled it to gather instruction frequency and other statistics. They used this information to influence the design of the cache structure and memory management system as well as which parts of the instruction set they would optimize.

From this trace data it was determined that most of the common instructions could execute in one clock cycle if the Integer Unit were pipelined and if the instructions weren’t larger than three words each. The resulting six-stage pipeline optimizes several of the less-complicated addressing modes: Rn, (An), (An)+, -(An), (An, d16), $Address and #Data. These seven modes are called the optimized effective-address modes (OEA). When writing efficient 68040 code you should stick to these addressing modes and not use the others (i.e., don’t use instructions that are four words (8 bytes) or longer). Sequences of instructions composed only of these addressing modes can be pipelined without stalls and will have a lower average instruction time than sequences of instructions containing 8-byte instructions every so often.

Figure 1 shows a comparison of cycle times between the 68020 and 68040 that illustrates some of the improvements made for the 68040 (RL stands for register list).


One thing to notice in Figure 1 is that branches taken are now faster than branches not taken. This is different from all other non-68040 members of the 680x0 family. It’s somewhat annoying because it means that you can’t simultaneously optimize for both the 040 and the 030 (there are other cases of this, too, discussed a little further on). Motorola did this because their trace data of existing code showed that 75% of all branch instructions were taken.

In addition to switching which was the faster case, they also managed to speed up both cases by adding a dedicated branch adder that always calculates the destination address when it sees a branch instruction. If it turns out that the branch is not taken then the results of the branch adder are ignored.

                                           25MHz cycles
Instruction             Addr mode          68020   68040
Move                    Rn,Rn              2       1
Move                    <OEA>,Rn           6       1
Move                    Rn,<OEA>           6       1
Move                    <OEA>,<OEA>        8       2
Move                    (An,Rn,d8),Rn      10      3
Move                    Rn,(An,Rn,d8)      6       3
Move multiple           RL,<OEA>           4+2n    2+n
Move multiple           <OEA>,RL           8+4n    2+n
Simple arithmetic       Rn,Rn              2       1
Simple arithmetic       Rn,<OEA>           6       1
Simple arithmetic       <OEA>,Rn           8       1
Shifts (1 to 31 bits)   -                  4       2
Branch taken            -                  6       2
Branch not taken        -                  4       3
Branch to subroutine    -                  6       2
Return from subroutine  -                  10      5

Figure 1

What all of this means in practical terms is that you should always write your code so that branches are taken rather than not taken. The most commonly executed thread should take all branches. For instance, this code:

 x = 1;
 if (likelyEvent)
   x = 2;

can be improved by switching the condition and forcing the branch after the Tst instruction to be taken (assuming likelyEvent is True more than half of the time):

 x = 2;
 if (!likelyEvent)
   x = 1;

Be careful when doing this, though, that the compiler doesn’t generate an extra instruction for the added “!”. If so, it’s not worth switching the condition. But in examples like the one given, the compiler can usually just change a Beq instruction to a Bne instruction and you’ll be better off.

While we’re on the subject of branches, here’s a trick you can use to do a fast unconditional branch on the 040 if you’re writing in assembly or using a clever C compiler (works on the 020 and 030, too, but takes longer on those): use the Trapn # (trap never immediate) instruction to unconditionally branch ahead by 2 or 4 bytes in 1 cycle. One example where this is useful is if you have a small clause (2 or 4 bytes) in an else statement.

First, define these two macros:

/* Trapn.W */
#define SKIP_TWO_BYTES  DC.W 0x51FA
/* Trapn.L (macro name assumed by analogy with the one above) */
#define SKIP_FOUR_BYTES DC.W 0x51FB

Now suppose you had this code:

 if (x) {
   y = 1;
   z = 2;
 }
 else
   q = 3;

The normal assembly generated might be:

 Tst    x
 Beq.S  @1
 Moveq  #1,y
 Moveq  #2,z
 Bra.S  @2
@1 Moveq  #3,q
@2

A clever compiler (or, more likely, assembly language programmer) could optimize this as:

 Tst    x
 Beq.S  @1
 Moveq  #1,y
 Moveq  #2,z
 SKIP_TWO_BYTES
@1 Moveq  #3,q
@2

What’s happening here is that the two bytes generated by the Moveq #3,q instruction become the immediate data for the Trapn.W instruction in the SKIP_TWO_BYTES macro. Trapn.W is normally a 4 byte instruction but the macro only defines the first two bytes. Since it will never trap, the instruction decoder always ignores its operand (the Moveq #3,q instruction) and begins decoding the next instruction at @2 on the next clock. It works the same way for the Trapn.L instruction, except that in that case you embed exactly 4 bytes as the immediate data that will be skipped as part of the Trapn instruction.

Note that to take advantage of this trick you’re usually going to want the smaller of the “if” clause and the “else” clause to be the “else” clause (to increase the chances that the “else” clause is 4 bytes or less). It would be nice if this was the most commonly executed of the two clauses, too, to take advantage of the faster branch-taken time. Hopefully compilers that have a “Generate 68020 code” flag will take advantage of this in the future (I don’t know of any at the moment that do).
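For readers who want to check where the 0x51FA value comes from, here is a small C sketch that builds the opcode word from the TRAPcc bit layout (0101 cccc 1111 1ooo, where cccc is the condition and ooo selects the operand size). The helper and enum names are made up for illustration and are not part of any assembler or compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Operand-size field (bits 2-0) of the TRAPcc opcode word. */
enum trapcc_opmode {
    TRAPCC_WORD = 2,   /* one extension word follows  */
    TRAPCC_LONG = 3,   /* two extension words follow  */
    TRAPCC_NONE = 4    /* no extension words          */
};

/* Condition code 1 is F ("never"), which the article calls Trapn. */
#define CC_NEVER 1u

/* Build a TRAPcc opcode word: 0101 cccc 1111 1ooo. */
static uint16_t trapcc_opcode(unsigned cc, enum trapcc_opmode opmode)
{
    assert(cc < 16);
    return (uint16_t)(0x50F8u | (cc << 8) | (unsigned)opmode);
}
```

With the "never" condition, the word-sized form comes out as 0x51FA and the long-sized form as 0x51FB, which is why emitting only the opcode word makes the processor swallow the next 2 or 4 bytes as the unused operand.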


Optimal saving and restoring of registers on the 040 is different from other 680x0s. When loading registers from memory using the post-increment addressing mode, as in:

 Movem.L (An)+,RL

you should use individual Move.L instructions instead. It will always be faster, no matter how many registers are involved (not exactly intuitive, is it?). When storing registers to memory with the pre-decrement addressing mode, as in:

 Movem.L RL,-(An)

you should use individual Move.L instructions unless your register list consists of: (1) exactly one data register and one address register or, (2) two or more address registers combined with any number (0..7) of data registers.
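The store rule is easy to get backwards, so here it is as a C predicate. This is only a sketch of the rule as stated in the text: the function name is hypothetical, and the "(0..7)" is read literally as a count of data registers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when a single Movem.L to -(An) is the
   better way to store the register list on an 040, following the rule in
   the text; false means individual Move.L instructions should be used.
   (For loads with (An)+, the text says individual moves always win.) */
static bool prefer_movem_store(int data_regs, int addr_regs)
{
    assert(data_regs >= 0 && data_regs <= 8);
    assert(addr_regs >= 0 && addr_regs <= 8);

    /* Case 1: exactly one data register and one address register. */
    if (data_regs == 1 && addr_regs == 1)
        return true;

    /* Case 2: two or more address registers with 0..7 data registers. */
    if (addr_regs >= 2 && data_regs <= 7)
        return true;

    return false;
}
```

For example, saving D3-D7/A2-A4 (five data registers, three address registers) falls under case 2, while saving D3-D7 alone does not qualify and should be done with individual moves.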


Three-word instructions with 32-bit immediate operands are faster than trying to use Moveq to preload the immediate value into a register first. The opposite is true on earlier 680x0s. For example, this code:

 Cmp.L  #20,(A0)

is faster on an 040 than this pair of instructions:

 Moveq  #20,D0
 Cmp.L  D0,(A0)

When subtracting an immediate value from an address register it is faster to add the negative value instead. This is because there is no complement circuit for the address registers in the 040. This instruction:

 Add    #-4,A0

is faster than either of these two:

 Lea    -4(A0),A0
 Sub    #4,A0

Bsr and Bra are faster than Jsr and Jmp because the hardware can precompute the destination address for Bsr and Bra.


There are some cases where it’s better to use a stack variable instead of a register variable. The reason is that source effective addresses of the form (An, d16) are just as fast as Rn once the data is in the data cache. So the first read access to a stack variable will be slow compared to a register but subsequent reads of that variable will be equal in speed. By not assigning registers to your read-only stack variables (which includes function parameters passed on the stack) you save the overhead of saving/restoring the register as well as the time to initialize it.

You should, however, use register variables for variables that are written to. For instance, consider this function:

 Foo(w, x, p)
 int  w, x;
 int  *p;
 {
   int  y, z;

   y = 0;  /* start the running sum from a known value */
   z = w;
   do {
     y += z + *p * w;
     *p += x / w + y;
   } while (--z);

   return (y);
 }

In this example, w, x, y, z and *p are being read from (things on the right side of the equations) and y, z and *p are being written to. On the 040, you should make register variables out of those things that are being written to and leave the rest as stack variables:

 Foo(w, x, p)
 int  w, x;
 register int  *p;
 {
   register int  y, z;

   y = 0;  /* start the running sum from a known value */
   z = w;
   do {
     y += z + *p * w;
     *p += x / w + y;
   } while (--z);

   return (y);
 }

This second version is faster than the original version (as you would expect) but it is also faster than a version where w and x are declared as register variables (which you might not expect).


When it came to floating point operations, the 040 designers looked at their trace data and decided to implement in silicon any instruction that made up more than 1% of the 68881/2 code base. The remaining [uncommon] instructions were implemented in software. Those implemented in silicon are:

 FAdd, FCmp, FDiv, FMul, FSub
 FAbs, FSqrt, FNeg, FMove, FTst
 FBcc, FDbcc, FScc, FTrapcc
 FMovem, FSave, FRestore

They also made it so the Integer Unit and the Floating Point Unit operate in parallel, which means you should interleave floating-point and non-floating-point instructions as much as possible.

Here’s a table that summarizes the performance improvements made by having the FPU instructions executed by the 040 rather than by a 68882:

                               25MHz cycles
Instruction   Addr mode        68882   68040
FMove         FPn,FPn          21      2
FMove.D       <EA>,FPn         40      3
FMove.D       FPn,<EA>         44      3
FAdd          FPn,FPn          21      3
FSub          FPn,FPn          21      3
FMul          FPn,FPn          76      5
FDiv          FPn,FPn          108     38
FSqrt         FPn,FPn          110     103
FAdd.D        <EA>,FPn         75      3
FSub.D        <EA>,FPn         75      3
FMul.D        <EA>,FPn         95      5
FDiv.D        <EA>,FPn         127     38
FSqrt.D       <EA>,FPn         129     103

Notice that on the 040 an FMul is about 7x faster than an FDiv and on a 68882 it’s only about 1.4x faster. This suggests that you should avoid FDiv on an 040 much more than you would on a 68882. Perhaps your algorithms could be rewritten to take advantage of this when running on an 040.

A trick that works in some cases is to multiply by 1 over a number instead of dividing by a number. Take this code from a previous MacTutor article on random numbers:

quotient  EQU FP0
newSeed   EQU D1
result    EQU 8
LocalSize EQU 0

 Link   A6,#LocalSize
 Jsr    UpdateSeed
 FDiv.L #M,quotient
 Unlk   A6

By precomputing the floating point value OneOverM (1/M) and restricting ourselves to the optimized effective addressing modes we can rewrite this code to eliminate the Link, Unlk and FDiv:

OneOverM  EQU "$3FE000008000000100000002"
quotient  EQU FP0
newSeed   EQU D1
result    EQU 4

 Jsr    UpdateSeed
 FMul.X #OneOverM,quotient
 Move.L result(A7),A0

This optimized version runs about 38% faster than the original overall (the relatively low improvement is caused by the fact that UpdateSeed is taking up most of the time). This example points out one other interesting thing, too, and that is the Move.L result(A7),A0 (an Integer Unit instruction) is running in parallel with the FMul instruction (an FPU instruction). Since the FMul takes longer, the FMove.X instruction at the end will have to wait for the FMul to finish before it does its move but there’s nothing we can do about that in this case.
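The same idea can be expressed in C. This is a minimal sketch rather than the article’s routine: the function names are hypothetical, and the modulus 2^31 - 1 is an assumption inferred from the OneOverM constant (whose extended-precision value works out to roughly 2^-31). Keep in mind that multiplying by a rounded reciprocal can differ from a true division in the last bit or so.

```c
#include <assert.h>

/* Assumed modulus: 2^31 - 1, inferred from the OneOverM constant. */
#define M 2147483647.0

/* Reciprocal computed once, so the hot path only needs a multiply. */
static const double ONE_OVER_M = 1.0 / M;

/* Straightforward version: one divide per call. */
static double scale_by_divide(long seed)
{
    assert(seed >= 0);
    return (double)seed / M;
}

/* Faster on hardware where multiply is much cheaper than divide. */
static double scale_by_multiply(long seed)
{
    assert(seed >= 0);
    return (double)seed * ONE_OVER_M;
}
```

Both functions map a seed in the range 0..M-1 to a value in [0, 1). Whether the tiny rounding difference matters depends on how the result is used, so it is worth comparing the two on your own data before switching.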


The 040 has a 4K instruction cache and a 4K data cache. If you are performing some operation on a large amount of data, try to make your code fit in 4K or less (at least your innermost loop if nothing else) and try to operate on 4K chunks of contiguous data at a time. Don’t randomly read single bytes from a large amount of data if you can help it. This will avoid cache flushing and reloading as much as possible.
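As a rough illustration of working on 4K chunks, the C sketch below runs two passes over a buffer one chunk at a time, so the second pass finds its data still in the cache instead of re-reading the whole buffer from memory. The two passes are placeholders for real work, all names are hypothetical, and in practice a chunk somewhat smaller than 4K may behave better because the stack and other data share the same cache.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_BYTES 4096u  /* size of the 040 data cache */

/* Placeholder first pass: add one to every byte. */
static void pass_one(unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        p[i] = (unsigned char)(p[i] + 1);
}

/* Placeholder second pass: double every byte. */
static void pass_two(unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        p[i] = (unsigned char)(p[i] * 2);
}

/* Run both passes over each chunk before moving on to the next one. */
static void process_in_chunks(unsigned char *data, size_t len)
{
    size_t off = 0;

    assert(data != NULL || len == 0);
    while (off < len) {
        size_t n = len - off;
        if (n > CHUNK_BYTES)
            n = CHUNK_BYTES;
        pass_one(data + off, n);
        pass_two(data + off, n);
        off += n;
    }
}
```

The result is the same as running each pass over the entire buffer in turn; only the order of memory accesses changes.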

Many of the things I mentioned in the Efficient 68030 Programming article (Sept 92) about 16-byte cache lines apply to the 040 as well; it’s just that the 040 has more of them. Also, as mentioned in the 030 article, data alignment is majorly important on the 040 as well. Rather than repeat it all here, check out that previous article instead.


Most people have at least heard about Move16, the only new instruction that the 68040 provides, but many aren’t sure when they can use it. The rules are pretty simple: the source and destination addresses must both be multiples of 16, and you must be moving 16 bytes at a time.
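Those rules can be checked up front in C before committing to a Move16 loop. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the copy routine uses memcpy as a portable stand-in for the actual Move16 instruction, which C cannot express directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True when the address is a multiple of 16. */
static bool is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* True when a copy meets the requirements: both addresses on
   16-byte boundaries and a nonzero length that is a multiple of 16. */
static bool can_use_move16(const void *src, const void *dst, size_t len)
{
    return len != 0 && (len & 15u) == 0 &&
           is_aligned16(src) && is_aligned16(dst);
}

/* Portable stand-in for a Move16 loop: one 16-byte line per iteration. */
static void copy_lines16(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t off;

    assert(can_use_move16(src, dst, len));
    for (off = 0; off < len; off += 16)
        memcpy(d + off, s + off, 16);
}
```

A caller would test can_use_move16 first and fall back to an ordinary copy when it returns false.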

So when is this useful? Well, if you know you’re running in a 68040 environment (use Gestalt) then you know that the Memory Manager only allocates blocks on 16 byte boundaries (because that’s the way Apple implemented it). You can use this information to your advantage if you are copying data from one memory block to another.

Why not just use BlockMove you ask? Three reasons: (1) Trap overhead, (2) Job preflighting to find the optimal move instructions for the given parameters (which we already know are Move16 compatible) and, (3) It flushes the caches for the range of memory you moved every time you call it.

Why does it flush the caches? Because of the case where the Memory Manager has called it to move a relocatable block that contains code (the MM doesn’t know anything about the contents of a block so it has to assume the worst). This one case imposes an unnecessary penalty on your non-code BlockMoves (99% of all moves, I would guess) and it is this author’s opinion that Apple should provide a BlockMoveData trap that doesn’t flush the caches and that would only be called when the programmer who wrote the code knew that what was being moved was not code (and deliberately made a call to BlockMoveData instead of BlockMove). Write your senator, maybe we can do some good here.

One other thing to note about the Move16 instruction is that unlike other Move instructions it doesn’t leave the data it’s moving in the data cache. This is great if you’re moving a large amount of data that you’re not going to manipulate afterwards (like updating a frame buffer for the screen or something) but may not be what you want if you’re about to manipulate the data that you’re moving (where it might be advantageous to have it in the cache after it’s been moved). There is no rule of thumb on this because it depends on how much data you have and how much manipulation you’re going to do on it after it’s moved. You’ll have to run some tests for your particular case.

Well, that’s all the tips and tricks I know for programming the 68040. I’d like to thank the friendly and efficient people at Motorola for source material in producing this article as well as for producing such an awesome processor. I am truly a fan. With any luck at all the 80x86 camp will wither away and die and 680x0s will RULE THE WORLD! Thanks also to RuleMaster Hansen for his code, clarifications, corrections and rules.


All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.