Fast Blit Strategies

Volume Number: 15 (1999)
Issue Number: 6
Column Tag: Programming Techniques

Fast Blit Strategies: A Mac Programmer's Guide

by Kas Thomas

Getting better video performance out of the Mac isn't hard to do - if you follow a few rules

Introduction

Ironically, the main performance bottleneck for game programmers today - as ten years ago - is getting pixels up on the screen. With the advent of 100 MHz bus speeds, built-in hardware support for 2D/3D graphics acceleration, megabyte-sized backside caches, and superior floating-point performance, you'd think screen refresh rates would no longer be an issue. But as CPU and bus speeds have increased, so has monitor resolution - and pixel throughput. Providing the user with cinematic animation at full screen resolution remains a formidable challenge.

Because of human interface concerns, writing direct-to-screen has always been treated as something of a taboo in the Mac world. QuickDraw was invented to save us from having to resort to such low-level techniques. But there are still times when writing directly to video memory makes sense, particularly in game programming, where anything goes when it comes to user interface design. In this article, we won't shy away from direct-device writing or treat it as a taboo subject; in fact, we'll concentrate on it, with a view toward optimizing our code for the G3 (and soon, G4) chip architecture. We'll talk about assembly language, cache issues, line-skip blitting, and how to customize QuickDraw without patching any traps (among other subjects). In order to keep the pace brisk, we'll assume that you already know what a GWorld is, how to manipulate PixMaps, and the basics of display modes. If you need to brush up on these items, a good crash course can be found in Dave Mark's Mac Programming FAQs book (IDG Books, 1996).

Snappy Screen Drawing

First, let's summarize the basics. (If any of the following sounds unfamiliar, you should probably read up on video device fundamentals.) It should go without saying that maximizing screen drawing performance usually means taking advantage of one or more - or possibly all - of the following techniques:

  • Use 8-bit color instead of 32-bit (which cuts bus traffic by 75%).
  • Cache and redraw dirty rects only (so you don't repaint more territory than necessary). In games where most of the screen's pixels don't change from frame to frame, it pays to just keep track of the regions that need redrawing, and only redraw those regions.
  • Use pixel-skip draw techniques. This means implementing your sprite-drawing in such a way as to draw only the non-empty pixels in a sprite, skipping over "underlay" areas. But instead of inspecting values in a mask, you can get extra performance by implementing a "run length" approach wherein runs of visible sprite bytes are packed together. The idea is to inspect the run-length byte (like the first byte of a Pascal string) and draw that many bytes; then inspect the skip-length byte of the next (empty) run, and skip over that many bytes; and so on. If you can just inspect length bytes rather than mask bytes, you can save cycles.
  • Use line-skip draw routines. Simply put, this means drawing every other line of the image, the way an interlaced NTSC television picture is drawn. By simply omitting half the drawn data, you cut the redraw time in half. (The user sees a dithered image.) If the blit area is small enough, you may be able to write directly to the screen (without tearing or flashing) at vertical retrace time, instead of writing to a back buffer. (When you write to a back buffer, of course, you're writing everything twice: once to the buffer, once to the screen.)
  • Draw 64 bits at a time - or however many bits the architecture will support. Someday there will doubtless be a 128-bit "long double" or "double double," the way there is now a 64-bit "long long." (If you don't know about long longs, consult your compiler documentation.) Until then, for best performance, you should always copy data to the screen as 64-bit doubles - never as anything shorter. All PPC chips have thirty-two floating-point registers and all can load a 64-bit double in one CPU cycle, so it makes sense to take advantage of the throughput potential that the architecture offers. Anything less represents wasted cycles.
  • Observe proper data boundary alignment. (Write to and from addresses that are evenly divisible by 4, 8, or 16 - whatever is appropriate to the architecture and the drawing mode.) Also try to make all window and sprite dimensions a multiple of 16 or 32. Most graphics accelerator boards are designed to deliver their best performance when this is the case.
  • Access data linearly (by incrementing pointers); avoid pointer arithmetic involving multiplications. Some applications even go so far as to maintain tables of line-start addresses, so that pointer addresses can be accessed via table lookup instead of calculated on the fly. (Depending on the chip architecture and cache performance, this tactic will either work like a charm or generate pipeline stalls.)
  • Use wide, shallow graphic elements in preference to tall, narrow ones. (There are more raster lines, and therefore more pointer arithmetic, in tall graphics.)
  • Implement your own custom drawing routines where appropriate, including, possibly, a replacement for CopyBits().
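The run-length idea in the pixel-skip item can be sketched in portable C. This is a minimal illustration, not a format taken from any shipping game: the packed layout (a draw-length byte, that many pixel bytes, then a skip-length byte, repeated until the row is full) and the names DrawPackedRow and DrawPackedSprite are assumptions made for the example, and pixels are taken to be 8 bits deep.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packed format for one sprite row (8-bit pixels):
 *   [drawLen][drawLen pixel bytes][skipLen][drawLen][...][skipLen] ...
 * The runs of a row are assumed to add up to exactly `width` pixels,
 * and the row ends as soon as that total is reached. */
const unsigned char *DrawPackedRow(const unsigned char *packed,
                                   unsigned char *dst, size_t width)
{
    size_t x = 0;

    while (x < width) {
        size_t drawLen = *packed++;        /* visible run: copy it */
        assert(x + drawLen <= width);
        memcpy(dst + x, packed, drawLen);
        packed += drawLen;
        x += drawLen;
        if (x >= width)
            break;                         /* row ended on a visible run */

        size_t skipLen = *packed++;        /* empty run: leave dst untouched */
        assert(x + skipLen <= width);
        x += skipLen;
    }
    return packed;                         /* first byte of the next row */
}

/* Draw `height` packed rows into a destination with the given row stride. */
void DrawPackedSprite(const unsigned char *packed, unsigned char *dst,
                      size_t width, size_t height, size_t dstRowBytes)
{
    size_t row;

    for (row = 0; row < height; row++) {
        packed = DrawPackedRow(packed, dst, width);
        dst += dstRowBytes;
    }
}
```

No mask is consulted anywhere: the only per-run decisions are the two length bytes, which is where the cycle savings come from. A run longer than 255 pixels would have to be split by the encoder (for example, a draw run of 255, a skip of 0, then the remainder).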

Getting the Most out of CopyBits

The Mac's main general-purpose blit utility is, of course, QuickDraw's venerable CopyBits() routine. Because so many OS and user processes rely so heavily on it, and because the entire Mac user experience hinges on its performance, CopyBits() has been very highly optimized. The bottom line is that CopyBits() gives very good performance and is actually quite hard to improve upon, if it's used properly.

To get the best performance from CopyBits(), you have to observe a few ironclad rules:

First, make sure the source and destination rectangles are exactly the same dimensions. One of the capabilities CopyBits() was designed to offer is dynamic image resizing with dithering and antialiasing. (This can actually be a very handy thing, in situations where you care more about antialiasing than speed.) If you provide source and destination Rects that are different sizes, CopyBits() stretches or shrinks the output accordingly and antialiases the result. But this means taking a major speed hit. So if performance is critical, don't make QuickDraw "dither down" your image.

Secondly, use a nil maskRgn. Again, one of the general-purpose capabilities of CopyBits() is to allow on-the-fly masking of image areas. But this, too, exacts a speed penalty. If you must do masking via QuickDraw, use CopyMask(); don't pass a maskRgn to CopyBits(). You'll find that CopyMask() does much faster masked blits. (Trivia note: Don't forget that CopyMask is one of a handful of QuickDraw calls that cannot be "recorded" between calls to OpenPicture and ClosePicture. If you need to make a PICT, use CopyBits.)

Thirdly, be sure source and destination PixMaps are 32-bit (or better yet, 64-bit) aligned. They should also have the same pixel depth (same color mode). And your transfer mode should be srcCopy, which is a direct load-and-store mode, as opposed to the arithmetic modes that allow various types of pixel blending.

Finally, be certain that the color tables are the same for the source and destination PixMaps. CopyBits() always examines the ctSeed field of the source and destination color tables to see if they differ (in which case color-table mediation will be called for). For best performance, coerce the ctSeed field of the source and destination color tables to the same value, with the following ghastly but essential C expression:

(*( (*(srcPixMap) )->pmTable) )->ctSeed =
	(*( (*( (*aGDevice)->gdPMap) )->pmTable) )->ctSeed;

Remember that CopyBits() always checks these two seed values. If they are not the same, QuickDraw will waste time translating color table info, which you don't want.

If you observe the foregoing rules, you will find that CopyBits() is quite hard to improve upon as a general blit routine. Hard, but not impossible. It turns out that if you write directly to the screen yourself, bypassing CopyBits(), you can sometimes achieve a 5% to 10% speed gain - but only if you ignore color tables, write 8-byte doubles, stay on properly aligned addresses, and keep source and destination rectangles the same size (with the width a multiple of 8). In other words, you have to make your code a good deal less general than CopyBits().

Direct-to-Video

Writing direct-to-screen on the Mac is not difficult. First you have to get the starting address of video memory, which can be done as follows:

PixMapHandle pmh;
Ptr 		videoMemoryAddr;
GDHandle 	mainDevice;

mainDevice = GetMainDevice();
pmh = (**mainDevice).gdPMap;
videoMemoryAddr = GetPixBaseAddr( pmh );

The Mac's screen is just a glorified PixMap, and a handle to this PixMap is contained in the GDevice record of each display device (and in the GrafPort record of every open window, incidentally). For safety, use the MacOS function GetPixBaseAddr() to get the base address.

Next, figure out the offset from the top left corner of the screen to the top left corner of the area you want to begin writing to. Multiply the raster-line start position (the global 'y' coordinate) by the screen's rowBytes value; then, to offset horizontally, add the desired horizontal start position multiplied by the pixel size (which will be one byte for 8-bit color, two for 16-bit color, and four for 32-bit color; but the pixelSize field of the PixMap gives the size in bits, not bytes, so divide by 8). Let's say you want to write to a starting position (in global coords) of [64, 100], which is to say 64 pixels from the left edge of the screen and 100 pixels down from the top. For this, you would do:

long horizOffset = 64, verticalOffset = 100;
Ptr writeAddr;

writeAddr = videoMemoryAddr; // obtain screen origin address as shown above
writeAddr += verticalOffset * ((**pmh).rowBytes & 0x3FFF);
writeAddr += horizOffset * ((**pmh).pixelSize/8);
	
// Note: PixelSize is in bits, not bytes. Divide by 8 to convert to bytes.

The rowBytes value tells you the number of bytes in one complete raster line, including any padding that QuickDraw might need for data alignment. The mask operation involving 0x3FFF requires a bit of explanation, if you're new to Mac programming. The two high-order bits of rowBytes are reserved for System use. The MacOS inspects these bit values to determine whether the pixel data are in the form of a black-and-white BitMap, or a true (color) PixMap. (This is an early Color QuickDraw hack, needed in order for PixMaps and BitMaps to be used interchangeably in Color QD routines. When the original B&W QuickDraw was first written, there were only BitMaps.) The important point is, don't forget to mask rowBytes against the hex value 0x3FFF in order to determine the true number of bytes in a raster line. If you fail to do this, you'll get strange bugs, because the raw value of rowBytes will usually be negative (rowBytes is a signed short int).
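To make the masking and offset arithmetic concrete, here is a small sketch that uses plain C types in place of the Toolbox structures. The helper names TrueRowBytes and PixelByteOffset are hypothetical, and the raw value 0x8280 is simply an illustrative rowBytes for a 640-byte raster line with the high flag bit set.

```c
#include <assert.h>

/* Strip the two reserved high-order bits from a raw rowBytes value. */
long TrueRowBytes(short rawRowBytes)
{
    return rawRowBytes & 0x3FFF;
}

/* Byte offset of pixel (x, y) from the PixMap's base address.
 * pixelSizeBits is the PixMap's pixelSize field; this sketch only
 * handles depths of 8 bits or more (whole bytes per pixel). */
long PixelByteOffset(short rawRowBytes, short pixelSizeBits, long x, long y)
{
    assert(pixelSizeBits >= 8 && pixelSizeBits % 8 == 0);
    return y * TrueRowBytes(rawRowBytes) + x * (pixelSizeBits / 8);
}
```

With a raw rowBytes of 0x8280, the value read as a signed short is negative, but masking yields 0x0280, or 640 bytes per line. For the [64, 100] example in 8-bit color, the offset is then 100 * 640 + 64 = 64064 bytes past the base address.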

Once you've calculated a start address, you can write to it - preferably 64 bits at a time. Determine how many pixels' worth of data you'll need to write, horizontally, then loop through raster lines, writing 64 bits at a time, as shown in Listing 1.

Listing 1: FastBlit( )

FastBlit()

Note: On entry, this function expects source and destination pointers to be precalculated (to reflect the locations of the upper left corners of the source and destination "write" areas); no pointer arithmetic is done inside this function. Also note that the blit area's width (in bytes) must be evenly divisible by 8 - a concession to speed.

void FastBlit(long depth, long doublesWide, 
			  Ptr GWorldAddr, Ptr screenAddr, 
			  long offRowBytes, long screenRowBytes )
{
	double *dst = (double *) screenAddr;
	double *src = (double *) GWorldAddr;
	long doublesAcross;
	long screenSkip, offscreenSkip;
	
	screenSkip = screenRowBytes/8 - doublesWide;
	offscreenSkip = offRowBytes/8 - doublesWide;
			
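	do {
		doublesAcross = doublesWide;
		do {
			*dst++ = *src++;
		} while (--doublesAcross);	// loop until this raster line is done

		src += offscreenSkip;		// step over bytes outside the blit area
		dst += screenSkip;

	} while (--depth);				// depth = number of raster lines to copy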
}

In this scenario, GWorldAddr is the source address for an offscreen GWorld. There is no need to make local copies of the input parameters, since the compiler will pass the values in registers (assuming you're compiling to a PPC target, that is). We set up do while (rather than for or while) loops to achieve smaller, tighter executable code; and we cast our source and destination pointers to pointers-to-double so that we can write 64 bits at a time. We also take care to access data linearly, eliminating multiplications from our pointer arithmetic. Result: The code shown above is 5% to 10% faster than CopyBits(), depending on monitor mode and image dimensions. That's not much of a speed improvement, admittedly, but if you need it, it's there.
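As a rough illustration of how the arguments to such a routine are derived, the following sketch applies the same loop structure to ordinary memory buffers. It is not the article's listing: the name BlitRows64 is hypothetical, and uint64_t is swapped in for double so the code can be exercised on any machine with a C99 compiler (on the PowerPC targets discussed here, the double-based copy is what moves 64 bits per load). Both pointers are assumed to be 8-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Copy a block `widthBytes` wide and `rows` deep, 8 bytes at a time.
 * widthBytes and both row strides must be multiples of 8. */
void BlitRows64(const uint64_t *src, uint64_t *dst,
                long widthBytes, long rows,
                long srcRowBytes, long dstRowBytes)
{
    long wordsWide = widthBytes / 8;
    long srcSkip = srcRowBytes / 8 - wordsWide;  /* words left over per line */
    long dstSkip = dstRowBytes / 8 - wordsWide;

    assert(widthBytes > 0 && rows > 0);
    assert(widthBytes % 8 == 0);
    assert(srcRowBytes % 8 == 0 && dstRowBytes % 8 == 0);

    do {
        long across = wordsWide;
        do {
            *dst++ = *src++;
        } while (--across);
        src += srcSkip;          /* jump to the start of the next line */
        dst += dstSkip;
    } while (--rows);
}
```

The skip values are the only place the row strides enter the picture: each is the stride in 8-byte units minus the number of units copied, so adding it after the inner loop lands the pointer at the same column on the next raster line.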

Optimizing Blit Code for PPC

If you grew up writing code for CISC chips, it might seem as though the code in Listing 1 could be optimized a bit further. First, why not declare all local variables as register variables? Secondly, why not unroll the inner loop? For that matter, why not write the whole thing in assembly language?

The reason we don't declare any register variables in our blit routine is that the compiler already knows to put everything in registers. (If you don't believe it, do an assembly dump.) Using the "register" keyword gets us no additional speed because on the PowerPC, almost everything is done in registers by default. Recall that the PPC chips all have 32 general-purpose 32-bit registers and another 32 "wide" (64-bit) floating-point registers. The first 8 integer registers and 13 floating-point registers are available for argument-passing, and most compilers will pass function parameters in these registers rather than on the stack. Likewise, if there are fewer than 224 bytes of local variables inside a function, the compiler will try to put all local variables in registers. The stack is avoided at all costs, because it means going out to the data bus, which on many computers runs at only 25% of the CPU speed.

The fastest way for code to execute is for all data and all code to stay inside the CPU at all times, where things happen at clock speed. Toward this goal, the designers of the G3 (PPC 750 series) chips put 32K of data cache and 32K of instruction cache on board the chip itself, so that the most recently used code and data can be accessed at clock speed. Of course, 32K isn't big enough to hold all your code or all your data, which is why the chip designers put a generous secondary cache (typically 512K or 1 MB) on the back side of the chip - the so-called "backside" cache. This cache is big enough to hold quite a bit of data - even some entire images - but to access it requires that you step down to one-half CPU clock speed. That's a big speed hit, but it's still not as bad as having to go out to DRAM via the main bus. On most Macs these days, the bus runs at either 66 MHz or 100 MHz. If your CPU is constantly requesting data from RAM, your computer is essentially running at 66 or 100 MHz, not the 300 or 400 MHz that the CPU may theoretically be capable of.

What it means is that you should group your main performance routines together, so that they stay in the cache; avoid static variables that require frequent trips to the bus; and be careful about unrolling loops. If you unroll a loop too far, it could fall out of the cache - in which case, you just scored a 50% speed hit.

Incidentally, if you want your routines to be close to each other in the cache, group them together sequentially in your C source. The Metrowerks compiler puts executables together in the order you write them. There is no need to use the segment pragma; in fact, that pragma only works when compiling to a 68K target.

Pipelining

Another important consideration on PPC targets is pipelining. The processing units of the G3 chips have separate facilities for fetching, decoding, and executing instructions. These facilities are designed to operate concurrently, which is to say that while one instruction is executing, the next one is being fetched and another one is being decoded - under ideal circumstances. When data and code can be fetched directly from the chip's onboard (32K) cache areas, circumstances are pretty close to ideal and the PPC pipeline can process one instruction per clock cycle. But when data has to be fetched from RAM via the bus, everything screeches to a halt as the CPU waits for data to arrive. This is called a pipeline stall.

A good compiler will analyze your code and anticipate possible pipeline stalls, then try to interleave or reorder instructions as needed to give the CPU something to do while data is being retrieved. But you can easily thwart the compiler's best efforts by, for example, insisting on putting one load/store operation after another after another in your code - i.e., by unrolling a data-copy loop.

Take our blit routine, for example. An assembly dump of the main loop from Listing 1 is shown in Listing 2. (For clarity, we've omitted half a dozen lines of setup code.) Note first of all that the assembly language for our nested double loop is only eleven lines long, which is not bad. (We save a few lines by using the do while construct in place of a for loop.) The first line is a register-move (mr) that loads our inner-loop counter variable into r8. The second line is a load-floating-double (lfd) instruction using the (source) address stored in r5. But notice one thing: The "write" (or store-floating-double: stfd) instruction doesn't occur until three lines later. In between the load and store instructions are an add-immediate (addi) and a subtract-immediate-with-carry (subic) operation. The add operation corresponds, of course, to a pointer post-increment in C, while the subtract-with-carry is a decrement of our loop counter. After the store operation comes another pointer post-increment (notice that the address is increased by 8, because we're operating on doubles), then the branch-if-not-equal instruction.

What's happened here is that the compiler has decided (quite correctly) that while the load instruction is executing, the processor might just as well do some pointer and loop-counter arithmetic before executing the store instruction, because the load will take a while (requiring a RAM access - or perhaps a backside-cache access). Since the chip has separate load/store and processing units, these operations can occur concurrently. In other words, the intervening arithmetic operations between the read (load) and write (store) cost us nothing. Meanwhile, the chip's branch unit has been watching the condition-register bits that were updated during the loop-counter decrement (the trailing dot on the subic. instruction means "record the result in the condition register," and bne tests that result), so that by the time we get to the branch point, the chip's branch unit already "knows" where to take us next. Thus, the branch costs us nothing. (On the PPC 750, the branch unit operates concurrently with processing units.) This is a good example of how instruction interleaving can be exploited for maximum performance on a PPC host. There are no pipeline stalls, because processing continues even while a RAM access is taking place.

Listing 2: FastBlit( ) Disassembled

FastBlit.asm

Note: This is PPC assembly code generated by the Metrowerks compiler. Comments by the author. (See article text for discussion.)

00000020:   mr       r8,r4		; loop counter setup
00000024:   lfd      fp0,0(r5)	; read from input
00000028:   addi     r5,r5,8		; src pointer post-increment
0000002C:   subic.   r8,r8,1		; decrement loop counter
00000030:   stfd     fp0,0(r6)	; write output
00000034:   addi     r6,r6,8		; dst pointer post-increment
00000038:   bne      *-20		; loop condition test
0000003C:   add      r5,r5,r7	; pointer offset arithmetic
00000040:   add      r6,r6,r0	; pointer offset arithmetic
00000044:   subic.   r3,r3,1		; decrement loop counter
00000048:   bne      *-40		; loop condition test

Now let's consider what happens when we try to unroll the loop. Take a look at Listing 3, which is an assembly dump of a version of Listing 1 in which the inner loop has been unrolled four times.

It may not seem like it at first, but this code is nowhere near as efficient as that of Listing 2. The reason is that the many close-together load/store operations are almost certain to generate pipeline stalls. A little profiling confirms that there is no speed gain from unrolling the loop.

Listing 3: Unrolled Blit Disassembled

UnrolledBlit.asm

Note: This is PPC assembly code generated by the Metrowerks compiler. See article text for discussion.

00000018:   mr       r0,r4
0000001C:   lfd      fp0,0(r5)	; read
00000020:   addi     r5,r5,8		; bump
00000024:   stfd     fp0,0(r6)	; write (stall)
00000028:   addi     r6,r6,8		; bump
0000002C:   lfd      fp0,0(r5)	; read (stall)
00000030:   addi     r5,r5,8		; bump
00000034:   stfd     fp0,0(r6)	; write (stall)
00000038:   addi     r6,r6,8		; bump
0000003C:   lfd      fp0,0(r5)	; read (stall)
00000040:   addi     r5,r5,8		; bump
00000044:   stfd     fp0,0(r6)	; write (stall)
00000048:   addi     r6,r6,8		; bump
0000004C:   lfd      fp0,0(r5)	; read (stall)
00000050:   addi     r5,r5,8		; bump
00000054:   stfd     fp0,0(r6)	; write (stall)
00000058:   addi     r6,r6,8
0000005C:   subic.   r0,r0,1
00000060:   bne      *-68
00000064:   slwi     r0,r7,3
00000068:   add      r5,r5,r0
0000006C:   slwi     r0,r8,3
00000070:   add      r6,r6,r0
00000074:   subic.   r3,r3,1
00000078:   bne      *-96

Customizing QuickDraw

Most of the time, you'll be hard pressed to beat CopyBits(). But if you do manage to beat CopyBits(), you can (and should) consider installing your own blit routine as a QuickDraw bottleneck proc, replacing CopyBits(). Maybe you didn't know it, but QuickDraw is extensible (thanks to some nice design work, circa 1983, by Bill Atkinson). There are 13 low-level "bottleneck" functions that QuickDraw uses to do things like draw lines, rectangles, ovals, etc. (See Table 1.) One of the standard bottleneck primitives is called StdBits(). This is the low-level blit function that CopyBits() ultimately vectors to. You can install your own replacement function here, and QuickDraw will automatically vector to it when your program needs to call on CopyBits(). This is similar to patching a trap, except that Apple (or Atkinson) designed the QD bottleneck jump table to be wholesale-replaceable on a window-by-window basis. By "wholesale-replaceable," we mean that the entire jump table (containing addresses for all of the QD drawing primitives) can and indeed must be replaced at once. The relevant data structure is the CQDProcs struct (see Listing 4).

Table 1: QuickDraw Bottleneck Proc Prototypes

pascal void StdText(short byteCount, Ptr textBuf, Point numer, Point denom);
pascal void StdLine(Point newPt);
pascal void StdRect(GrafVerb verb, const Rect *r);
pascal void StdRRect(GrafVerb verb, const Rect *r, short ovalWidth, short ovalHeight);
pascal void StdOval(GrafVerb verb, const Rect *r);
pascal void StdArc(GrafVerb verb, const Rect *r, short startAngle, short arcAngle);
pascal void StdPoly(GrafVerb verb, PolyHandle poly);
pascal void StdRgn(GrafVerb verb, RgnHandle rgn);
pascal void StdBits(const BitMap *srcBits, const Rect *srcRect, const Rect *dstRect, short mode, RgnHandle maskRgn);
pascal void StdComment(short kind, short dataSize, Handle dataHandle);
pascal short StdTxMeas(short byteCount, const void *textAddr, Point *numer, Point *denom, FontInfo *info);
pascal void StdGetPic(void *dataPtr, short byteCount);
pascal void StdPutPic(const void *dataPtr, short byteCount);

Listing 4: CQDProcs Data Structure

struct CQDProcs {
	QDTextUPP						textProc;
	QDLineUPP						lineProc;
	QDRectUPP						rectProc;
	QDRRectUPP						rRectProc;
	QDOvalUPP						ovalProc;
	QDArcUPP						arcProc;
	QDPolyUPP						polyProc;
	QDRgnUPP						rgnProc;
	QDBitsUPP						bitsProc;
	QDCommentUPP					commentProc;
	QDTxMeasUPP					txMeasProc;
	QDGetPicUPP					getPicProc;
	QDPutPicUPP					putPicProc;
	QDOpcodeUPP					opcodeProc;				
	UniversalProcPtr				newProc1;
	UniversalProcPtr				newProc2;
	UniversalProcPtr				newProc3;
	UniversalProcPtr				newProc4;
	UniversalProcPtr				newProc5;
	UniversalProcPtr				newProc6;
};
typedef struct CQDProcs CQDProcs, *CQDProcsPtr;

Installing a custom bottleneck proc is actually quite simple. Listing 5 shows how it's done. The key is to realize that every window has its own set of bottleneck procs, accessible through the grafProcs field of the GrafPort structure. You replace the entire bottleneck jump table (containing pointers to all 13 low-level drawing functions) all at once, even if you only need to customize just a single bottleneck procedure. When you no longer need your custom bottlenecks, simply nil out the grafProcs field of the window's GrafPort structure, and QuickDraw will know to revert to its own default proc table.

Listing 5: SetupCustomBottleneck()

SetupCustomBottleneck()

A function to attach a new set of QuickDraw procs to a window.

CQDProcs qdNewProcs; // globals

void SetupCustomBottleneck( CWindowPtr w) {

	SetStdCProcs( &qdNewProcs ); // fetch copy of default procs

			// Now replace CopyBits with our own custom routine:
	qdNewProcs.bitsProc = NewQDBitsProc( CustomBlit );
	w->grafProcs = &qdNewProcs; // install new procs	
}

Listing 6: CustomBlit()

CustomBlit()

WARNING: This is a custom routine that expects screen to be in 8-bit color mode; also, image must be 640x480. These numbers are hard-coded for speed. This is NOT a general-purpose routine. Use with caution.

void CustomBlit(BitMap *srcBits, 
				Rect *srcRect,
				Rect *dstRect,
				short mode,
				RgnHandle regionH) 
{
#pragma unused (srcRect,dstRect,mode,regionH)				
					
	double 		*dst; 
	double 		*src = (double *) srcBits->baseAddr;
	long 			rows;
	long 			yeaManyAcross;
	long			srcSkip, dstSkip;
	
	
// The following have all been previously cached in globals:
	dst = gDestAddr;
	rows = gRows;
	srcSkip = gSrcSkip;
	dstSkip = gDestSkip;
	
	// * * * * * * * * * * * BEGIN BLIT * * * * * * * * * * * 	
	do {
		yeaManyAcross = 640/8;		// 640 bytes per line, 8 bytes per double

		do {
			*dst++ = *src++;
		} while ( --yeaManyAcross );
		dst += dstSkip;
		src += srcSkip;
	} while ( --rows );
	
	// * * * * * * * * * *  END BLIT  * * * * * * * * * * *
}

Listing 6 shows a custom blit routine, hard-coded as to window dimensions and bit depth, with certain key parameters pre-cached in globals. (With any luck, those values will stay in the data cache - or else the backside cache - for fast access when you most need them.) You can think of these globals as your very own "QuickDraw globals."

You may have noticed that the arguments to QuickDraw's low-level grafProcs don't include a destination address. That's because the procs apply only to the current window (the current GrafPort). It's assumed that you're writing into the current port. Remember, at this low level, there's no need to call SetPort()!

Using hard-coded values in a low-level routine without error checking is obviously somewhat dangerous, but it's necessary if you want maximum speed. Plus, you have to remember that for greater flexibility, you can - and should - develop multiple custom-draw functions, tailored to various circumstances, so that you can vector to the right one at the appropriate moment. Also, it's good to know that QuickDraw will use your custom blitter only in the window you specify. (Again, every window or GrafPort has its own grafProcs.) Thus, if the user temporarily leaves your game in order to visit the Finder or another application, the other application(s) will still draw correctly into their own windows. Likewise, if the user leaves the main gameplay window to look at a dialog window in your own program, the dialog will draw correctly, using QuickDraw's default proc table.

Bear in mind that you can replace any of the QuickDraw primitives you need to. For example, if your game could benefit from a special arc-drawing routine, you can install your own arcProc. If you've got something better or faster than the Bresenham algorithm, you can install your own ovalProc and/or lineProc, etc.

One of the benefits of replacing QuickDraw's grafProcs is that it lets you keep using native QD calls like LineTo and CopyBits in your code. This helps with code reusability as well as readability. After you've installed your own custom blitter in place of CopyBits, you can just keep calling CopyBits throughout your code. If you come up with a better blitter later on, you can update your code just by changing one grafProc pointer.

Extreme Measures

If you're looking for more than just a small incremental improvement over CopyBits (i.e., you want to be able to blit hundreds of 640x480-or-larger frames per second), you'll need to resort to extreme measures - such as (for example) line-skip drawing and/or pixel doubling.

To implement line-skip drawing (interlacing), you can just add a few lines of code to the custom blitter in Listing 1:

	if ((gPolarity++ & 1L) == 1L)	// parenthesized: == binds tighter than &
	{	src += offRowBytes/8; dst += screenRowBytes/8; }
	
	screenSkip += screenRowBytes/8;
	offscreenSkip += offRowBytes/8;

These lines should go immediately before the main (outer) loop. The static or global variable gPolarity will keep an "odd-even" counter going, the idea being that on odd-numbered calls to the blitter, you'll offset the source and destination pointers one raster line deep into the image. And every time the routine is called, you'll calculate the "skip" values to include one extra raster line, so that you draw every other line of the image. (Because each pass of the outer loop now advances two raster lines, the row count passed in depth must be halved to match, or the loop will run past the bottom of the image.) When you do this, of course, your redraw rate doubles, because now you're handling only half as many bytes of data.
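A self-contained sketch of the interlacing idea, written against ordinary memory buffers, might look like the following. The function name BlitField64 and its parameters are hypothetical, strides are given in 8-byte words rather than bytes, and uint64_t stands in for double so the code is portable. The caller chooses which field (even or odd lines) to draw.

```c
#include <assert.h>
#include <stdint.h>

/* Copy every other raster line of a `rows`-deep block.
 * field = 0 draws lines 0, 2, 4, ...; field = 1 draws lines 1, 3, 5, ...
 * Widths and strides are measured in 8-byte words. */
void BlitField64(const uint64_t *src, uint64_t *dst,
                 long wordsWide, long rows,
                 long srcRowWords, long dstRowWords, int field)
{
    long linesToDraw;

    assert(wordsWide > 0 && rows > 0);
    assert(field == 0 || field == 1);

    linesToDraw = (rows - field + 1) / 2;   /* lines in this field */
    if (linesToDraw == 0)
        return;

    src += field * srcRowWords;             /* odd field starts one line down */
    dst += field * dstRowWords;

    for (;;) {
        const uint64_t *s = src;
        uint64_t *d = dst;
        long across = wordsWide;

        do {
            *d++ = *s++;
        } while (--across);

        if (--linesToDraw == 0)
            break;
        src += 2 * srcRowWords;             /* skip the line we are not drawing */
        dst += 2 * dstRowWords;
    }
}
```

A caller would typically alternate fields from frame to frame, for instance by passing `gPolarity++ & 1` as the last argument, where gPolarity is a static or global counter as described above.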

The interlaced redraw technique works very well for underlays and slow-moving objects, but as you can probably imagine, it will yield ghosting artifacts if the object that's being drawn is moving across the screen at an appreciable rate. (With a little ingenuity, you can probably think of workarounds for this - or maybe put the effect to good use.)

Another common speed-multiplying technique is pixel doubling, which is where each pixel of the source image (whether it's a sprite, icon, underlay image, or whatever) is drawn as a 2x2 tile onscreen. Essentially, you're scaling a quarter-size image up to full size - hence, the potential exists for a 4:1 speed boost. The downside to this technique is that it gives a "chunky pixel" look that can be annoying; but there happens to be a useful workaround, in the form of the 'epx' antialiasing technique pioneered by LucasArts' Eric Johnston. Figure 1 shows how it works.


Figure 1. The 'epx' antialiasing algorithm. (see text for discussion).

The problem is this: how to scale pixel 'P' to be four times its original size, but without all four new pixels necessarily being the same color. (We want some antialiasing.) In Figure 1, the tic-tac-toe grid on the left represents the pixel 'P' and its north, south, east, and west neighbors in the source (offscreen buffer) image. The 2x2 tile on the right represents the new (onscreen) pixels, which will derive from P. The question is how to color P1, P2, P3, and P4 without giving the "chunky pixel" look.

The answer is to base P1 on the back-buffer pixels 'A' and 'C'; base P2 on 'A' and 'B'; base P3 on 'C' and 'D'; and base P4 on 'D' and 'B'. We say that 'A' and 'C' are the "parents" of P1, 'D' and 'B' are the parents of P4, etc., and likewise, P1 is the child of 'A' and 'C', and so forth. The rule to follow is this: If any two parents are equal in color, then make the child that color. Otherwise, let the child be the color of 'P'. Also, if at least three of the parents (A, B, C, D) are identical, do nothing: just let P1 = P2 = P3 = P4 = P. (Johnston discusses this technique on page 692 of Tricks of the Mac Game Programming Gurus, Hayden Books, 1995.)

The 'epx' technique is not written in stone; you can and should experiment with modifications to it, to suit your game's graphics. It's more of an ad hoc heuristic than a theory-based algorithm. But it gives worthwhile results. Many shipping games use this cheap, easy AA technique.

Compressed Graphics

An even more extreme way of speeding things up is to store your graphics in a highly compressed state and decompress them to the screen at blit time. This is the basis of most "byte-packed" or "run length" sprite drawing techniques. It's also how QuickTime gets much of its speed.
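To make the idea concrete, here is a rough sketch of a run-length blit for a single row of 8-bit pixels. The packed format is invented for this example and is not any real sprite or QuickTime format: a series of (count, value) byte pairs ended by a count of zero, where a value of kTransparent means "skip this many pixels." The function name BlitRunLength8 is likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pixel value meaning "leave the destination untouched". */
enum { kTransparent = 0 };

/* Decompress one packed row directly into dst.
   packed: (count, value) byte pairs, terminated by a count of 0.
   Returns the number of destination pixels covered (drawn or skipped). */
size_t BlitRunLength8(const uint8_t *packed, uint8_t *dst, size_t dstLen)
{
    size_t pos = 0;

    while (packed[0] != 0 && pos < dstLen) {
        size_t count  = packed[0];
        uint8_t value = packed[1];

        /* Never write past the end of the destination row. */
        if (count > dstLen - pos)
            count = dstLen - pos;

        /* Transparent runs cost nothing but a pointer bump. */
        if (value != kTransparent)
            memset(dst + pos, value, count);

        pos += count;
        packed += 2;
    }
    return pos;
}
```

A full sprite would call a routine like this once per row, advancing the destination pointer by the screen's rowBytes each time. The saving comes from reading far fewer source bytes and from skipping transparent pixels entirely instead of testing each one.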

Imagine, for a moment, that you could store all of your game's static graphics (backgrounds, underlays, sprites of fixed dimensions, etc.) in 10:1 or 20:1 compressed form in RAM. When you need to draw them, you send the compressed data over the bus to video memory and have the images decompress to the screen with the aid of hardware support on the video board. This is exactly what happens on many accelerator boards that support QuickTime. According to Dispatch No. 8 from the Ice Floe (Apple's QuickTime tech notes, available online in the QuickTime area of Apple's developer pages), you can take full advantage of this technique - hardware permitting - by simply using QuickTime's DecompressImage() call in place of CopyBits. Full details are available at <http://www.apple.com/quicktime/developers/icefloe/dispatch008.html>.

Sprocketry

If you're serious about obtaining better screen-redraw performance - whether for a game or for any other purpose - you owe it to yourself to investigate Apple's DrawSprocket library (see http://developer.apple.com/games/sprockets). The DrawSprocket routines are extremely logical and easy to use, and they greatly simplify things like blanking and restoring the screen (including the desktop and menu bar), setting the color depth, implementing gamma fades, double and triple buffering, blitting at controlled cinematic rates, etc. Also, the DrawSprocket library takes advantage of hardware support for page-flipping when available. Sprocket blitting is so efficient, you probably won't gain anything by installing a custom blit routine of your own (unless you're into extreme techniques), because the sprocket library will use a highly customized routine - customized for the exact graphic environment of your game.

DrawSprocket is the "drawing" portion of Apple's Game Sprockets. (There are other sprockets for audio, networking, etc. - all of them royalty-free.)

Conclusion

Although we didn't have time to discuss it, many of the techniques mentioned in this article are used in a small Metrowerks C code project called BlitsKrieg developed for this article and available online at <ftp://www.mactech.com>. Be forewarned that BlitsKrieg is simply a no-frills test program that draws a PICT to the screen numerous times, and reports the elapsed time in ticks. There is no event loop, no menu bar, etc. It's strictly a quickie prototype for testing different blit techniques, but it does contain example code showing how to replace CopyBits with a custom routine, how to do interlaced blits, and how to call QuickTime's DecompressImage. Use at your own risk.

In conclusion, I'd like to reiterate a bit of advice once given to me by a mentor who was (and still is) a ninja-level master at making code go fast. His chief insight, which I have benefitted from many times, is that since the CPU can only execute a fixed number of instructions per second, and no more than that number, there is really no such thing as "making the machine go faster." There is only such a thing as making the machine do less.

To go faster, do less. Remember that, next time you're slamming dumptruck-loads of pixels to the screen.


Kas Thomas <tbo@earthlink.net> has been a Macintosh user since 1984 and has been programming in C and assembly on the Mac since 1989. He is working on a variety of After Effects plug-ins and would like to hear from anybody who is doing the same.

 
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.