Aug 99 Tips
Volume Number: 15 (1999)
Issue Number: 8
Column Tag: Tips & Tidbits
Tips and Tidbits
by Jeff Clites, firstname.lastname@example.org
The June issue of MacTech highlighted several ways to move pixels around, but it did not mention a little-known pair of PowerPC instructions, lmw and stmw (load multiple word and store multiple word), that can speed up pixel blitting (and, for that matter, any memory-moving function).
When the data is aligned, moving a single word the normal way takes 4 cycles. lmw and stmw are far more efficient because they take 3 + n cycles to move n words, so each additional word costs only one more cycle. The instructions are implemented differently on different PowerPC chips: the 601 treats them like a series of individual word loads or stores, while the 603e, 604, and G3 treat them as a true multiple-word move.
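To make the arithmetic concrete, here is a small C sketch of the cycle model described above. It simply encodes the figures quoted in the text (about 4 cycles per word moved the normal way, 3 + n cycles for an n-word lmw or stmw, aligned data assumed); the function names are hypothetical and the numbers are estimates, not measurements.

```c
#include <assert.h>

/* Cycle model from the text (aligned data assumed):
   moving words one at a time costs about 4 cycles per word. */
unsigned SingleWordCycles(unsigned nWords)
{
    return 4u * nWords;
}

/* One lmw or stmw moving n words costs about 3 + n cycles.
   A single lmw/stmw can cover at most the 32 GPRs. */
unsigned MultipleWordCycles(unsigned nWords)
{
    assert(nWords >= 1 && nWords <= 32);
    return 3u + nWords;
}
```

Under this model, the six-word (24-byte) chunks used in SpeedCopy cost about 9 cycles each instead of 24.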
There are a few restrictions on the following code:
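- It works fastest when both baseAddrs are word-aligned (multiples of 4 bytes) and both rowbytes values are multiples of 4, so that every row stays aligned.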
- The code is for 8-bit pixels (tip: change the constant loaded into r23 to 12 for 16-bit pixels and to 6 for 32-bit pixels).
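- The width must be a multiple of 24 for 8-bit pixels (12 for 16-bit pixels and 6 for 32-bit pixels). If it isn't, the last pass on each row writes past the row's right edge, by up to 23 bytes at 8 bits (22 bytes at 16 bits, 20 bytes at 32 bits). Most of that overrun lands in the row padding or at the start of the next scan line, which is overwritten when that line is copied, but the last row writes past the end of the destination. You have been warned! (Give any destination world a few spare words beyond the end of its pixel data.)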
- This assumes the same palette for the source and destination worlds (in 8-bit mode).
- This assumes the same depth for the source and destination worlds.
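The width restriction and the r23 tip both follow from the fact that every inner-loop pass moves exactly 24 bytes. The hypothetical C helpers below (not part of the original routine) compute the pixels consumed per pass and how many bytes a given width overruns each destination row; they assume only the 8-, 16-, and 32-bit depths the text mentions.

```c
#include <assert.h>

/* Every pass of SpeedCopy's inner loop moves 24 bytes (six 32-bit words). */
enum { kBytesPerPass = 24 };

/* Pixels consumed per pass, i.e. the value to load into r23:
   24 at 8 bits, 12 at 16 bits, 6 at 32 bits. */
long PixelsPerPass(int bitsPerPixel)
{
    assert(bitsPerPixel == 8 || bitsPerPixel == 16 || bitsPerPixel == 32);
    return kBytesPerPass / (bitsPerPixel / 8);
}

/* Bytes written past the right edge of each row when the width is not a
   multiple of PixelsPerPass(); 0 when the width restriction is met. */
long OverrunBytes(long width, int bitsPerPixel)
{
    long perPass = PixelsPerPass(bitsPerPixel);
    long remainder = width % perPass;

    if (remainder == 0)
        return 0;
    return (perPass - remainder) * (bitsPerPixel / 8);
}
```

For example, a 100-pixel-wide 8-bit copy takes five passes (120 bytes), so each row overruns by 20 bytes.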
The following routine shows how to move pixels about three times faster than the fastest method presented in "Fast Blit Strategies."
tc SpeedCopy[TC], SpeedCopy[DS] ;TOC entry "SpeedCopy" for
;transition vector "SpeedCopy"
csect SpeedCopy[DS] ;Define transition vector "SpeedCopy"
dc.l .SpeedCopy[PR] ;Pointer to code
dc.l TOC[tc0] ;Pointer to TOC
;Prolog: SpeedCopy
;void SpeedCopy(long height, long width, long destRowbytes,
; unsigned long *dest, long srcRowbytes, unsigned long *src);
;height and width must both be greater than 0
csect .SpeedCopy[PR] ;Prolog begins here
;r3 = dest.height
;r4 = dest.width
;r5 = dest.rowbytes
;r6 = dest.baseAddr
;r7 = src.rowbytes
;r8 = src.baseAddr
stmw r22,-40(SP) ;save r22 thru r31 (10 words, 40 bytes) below SP
li r22, 1
li r23, 24 ;pixels per pass: 24 (8-bit), 12 (16-bit), 6 (32-bit)
@lineLoop
mr r12,r4 ;x = dest.width
mr r10,r6 ;tmpdest = dest
mr r11,r8 ;tmpsrc = src
@pixelLoop
lmw r26,0(r11) ;Load 4 + 4 + 4 + 4 + 4 + 4 bytes from
;tmpsrc into r26 thru r31
subf. r12,r23,r12 ;Subtract num Pixels from total, test against 0
addi r11,r11,24 ;Add 24 bytes to tmpsrc, whatever the
;pixel size
stmw r26,0(r10) ;Move pixels from r26 thru r31 to screen
addi r10,r10,24 ;Add 24 bytes to tmpdest, whatever the
;pixel size
bgt @pixelLoop ;Loop if the subtraction is greater than 0
subf. r3, r22, r3 ;Subtract one line from height, test against 0
add r8,r8,r7 ;Add src.rowbytes to src.baseaddr
add r6,r6,r5 ;Add dest.rowbytes to dest.baseaddr
bne @lineLoop ;Loop if height not equal to 0
lmw r22,-40(SP) ;Restore r22 thru r31
blr ;Return to caller
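For readers who want to check the loop logic without a PowerPC assembler, here is a portable C sketch of what the routine is meant to do: six 32-bit words (24 bytes) per inner pass, with destination rowbytes in r5 and source rowbytes in r7, as the register comments state. The name SpeedCopyC is hypothetical, uint32_t stands in for the 32-bit unsigned long of the original, and the plain word copies only mirror the data movement of lmw/stmw, not their speed.

```c
#include <assert.h>
#include <stdint.h>

/* Portable sketch of SpeedCopy for 8-bit pixels.
   Arguments follow the register roles: r3 height, r4 width,
   r5 destRowbytes, r6 dest, r7 srcRowbytes, r8 src. */
void SpeedCopyC(long height, long width, long destRowbytes, uint32_t *dest,
                long srcRowbytes, const uint32_t *src)
{
    const long pixelsPerPass = 24;        /* value in r23 for 8-bit pixels */

    assert(height > 0 && width > 0);      /* both loops are do-while loops */
    do {
        uint32_t *d = dest;               /* tmpdest (r10) */
        const uint32_t *s = src;          /* tmpsrc (r11) */
        long x = width;                   /* r12 */
        do {
            /* Equivalent of lmw r26 / stmw r26: six words per pass. */
            d[0] = s[0]; d[1] = s[1]; d[2] = s[2];
            d[3] = s[3]; d[4] = s[4]; d[5] = s[5];
            s += 6;
            d += 6;
            x -= pixelsPerPass;
        } while (x > 0);
        /* Advance each baseAddr by its rowbytes (a byte count). */
        src = (const uint32_t *)((const char *)src + srcRowbytes);
        dest = (uint32_t *)((char *)dest + destRowbytes);
    } while (--height != 0);
}
```

Copying three 24-pixel rows between buffers with 32-byte rows, for instance, fills the first 24 bytes of each destination row and leaves the last 8 untouched.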
With a few changes, this code can pixel-double, copy every other line, or both.
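The text does not show those changes, so the following is only one interpretation, sketched in plain C for 8-bit pixels: "pixel-double" is taken to mean writing each source pixel twice horizontally, and "every other line" to mean advancing the destination by two rows per source row. PixelDoubleC and its skipLines flag are hypothetical; a fast version would still pack the results into registers for stmw, which this byte loop does not attempt.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant: each 8-bit source pixel is written twice
   horizontally. When skipLines is nonzero, the destination advances two
   rows per source row, so only every other destination line is written. */
void PixelDoubleC(long height, long width, long destRowbytes, uint8_t *dest,
                  long srcRowbytes, const uint8_t *src, int skipLines)
{
    long y, x;
    long destStep = skipLines ? 2 * destRowbytes : destRowbytes;

    assert(destRowbytes >= 2 * width);    /* doubled row must fit */
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            dest[2 * x] = src[x];
            dest[2 * x + 1] = src[x];
        }
        src += srcRowbytes;
        dest += destStep;
    }
}
```

Doing "both" is then just a matter of passing a nonzero skipLines.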
The best way to make a machine go faster is to make it do less. This is one of only a handful of cases where you can do more with less.