
Parallel Addition
Volume Number:10
Issue Number:10
Column Tag:Programmer’s Challenge in-depth

RGB to YUV Using Parallel Addition

Some unique approaches to optimization

By Robert Munafo, Malden, MA

Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.

About The Author

Robert Munafo - Robert works for VideoGuide, a startup in the Boston area. Prior to that, he developed drivers and embedded software for GCC Technologies’ printer products. Robert’s been writing free Mac software since 1984. One of the first public-domain games for the Mac, Missile, continues to run on every new model Apple releases! He also became well-known for his shareware effort Orion and the free utility Icon Colorizer. He spends most of his spare time on the Mac gronking the inner loops of the Mandelbrot Set and various compute-intensive simulations. He awaits the day when massively parallel desktop computers will surpass the TFLOPS (trillion floating-point operations per second) milestone. You can reach him via e-mail at mrob@world.std.com.

Doing more in fewer cycles

This article contains the actual winning code for July’s Color Space Conversion Programmer’s Challenge. Robert had sent in his code before the deadline but for some reason the SANE calls he made during his RGBtoYUVInit routine caused both my Macs to crash. I wasn’t able to identify the exact cause of the crash other than to witness that it wasn’t his code. I suspect it had to do with Omega SANE’s backpatching (self-modifying code) but I’m not sure. In any case, Robert was given a chance to submit a new version of RGBtoYUVInit only (which was not part of the timings) that didn’t use SANE. He did so and ended up being about 27% faster than the published winner Bob Noll. As you will see, Robert’s explanation and use of parallel addition is excellent (and fast!). I highly recommend studying it if you need to do fast matrix multiplication with constant coefficients.

- Mike Scanlin

When I created my entry for the July 1994 Programmers’ Challenge, I used some novel optimization techniques which are explained here. For background material, see the contest statement in the July 1994 issue, page 44, and the results presented in the September 1994 issue.

I will briefly restate the challenge. It involved converting a large number of [R,G,B] values into [Y,U,V] (a color system used in JPEG and NTSC, among others) using the formula:

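$$
\begin{bmatrix} Y \\ U \\ V \end{bmatrix}
=
\begin{bmatrix}
 0.29900000 &  0.58700000 &  0.11400000 \\
-0.16873590 & -0.33126410 &  0.50000000 \\
 0.50000000 & -0.41868760 & -0.08131241
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
$$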

Each entry consisted of an init routine that would not be timed and an RGBtoYUV routine that would take arrays of [R,G,B] values and output arrays of [Y,U,V] values. As always in the Programmers’ Challenge, accuracy is most important, followed by speed, code size, and elegance.

Analysis of Rounding

The first problem to solve involved figuring out how various types of rounding would affect computed results. The challenge required producing output that was equivalent to the results that would be produced when infinite precision is used. Fractions of N.5 (e.g. 2.5 or -2.5) would be rounded down, and anything else would be rounded to nearest. The problem clearly required the use of limited-precision integer math, so I had to figure out how much precision was necessary to produce acceptable results.

I conducted a brute-force search and determined that out of all possible Y, U, and V values produced by the transform matrix, the closest fraction to N.5 was N.499000 or N.501000. In other words, to distinguish an N.5 result from all other results, the math must be precise enough to distinguish differences as small as 0.001, or about $2^{-10}$. Since 8 bits are used for the integer portion of the answer, at least 18 bits are needed. The simplest way to get the right answer is to use 19 bits (or more) to compute the value, then add 0.4995, then chop off the fraction. This works because 0.4995 equals 0.5 - (0.001 / 2). Here are three examples:

2.501 + 0.4995 = 3.0005 -> 3

2.500 + 0.4995 = 2.9995 -> 2

2.499 + 0.4995 = 2.9985 -> 2
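The add-then-chop rule can be sketched in C with fixed-point integers. This is only a minimal sketch, assuming 12 fractional bits, so that 0.4995 becomes 2046/4096 (the 0x7fe constant that appears in Listing 2). The names to_fx12 and round_half_down_fx12 are made up for this illustration, and only non-negative inputs are handled.

```c
#include <stdint.h>

#define FRAC_BITS    12      /* assumed number of fractional bits */
#define ROUND_ADJUST 0x7fe   /* 2046/4096, approximately 0.4995 */

/* Convert thousandths (e.g. 2501 for 2.501) to 12-bit fixed point,
 * rounding to the nearest representable value. Non-negative input only. */
static int32_t to_fx12(int32_t thousandths)
{
    return (thousandths * 4096 + 500) / 1000;
}

/* Add 0.4995 and chop off the fraction. Non-negative input only, so the
 * right shift is well defined. */
static int32_t round_half_down_fx12(int32_t fx)
{
    return (fx + ROUND_ADJUST) >> FRAC_BITS;
}
```

Feeding in the three examples (2501, 2500 and 2499 thousandths) should reproduce the results 3, 2 and 2.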

When considering negative values, there were technically two ways to interpret the problem’s statement of “.5 rounding down to zero”. The more likely interpretation is that we should round towards zero, with -2.5 rounding up to -2.0 and 2.5 rounding down to 2.0. However, consider what happens to the U values as the following sequence of [R,G,B] triples is transformed:

    [R,G,B]     U     Rounded
    8,8,2     -3.0      -3
    8,8,3     -2.5      -2
    8,8,4     -2.0      -2
    8,8,5     -1.5      -1
    8,8,6     -1.0      -1
    8,8,7     -0.5       0
    8,8,8     +0.0       0
    8,8,9      0.5       0
    8,8,10     1.0       1
    8,8,11     1.5       1
    8,8,12     2.0       2

As you can see, what ought to be a steady sequence of U values {-3, -2, -2, -1, -1, 0, 0, 1, 1, 2, ...} gets an added 0. This would be noticeable in certain “fountains” or smooth gradations of color - a small banding artifact would appear whenever the U or V axis is crossed in areas where R is equal to G. (This would be more noticeable after repeated transformations back and forth between PICT and JPEG - an average of 0.5 units of negative U would be lost each time). For this reason I decided to round towards negative infinity.
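The difference between the two rounding policies can be checked with a few lines of C. This is a sketch for illustration only. It relies on the fact that for the triples [8,8,B] the U value is exactly (B - 8) / 2, so the code works with twice_u = B - 8 to stay in integers. The function names are hypothetical.

```c
/* Round to nearest, with N.5 going toward zero. C integer division
 * truncates toward zero, which does exactly this for halves. */
static int round_half_toward_zero(int twice_u)
{
    return twice_u / 2;
}

/* Round to nearest, with N.5 going toward negative infinity. */
static int round_half_down(int twice_u)
{
    int q = twice_u / 2;
    if (twice_u % 2 < 0)   /* negative odd value: truncation went up */
        q--;
    return q;
}

/* Count how many B values in [b_first, b_last] map to U == 0
 * for the triples [8,8,B]. */
static int count_zero_outputs(int (*round_fn)(int), int b_first, int b_last)
{
    int b, zeros = 0;
    for (b = b_first; b <= b_last; b++)
        if (round_fn(b - 8) == 0)
            zeros++;
    return zeros;
}
```

Over B = 2..12, rounding toward zero yields three zeros while rounding toward negative infinity yields two, which is the extra 0 visible in the table.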

Algorithms & Techniques

The first two are pretty obvious: integer math and array lookups. If we deal with 2’s-complement values, the rounding towards negative infinity and the 0.4995 rounding adjustment are easy to implement with integer math representing fixed-point fractions. Array lookups replace multiplication: since the generated code is for the 68000, there’s no hope of doing any type of multiplication faster than array lookups.

Array lookup is simply the technique of storing a multiplication table in memory. For example, one of the values we have to multiply by in this problem is 0.587. This value is multiplied by a G (green) pixel value that is between 0 and 255. So we create an array with 256 elements; to multiply 0.587 by the value 42, we look at the 42nd entry in the array. This type of operation can be much faster than a “real” multiplication.
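As a minimal sketch of the idea, the following C fragment builds a 256-entry table for the 0.587 coefficient and then "multiplies" with a single array read. The coefficient 4924113 is 0.587 scaled by 2^23, taken from Listing 2. The 12 fractional bits in the table entries, and the names green_y_table, init_green_y_table and mul_0_587, are assumptions made for this illustration.

```c
#include <stdint.h>

#define GREEN_Y_COEFF 4924113L   /* 0.587 scaled by 2^23, as in Listing 2 */

static int32_t green_y_table[256];   /* 0.587 * g, with 12 fractional bits */

/* Fill the table once, outside the timed loop. */
static void init_green_y_table(void)
{
    int32_t g;
    for (g = 0; g < 256; g++) {
        /* Reduce from 23 to 12 fractional bits, rounding to nearest. */
        green_y_table[g] = (int32_t)((GREEN_Y_COEFF * g + 1024) / 2048);
    }
}

/* "Multiply" 0.587 by g with a single array read. */
static int32_t mul_0_587(unsigned char g)
{
    return green_y_table[g];
}
```

After init_green_y_table() has run, mul_0_587(42) returns 100983, which is 24.654 in this fixed-point format (100983 / 4096).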

Parallel Addition

This is the most important optimization idea I used. Here is an example of the technique: Imagine that we’re trying to add three pairs of decimal numbers:

42 38 17

+ 84 + 20 + 91

---- ---- ----

? ? ?

and suppose that we want to accomplish this with one addition operation. We can do it by forming each row of three numbers into a single large number and adding the large numbers together:

4200380017

+ 8400200091

------------

12600580108

In theory, three 10-bit binary numbers could be added in parallel this way, using 32-bit variables to hold the values. With such an approach, the RGB-to-YUV conversion would be done like this:


    r -> lookup table ->     yr.ur.vr
    g -> lookup table ->     yg.ug.vg
    b -> lookup table ->   + yb.ub.vb
                             --------
                             Y  U  V

Each component of the RGB goes into a lookup table, to get the Y, U, and V components for that component of RGB. Then the nine Y, U, and V components are added together in one step to produce the final YUV.
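The decimal example translates directly into binary. The sketch below packs three values into 10-bit fields of a 32-bit word, so that one addition performs three additions. The helper names pack3, field and add3 are hypothetical, and the sketch assumes no individual sum exceeds 1023 (carry handling is discussed in the next section).

```c
#include <stdint.h>

#define FIELD_BITS 10
#define FIELD_MASK 0x3ffu

/* Pack three values (each assumed to fit in 10 bits) into one word. */
static uint32_t pack3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a << (2 * FIELD_BITS)) | (b << FIELD_BITS) | c;
}

/* Extract field 0 (rightmost), 1 (middle), or 2 (leftmost). */
static uint32_t field(uint32_t packed, int n)
{
    return (packed >> (n * FIELD_BITS)) & FIELD_MASK;
}

/* One 32-bit addition performs three independent 10-bit additions,
 * provided no individual sum exceeds 1023. */
static uint32_t add3(uint32_t x, uint32_t y)
{
    return x + y;
}
```

With the numbers from the decimal example, add3(pack3(42, 38, 17), pack3(84, 20, 91)) holds 126, 58 and 108 in its three fields.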

Well, that’s great, but we can only fit 10 bits of each component into the 32-bit values that we’re adding together, and as described above we need 19 bits to calculate each component of YUV.

One solution to this is to have two sets of YUV components for each RGB component, with the first set giving the high 10 bits of the YUV components and the second set giving the low 10 bits. However, as we’ll see, we can’t get this many bits and get an accurate answer.

The other solution would be to perform 64-bit math. We will discuss both options after first discussing the issue of carry bits.

Carry Bits

Going back to the decimal example above, suppose we had tried to add our three numbers together this way:

     42     38     17   --->     423817
   + 84   + 20   + 91   --->   + 842091
   ----   ----   ----          --------
      ?      ?      ?           1265908

The sum 108 from the 17+91 part of the problem has interfered with the sum 58 from the 38+20 part. The same thing happens in binary, so when we do our additions of Y,U,V we have to leave enough room for carry bits.
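The same interference is easy to reproduce in binary. In this hedged sketch (function names are made up for illustration), two values are packed into fields of a chosen width. With 8-bit fields the sum 200 + 100 = 300 spills into the neighboring field, and widening the fields by one bit leaves room for the carry.

```c
#include <stdint.h>

/* Pack two values into fields that are `width` bits wide. */
static uint32_t pack2(uint32_t hi, uint32_t lo, int width)
{
    return (hi << width) | lo;
}

/* Add two packed words and return the upper field of the sum. */
static uint32_t upper_sum(uint32_t x, uint32_t y, int width)
{
    uint32_t mask = (1u << width) - 1u;
    return ((x + y) >> width) & mask;
}

/* Add two packed words and return the lower field of the sum. */
static uint32_t lower_sum(uint32_t x, uint32_t y, int width)
{
    uint32_t mask = (1u << width) - 1u;
    return (x + y) & mask;
}
```

Adding the pairs (1, 200) and (1, 100) with 8-bit fields gives an upper field of 3 instead of 2 and a lower field of 44 instead of 300. With 9-bit fields both sums come out correctly.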

Let’s return to the YUV problem and use 64-bit math. We have three arrays, each with 256 elements. The first array-lookup takes the R value as its index and yields a 64-bit wide value that has three 20-bit fields embedded in it (corresponding to 0.299*R, -0.169*R, and 0.500*R) which are the “R component” of Y, U, and V; I called these fields yr, ur, and vr:


        r -> lookup table ->     yr.ur.vr

The other two array-lookups are the analogous operation for G and B:


        g -> lookup table ->     yg.ug.vg
        b -> lookup table ->     yb.ub.vb

Then you add it together to get Y, U and V. There are three rows of figures, or in other words three figures in each column. The worst case (from a carry or overflow point of view) would occur when all three figures in a given column had the maximum possible value (which would be $2^{20}-1$). This doesn’t happen, but we get fairly close in the U column when R and G are both 1 and B is 255. In this case, the values ur and ug are close to $2^{20}-1$ and ub is about $2^{19}$. When you add those together you get a value that is a little higher than $2^{21}$ and therefore takes 22 bits to represent. (By the way, since we’re using signed 2’s-complement representations for ur, ug and ub, we can safely treat the two high bits as overflow bits and discard them.)

If we could somehow reduce it to two figures in each column, we might be able to save one carry bit. Fortunately, we can. Notice that we are allowed enough memory in our temp buffer to store 65536-element arrays. We can have one array that contains YUV components for both R and G at the same time, pre-added. We still need a separate table for the B’s:


    index_rg=((r<<8)+g)
  
    index_rg -> table ->   yrg.urg.vrg
           b -> table ->    yb. ub. vb

Now we only have two numbers in each column being added. The worst case is in the V column when R is 0, G and B are 1; vrg and vb are both close to $2^{20}-1$ and their sum is close to $2^{21}-1$, which can be represented in 21 bits. (Again, the extra bit is just a wrap-around overflow and can be safely ignored.)
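To make the two-table scheme concrete, here is a sketch in C99 that uses a native uint64_t for the 64-bit value. This is not the layout used in the listings at the end of the article (those split the value into two 32-bit words for the 68000). The field order Y.U.V with 21-bit spacing, and every function and table name, are assumptions chosen for readability. The matrix constants and the 0x7fe adjustment are copied from Listing 2.

```c
#include <stdint.h>

/* Matrix coefficients scaled by 2^23, copied from Listing 2. */
static const int32_t m[9] = {
     2508194,  4924113,   956301,
    -1415459, -2778845,  4194304,
     4194304, -3512206,  -682098 };

#define ROUND_ADJUST 0x7fe      /* about 0.4995 with 12 fractional bits */
#define FIELD_MASK   0xfffffu   /* 20-bit field */
#define FIELD_STEP   21         /* 20 bits plus one bit for overflow */

static uint64_t table_rg[65536];
static uint64_t table_b[256];

/* Reduce 23 fractional bits to 12, rounding to nearest. Floor division is
 * spelled out so the result does not depend on how >> treats negatives. */
static int32_t to_fx12(int32_t v)
{
    int32_t n = v + 1024;
    int32_t q = n / 2048;
    if (n % 2048 < 0)
        q--;
    return q;
}

/* Pack three 20-bit two's-complement fields as Y.U.V, 21 bits apart. */
static uint64_t pack_yuv(int32_t y, int32_t u, int32_t v)
{
    return ((uint64_t)((uint32_t)y & FIELD_MASK) << (2 * FIELD_STEP))
         | ((uint64_t)((uint32_t)u & FIELD_MASK) << FIELD_STEP)
         |  (uint64_t)((uint32_t)v & FIELD_MASK);
}

/* Build the pre-added R+G table and the B table. The rounding
 * adjustment is added once, in the RG table only. */
static void init_tables(void)
{
    int32_t r, g, b;
    for (r = 0; r < 256; r++)
        for (g = 0; g < 256; g++)
            table_rg[(r << 8) + g] = pack_yuv(
                to_fx12(m[0] * r + m[1] * g) + ROUND_ADJUST,
                to_fx12(m[3] * r + m[4] * g) + ROUND_ADJUST,
                to_fx12(m[6] * r + m[7] * g) + ROUND_ADJUST);
    for (b = 0; b < 256; b++)
        table_b[b] = pack_yuv(to_fx12(m[2] * b), to_fx12(m[5] * b),
                              to_fx12(m[8] * b));
}

/* Integer part (top 8 of 20 bits) of field n; sign-extended if requested. */
static int field_int(uint64_t sum, int n, int is_signed)
{
    int v = (int)((sum >> (n * FIELD_STEP + 12)) & 0xff);
    return (is_signed && v >= 128) ? v - 256 : v;
}

/* Two lookups and one addition convert a whole pixel. */
static void rgb_to_yuv(int r, int g, int b, int *y, int *u, int *v)
{
    uint64_t sum = table_rg[(r << 8) + g] + table_b[b];
    *y = field_int(sum, 2, 0);
    *u = field_int(sum, 1, 1);
    *v = field_int(sum, 0, 1);
}
```

After init_tables(), converting white (255,255,255) should give Y=255, U=0, V=0, and pure red (255,0,0) should give Y=76, U=-43, V=127. The 21st bit of each field absorbs the wrap-around overflow and is simply masked off when the result is extracted.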

Adding in Two 32-bit Parts

Now let’s consider the problem of doing the YUV conversion in two 32-bit pieces, with three figures per column as in the original scheme. The least significant half of the computation has to generate carry bits for each of the three columns, and you need 2 carry bits for each column. As a result the best you can do is 2+8, 2+8, 2+8 with 2 unused bits; in the top half you can discard carry bits so you can manage 9, 2+9, 2+9 with 1 unused bit. If you judiciously select the placement of unused bits, everything lines up right and you get a net result of 17 “useful” bits for the computation.

Seventeen bits isn’t enough, so it is now clear that we need to use the 65536-element arrays so we can get just two figures in each column to add. Here is the schematic for that:

                              high          low
    index_rg -> table ->   yrg.urg.vrg   yrg.urg.vrg
           b -> table -> +  yb. ub. vb    yb. ub. vb
                            ----------  ------------
                            Y   U   V <-(carry bits)

The “low” side on the right generates carry bits that are added to the “high” side to generate the result.

Consider the Y column for a moment: The actual values we need to add are 19 bits wide, and we split them into a 10-bit part and a 9-bit part - 10 bits in the high 32-bit addition, and 9 bits in the low 32-bit addition.

(By the way, notice that this is not the division of ordinal/fraction. There are eight bits in the ordinal (integer) part and 11 bits in the fractional part, because the entire 19-bit value needs to represent values from 0.000 to 255.999. So those high 10 bits contain 8 ordinal bits and 2 fractional bits.)

The U and V columns are similar. Here, the values being represented are signed, with one sign bit, seven bits in the ordinal part and 11 fractional bits (12 in the case of V), to represent values from -128.000 to 127.999. Again, these bits are divided between a 10-bit most-significant part in the high 32-bit side and a 9-bit (10-bit in the case of V) least-significant part in the low 32-bit side.

64-bit Math Wins

Now we’ve worked out two ways to perform the RGB-to-YUV conversion, but it isn’t too clear which is better. I wrote both and benchmarked them against each other, but we don’t have to do that to see which one wins.

The primary disadvantage of the first approach (adding in two 32-bit parts) is propagating the carries from the low 32-bit part to the high 32-bit part. It requires a mask, shift and add. The following illustrates the idea with hypothetical 21-bit values (actual values would be 32 bits wide):


 Sum of low parts of Y, U, and V:  CyyyyyyCuuuuuuCvvvvvv
 (C represents a carry bit)
 AND it with a constant:           100000010000001000000
 to get:                           C000000C000000C000000
 now shift right by 6:             000000C000000C000000C
 and add to the “high” part:       XyyyyyyXuuuuuuXvvvvvv
 (X represents an overflow bit that can be ignored)
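In C, the mask, shift and add steps for the hypothetical 21-bit layout might look like the sketch below. The constant 0x102040 is the binary mask 100000010000001000000 from the illustration; the function name propagate_carries is made up here.

```c
#include <stdint.h>

/* Hypothetical 21-bit layout from the illustration: three 7-bit fields,
 * each a carry (or overflow) bit followed by 6 data bits. */
#define CARRY_MASK 0x102040u   /* binary 100000010000001000000 */

/* Move each carry bit of the low-part sum down to the bottom of its
 * field and add it into the high-part sum. */
static uint32_t propagate_carries(uint32_t low_sum, uint32_t high_sum)
{
    uint32_t carries = (low_sum & CARRY_MASK) >> 6;
    return high_sum + carries;
}
```

For example, adding the 12-bit values 1023 and 1 in the rightmost column gives a low-part sum of 63 + 1 = 64 (carry set, data bits zero) and a high-part sum of 15 + 0 = 15. After propagation the high part is 16, and 16 * 64 + 0 is the expected 1024.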

With the 64-bit approach, the “low” parts in the computation consist of the entire 20-bit Y component (which needs no overflow bit because the Y computation is unsigned), plus 12 bits of the U component:

            uuuuuuuuuuuuyyyyyyyyyyyyyyyyyyyy

and the “high” parts contain the remaining 8 bits of U, an overflow bit for U, all 20 bits of V, an overflow bit for V and two unused bits:

            00XvvvvvvvvvvvvvvvvvvvvXuuuuuuuu

Only one carry bit has to be handled, to handle a carry from the low 12 bits of U to the high 8 bits of U. As it turns out, we can do this without even using up one of the 32 bits!

Here’s how that is accomplished: the two low parts (from the array-lookup) are added to generate a sum; if the sum is less than either of the addends then a carry has occurred and 1 is added to the high part. Doing this overflow test requires a compare and branch (for the test) plus an “increment” (a 68000 ADDQ instruction, which is faster than a normal add).
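The carry test can be tried in isolation. This sketch adds two 64-bit quantities held as separate 32-bit halves and detects the carry with an unsigned comparison. The wrapper that compares against native uint64_t arithmetic exists only for checking; the names are assumptions for this illustration.

```c
#include <stdint.h>

/* Add two 64-bit quantities held as separate 32-bit halves.
 * Unsigned addition wraps around, so a carry out of the low half
 * shows up as a sum smaller than either addend. */
static void add64_halves(uint32_t a_hi, uint32_t a_lo,
                         uint32_t b_hi, uint32_t b_lo,
                         uint32_t *sum_hi, uint32_t *sum_lo)
{
    uint32_t lo = a_lo + b_lo;
    uint32_t hi = a_hi + b_hi;
    if (lo < a_lo)      /* there was a carry */
        hi++;
    *sum_hi = hi;
    *sum_lo = lo;
}

/* Convenience wrapper so the result can be compared with native
 * 64-bit arithmetic. */
static uint64_t add64_via_halves(uint64_t a, uint64_t b)
{
    uint32_t hi, lo;
    add64_halves((uint32_t)(a >> 32), (uint32_t)a,
                 (uint32_t)(b >> 32), (uint32_t)b, &hi, &lo);
    return ((uint64_t)hi << 32) | lo;
}
```

Comparing against only one addend is sufficient: if the low sum wrapped, it is smaller than both addends.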

Notice that we were able to arrange the components Y, U and V in a more convenient manner. The Y component doesn’t need overflow bits because each of the Y values being added together is unsigned, and we know the maximum possible value of the sum is 255.999... So we put Y on the very right, and pack 12 bits of U into the rest of the bottom part. The other 8 bits of U are the ordinal (integer) part, and we put these in the low 8 bits of the high 32-bit word. These 8 bits are what we’ll write out to the U output buffer, and having them in the low 8 bits of our 32-bit word means that we won’t have to do a shift to get these bits. (We still need to do a shift for the Y and V values.) It might help to show the 32-bit parts again, this time with the ordinal (integer) parts of Y, U and V in uppercase letters:


   lower part:   uuuuuuuuuuuuYYYYYYYYyyyyyyyyyyyy
   upper part:   00XVVVVVVVVvvvvvvvvvvvvXUUUUUUUU

It should also be pointed out that with the 64-bit approach we can compute each component with 20 bits of accuracy, one bit more than we need. More accuracy never hurts, particularly if the transform matrix might need to change. (Remember, the error analysis was valid only for the specific matrix shown at the beginning of the article.)
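The bit layout can be exercised with a small pack/unpack sketch. The packing expressions mirror the ones in RGBtoYUVInit (Listing 2) and the unpacking mirrors the shifts in RGBtoYUV; the function names and the explicit sign-extension helper are assumptions added here so the sketch does not rely on implementation-defined conversions.

```c
#include <stdint.h>

/* Build the two 32-bit words from 20-bit fixed-point components
 * (8 integer bits, 12 fractional bits), following the layout
 *   upper: 00XVVVVVVVVvvvvvvvvvvvvXUUUUUUUU
 *   lower: uuuuuuuuuuuuYYYYYYYYyyyyyyyyyyyy                     */
static uint32_t pack_upper(int32_t u, int32_t v)
{
    uint32_t uf = (uint32_t)u & 0xfffffu;
    uint32_t vf = (uint32_t)v & 0xfffffu;
    return (vf << 9) | (uf >> 12);
}

static uint32_t pack_lower(int32_t y, int32_t u)
{
    uint32_t uf = (uint32_t)u & 0xfffu;
    return (uf << 20) | ((uint32_t)y & 0xfffffu);
}

/* Reinterpret the low 8 bits of x as a signed value. */
static int as_signed8(uint32_t x)
{
    int v = (int)(x & 0xffu);
    return v >= 128 ? v - 256 : v;
}

/* Y needs a shift by 12, V a shift by 21, and U needs no shift at all. */
static int unpack_y(uint32_t lower) { return (int)((lower >> 12) & 0xffu); }
static int unpack_u(uint32_t upper) { return as_signed8(upper); }
static int unpack_v(uint32_t upper) { return as_signed8(upper >> 21); }
```

For instance, the fixed-point values y = 314345, u = -174195 and v = 524286 (roughly 76.7, -42.5 and 127.999) unpack to 76, -43 and 127.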

Instruction Scheduling and Other Optimizations

Listing 1 shows the main loop from my RGBtoYUV routine before I optimized it. After implementing each of the two parallel addition methods described above (and another for testing accuracy) I began optimizing the code.

I optimized quite a bit by using forced type coercion all over the place in the array index computations. This eliminated EXT.L instructions.

I also optimized by treating the data structure as a huge amorphous block of bytes and explicitly computing the offsets into it. This was an optimization mainly because the “offset” operation only needs to be computed twice rather than 4 times. This is possible because the first two arrays (with the 65536 elements) are both indexed by the value ((red << 8) + green), which means that the numbers you fetch will always be 262144 bytes apart. The same type of relation holds for the two 256-element arrays.
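The byte-offset idea can be sketched as follows. The structure has the same shape as rgb_yuv_data in Listing 2, with uint32_t standing in for the article's unsigned32 typedef; the helper names and the stride macros are hypothetical. The sketch assumes uint32_t is exactly 4 bytes with no padding between the arrays.

```c
#include <stdint.h>

/* Same shape as the structure in Listing 2, using uint32_t in place of
 * the article's unsigned32 typedef. */
typedef struct {
    uint32_t yuv_rg_l[65536];
    uint32_t yuv_rg_h[65536];
    uint32_t yuv_b_l[256];
    uint32_t yuv_b_h[256];
} rgb_yuv_tables;

#define RG_STRIDE (65536L * 4L)   /* 262144 bytes between the RG arrays */
#define B_STRIDE  (256L * 4L)     /* 1024 bytes between the B arrays */

/* Fetch both RG words from one computed byte offset. */
static void fetch_rg(const rgb_yuv_tables *t, unsigned r, unsigned g,
                     uint32_t *lo, uint32_t *hi)
{
    const unsigned char *p = (const unsigned char *)t;
    long offset = (long)(((r << 8) + g) << 2);   /* index times 4 bytes */
    p += offset;
    *lo = *(const uint32_t *)p;
    p += RG_STRIDE;
    *hi = *(const uint32_t *)p;
}

/* Fetch both B words: the B arrays start after the two RG arrays. */
static void fetch_b(const rgb_yuv_tables *t, unsigned b,
                    uint32_t *lo, uint32_t *hi)
{
    const unsigned char *p = (const unsigned char *)t + 2 * RG_STRIDE;
    p += (long)(b << 2);
    *lo = *(const uint32_t *)p;
    *hi = *(const uint32_t *)(p + B_STRIDE);
}
```

Each pair of words costs one offset computation plus a constant stride, rather than two separate scaled index computations.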

The most significant optimizations are what make RGBtoYUV so hard to read. Most of the operations have been broken up and interleaved, to minimize pipeline stalls. Pipeline stalls generally occur because the result of one operation is used in the next operation. By reordering operations, the code can be made to run faster without actually decreasing the number of operations.

For best results, you should begin by breaking up statements into as many small steps as possible. For example, after doing the offset-indexing changes described above my code contained the statement:


   p = p + 262144L - index + i2;

Breaking this up, we get three statements:


   p += 262144L;
   p -= index;
   p += i2;

Now you can see that each of these statements depends on the previous value of p, and generates a new value of p. Usually in this situation you would want to interleave other unrelated operations so that the value of p isn’t being reused right away each time:


   p += 262144L;
   (some other operation)
   p -= index;
   (some other operation)
   p += i2;

I didn’t have enough unrelated operations to do this. However, I noticed that it was okay to change the value of the variable index (since it isn’t used again) and that meant that I could think of it as


   p = p + 262144L - (index - i2);

and transform that into:

   p += 262144L;
   index -= i2;
   p -= index;

This solved the problem quite nicely. There is still the problem that index is getting used right away, but I was able to find other operations to interleave and avoid that pipeline stall as well.

In one place I was able to optimize by breaking up x>>=12L into two copies of x>>=6L . This is because shifts by more than 8 require a temporary register to be loaded with the shift amount. Normally this type of transformation wouldn’t speed things up, but in this case I was able to move one of the resulting statements to reduce a pipeline stall.

Optimizations I Didn’t Do (and why)

One optimization I skipped involves cache misses. When running on any machine with a data cache larger than about 1K, the performance of this algorithm will depend greatly on the gamut of source pixel values. In other words, if the source pixels are scattered all around the RGB color cube, the loads (array reads) will cause a high incidence of cache misses, with corresponding degradation in performance. On sufficiently pipelined CPUs (the ‘040 and PowerPC) with a large SRAM cache card this means that an algorithm with 256-entry lookup tables would outperform an algorithm with 65536-entry lookup tables.

The ideal way to address this would be with adaptive dispatching to multiple alternate algorithms. Under such a scheme, the code would process the image data in chunks, benchmarking itself with each chunk and deciding based on the performance when to switch back and forth between the 256-entry algorithm and the 65536-entry algorithm.

Unfortunately, TickCount was too coarse for this application, and I didn’t want to bother with the microsecond timer.

Another optimization I skipped would handle identical source and destination buffers. It is conceivable that the caller might use the same buffers for the YUV output as for the RGB input. If a test for this were made, then three pointers could be used instead of six, allowing optimization. However, it isn’t clear which of [Y,U,V] would be the same as [R], and so on; since there are six possibilities I decided it wasn’t worth bothering.

If you are using an RGBtoYUV routine in your own program, you can probably put this optimization in quite easily.

I also refrained from unrolling the loop. After optimizing two versions of RGBtoYUV with the above techniques I tried loop unrolling. It improved version 1, but actually made version 2 worse. The unrolled version 1 was still slower than the non-unrolled version 2. Since unrolling might also have made it slower on the 68020 and 68030 (which have very small instruction caches) I decided to skip the idea entirely.

Benchmark Gotchas

I encountered a lot of variations in benchmark results because of cache TLB entries. For example, if you allocate 6 consecutive 1024-byte buffers for R, G, B, Y, U and V and call RGBtoYUV repeatedly on the contents of those buffers (without doing anything else between calls) it will usually run somewhat slower than if the buffers are further apart or are scattered around randomly in memory. There are also many dependencies related to the locations of the array indices accessed, which depend on the actual color entries used. Worst-case benchmarks usually result from filling the RGB buffers with patterns of consecutive values (as I needed to do for the code that verified correct translation of all possible [R,G,B] triples). Real RGB images would produce average results.

Also, as always I had to avoid moving the mouse to get consistent results every time. There was still a bit of fluctuation due to Ethernet traffic.

In Conclusion

By using a number of unique ideas we have arrived at an extremely fast and portable implementation of the color-space conversion, a reasonably typical iterative task. I think you will find these ideas useful in other problems - anything compute-intensive that uses integer operations. From Photoshop plug-ins and QuickTime codecs to cellular automata simulations or screen savers - the applications are virtually limitless!

I have verified the code in this article by copying the code out of this MS Word document and pasting into Think C, then running my test shell to verify proper operation. (Since I have written so many different versions of this program and had to do a lot of cutting and pasting to generate the listings given here, I was a bit uncertain if it would run.)

Listing 1


/* RGBtoYUV
 * This is the RGBtoYUV routine before all the strange optimization was
 * applied. The final version is in the following listing. The types
 * signed32, unsigned32 and rgb_yuv_data are defined in Listing 2. */

void RGBtoYUV(unsigned char *ra, unsigned char *ga,
    unsigned char *ba, unsigned char *ya,
    signed char *ua, signed char *va,
    unsigned long numpix, void *pd)
{
    register signed32      index;
    register signed32      i2;
             unsigned32    i;
    register unsigned32    hi, lo_rg, lo;
    register unsigned char *yar = ya;
    register signed char   *var = va;
    register rgb_yuv_data  *p = pd;

    for(i=0; i<numpix; i++) {
        /* Compute indexes */
        index = ((*ra++)<<8)+(*ga++);
        i2 = *ba++;

        /* Get high and low word from the RG arrays */
        lo_rg = p->yuv_rg_l[index];
        hi = p->yuv_rg_h[index];

        /* Add high and low words from the B arrays */
        lo = lo_rg + p->yuv_b_l[i2];
        hi += p->yuv_b_h[i2];

        /* Test for carry */
        if (lo < lo_rg) {
            /* there was a carry! */
            hi++;

            /* Store results */
            *ua++ = hi;
            *yar++ = lo>>12L;
            *var++ = hi>>21L;
        } else {
            /* Store results */
            *ua++ = hi;
            *yar++ = lo>>12L;
            *var++ = hi>>21L;
        }
    }
}

Listing 2


/* Prototypes */

void *RGBtoYUVInit(void);

void RGBtoYUV(unsigned char *ra, unsigned char *ga,
    unsigned char *ba, unsigned char *ya,
    signed char *ua, signed char *va,
    unsigned long numpix, void *pd);


/* Typedefs to tame the C language. No compiler switches, because short
 * always seems to be 16-bit... */
typedef signed char signed8;
typedef unsigned char unsigned8;
typedef signed short signed16;
typedef unsigned short unsigned16;
typedef unsigned long unsigned32;
typedef signed long signed32;

/* Data structure for the parallel-add algorithms */
typedef struct rgb_yuv_data {
    unsigned32  yuv_rg_l[65536L];
    unsigned32  yuv_rg_h[65536L];
    unsigned32  yuv_b_l[256];
    unsigned32  yuv_b_h[256];
} rgb_yuv_data;

/* Transform matrix: Determines the orientation of the RGB color cube within
 * YUV space, and the relative intensities of R, G, and B.
 * Each coefficient is the matrix entry scaled by 2^23.
 * NOTE: Whenever the matrix is changed, error analysis has to be done to
 * determine if 20 bits is still enough accuracy to determine the rounding
 * direction of results! */

signed32 ml[] = {
  2508194L,  4924113L,   956301L,
 -1415459L, -2778845L,  4194304L,
  4194304L, -3512206L,  -682098L };

/* RGBtoYUVInit
 * Parallel addition, 64-bit math version.
 * Hi format is 00CVVVVVVVVvvvvvvvvvvvvCUUUUUUUU
 * Lo format is uuuuuuuuuuuuYYYYYYYYyyyyyyyyyyyy
 */
void *RGBtoYUVInit()
{
    rgb_yuv_data  *p;
    Handle        h;
    OSErr         err;
    unsigned16    r, g, b;
    signed32      index;
    signed32      yl, ul, vl;
    unsigned32    yi;
    signed32      ui, vi;
    unsigned32    round_adjust = 0x7fe;
    unsigned32    lo12 = 0x0fffL;
    unsigned32    lo20 = 0xfffffL;
    unsigned32    datasize = sizeof(rgb_yuv_data);

    h = TempNewHandle(datasize, &err);
    HLock(h);
    p = *((rgb_yuv_data **) h);

    for (r=0; r<256; r++) {
        for (g=0; g<256; g++) {
            yl = ml[0] * ((signed32) r)
                         + ml[1] * ((signed32) g);
            ul = ml[3] * ((signed32) r)
                         + ml[4] * ((signed32) g);
            vl = ml[6] * ((signed32) r)
                         + ml[7] * ((signed32) g);

            yi = (yl + 1024)>>11L; yi += round_adjust;
            ui = (ul + 1024)>>11L; ui += round_adjust;
            vi = (vl + 1024)>>11L; vi += round_adjust;

            index = (((long)r)<<8L) | ((long)g);

            (p->yuv_rg_h)[index] =
                ((vi&lo20)<<9L) | ((ui&lo20)>>12L);

            (p->yuv_rg_l)[index] = ((ui&lo12)<<20L) | yi;
       }
    }

    for (b=0; b<256; b++) {
        yl = ml[2] * ((signed32) b);
        ul = ml[5] * ((signed32) b);
        vl = ml[8] * ((signed32) b);

        yi = (yl + 1024)>>11L;
        ui = (ul + 1024)>>11L;
        vi = (vl + 1024)>>11L;

        index = b;
        p->yuv_b_h[index] =
                    ((vi&lo20)<<9L) | ((ui&lo20)>>12L);

        p->yuv_b_l[index] = ((ui&lo12)<<20L) + yi;
   }

    return ((void *)p);
}

/* RGBtoYUV for 64-bit math version.
 * It’s hard to read because the instructions were reordered to minimize
 * pipeline stalls from result dependencies on the ‘040. */

void RGBtoYUV(unsigned char *ra, unsigned char *ga,
    unsigned char *ba, unsigned char *ya,
    signed char *ua, signed char *va,
    unsigned long numpix, void *pd)
{
    register signed32      index;
    register signed32      i2;
             unsigned32    i;
    register unsigned32    hi, lo_rg, lo;
    register unsigned char *yar = ya;
    register signed char   *var = va;
    register unsigned8     *p;

    for(i=0; i<numpix; i++) {
        p = (unsigned8 *) pd;
        index = ((long)(*ra));     /* Compute indexes */
        i2 = ((long)(*ba));
        index <<= 8L;
        index += ((long)(*ga));
        i2 <<= 2L;
        index <<= 2L;
        ga++;                      /* Increment src ptrs */
        p += index;
        lo_rg = *((unsigned32 *)(p)); /* yuv_rg, low */
        p += 262144L;
        ra++;
        hi = *((unsigned32 *)(p)); /* yuv_rg, high */
        index -= i2;
        p += 262144L;
        ba++;
        p -= index;
        lo = lo_rg + *((unsigned32 *)(p));
        hi += *((unsigned32 *)(p+1024L));
        if (lo < lo_rg) {
            hi++;         /* there was a carry! */
            lo >>= 6L;    /* Store the results */
            *ua = hi;
            lo >>= 6L;
            *yar++ = lo;
            hi >>= 21L;
            ua++;
            *var++ = hi;
        } else {
            lo >>= 6L;    /* Store the results */
            *ua = hi;
            lo >>= 6L;
            *yar++ = lo;
            hi >>= 21L;
            ua++;
            *var++ = hi;
        }
    }
}







  
 
AAPL
$105.22
Apple Inc.
+0.39
MSFT
$46.13
Microsoft Corpora
+1.11
GOOG
$539.78
Google Inc.
-4.20

MacTech Search:
Community Search:

Software Updates via MacUpdate

Ember 1.8.2 - Versatile digital scrapboo...
Ember (formerly LittleSnapper) is your digital scrapbook of things that inspire you: websites, photos, apps or other things. Just drag in images that you want to keep, organize them into relevant... Read more
Tonality Pro 1.1.2 - Professional-grade...
Tonality Pro gives you the power to create stunning and dramatic black & white images. This is a complete monochrome image editor with more than 150 one-click style presets, totally unique... Read more
VueScan 9.4.49 - Scanner software with a...
VueScan is a scanning program that works with most high-quality flatbed and film scanners to produce scans that have excellent color fidelity and color balance. VueScan is easy to use, and has... Read more
OS X Server 4.0 - For OS X 10.10 Yosemit...
Designed for OS X and iOS devices, OS X Server makes it easy to share files, schedule meetings, synchronize contacts, develop software, host your own website, publish wikis, configure Mac, iPhone,... Read more
TotalFinder 1.6.12 - Adds tabs, hotkeys,...
TotalFinder is a universally acclaimed navigational companion for your Mac. Enhance your Mac's Finder with features so smart and convenient, you won't believe you ever lived without them. Tab-based... Read more
BusyCal 2.6.3 - Powerful calendar app wi...
BusyCal is an award-winning desktop calendar that combines personal productivity features for individuals with powerful calendar sharing capabilities for families and workgroups. BusyCal's unique... Read more
calibre 2.7 - Complete e-library managem...
Calibre is a complete e-book library manager. Organize your collection, convert your books to multiple formats, and sync with all of your devices. Let Calibre be your multi-tasking digital... Read more
Skitch 2.7.3 - Take screenshots, annotat...
With Skitch, taking, annotating, and sharing screenshots or images is as fun as it is simple.Communicate and collaborate with images using Skitch and its intuitive, engaging drawing and annotating... Read more
Delicious Library 3.3.2 - Import, browse...
Delicious Library allows you to import, browse, and share all your books, movies, music, and video games with Delicious Library. Run your very own library from your home or office using our... Read more
Art Text 2.4.8 - Create high quality hea...
Art Text is an OS X application for creating high quality textual graphics, headings, logos, icons, Web site elements, and buttons. Thanks to multi-layer support, creating complex graphics is no... Read more

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.