Binary Trees
Volume Number:   6

Issue Number:   8

Column Tag:   Language Translation

By Clifford Story, Mount Prospect, IL
Note: Source code files accompanying article are located on MacTech CDROM or source code disks.
A. Introduction
In this article, Part V, I’m going to take an excursion away from language translation and develop some support routines. In particular, I’m going to create routines to handle binary trees; later, I’ll use these routines to implement a symbol table for the new Canon tool and the fabled Inline tool.
There are a couple of “Why?” questions to answer. First question, why fill MacTutor with a long article on something that is taught in every college computer department in the U.S.? Ah, but not all of us have computer degrees; I don’t, for example. And while you can find the subject in various books, I couldn’t find a decent explanation of balanced binary trees; I had to figure it out from hints I gleaned from Knuth’s discussion (Searching and Sorting, pages 451-457). Second question, why a binary tree? Aren’t symbol tables usually done with hash tables? Well, yes, that is my impression, and Aho and Ullman state that hashing is “[t]he most efficient and commonly used method” for maintaining symbol tables (The Theory of Parsing, Translation, and Compiling, volume II, page 793). So why a binary tree? I’m sorry, I haven’t got an answer for that one...
If you should ever write your own compiler, I guess you should use a hash table. After all, Aho & Ullman’s opinion carries a lot of weight. In the meantime, binary trees are fun, so sit up and enjoy!
B. Simple Binary Trees
A binary tree is a tree with at most two subtrees at each node. These are commonly called the left and right subtrees. Searching a binary tree is very simple: take the left branch if the search key is less than the node’s value, and the right branch if it is greater. Keep going until you get either an exact match (in which case the search succeeds), or a nil pointer (in which case the search fails).
The example program I have written implements search and insertion, as well as a recursive dump routine. It does not do deletions. I haven’t any use for deletions in a symbol table, so I have left that part as an exercise...
B(1). Node Structure
Each node has four fields: the node’s value or key, a pointer to the left subtree, a pointer to the right subtree, and the data. Here is the node structure I will use in my example:
/* 1 */
typedef struct node
{
    char        *key;
    struct node *left;
    struct node *right;
    char        *data;
} node;
The data for this example is some arbitrary character string.
I’m going to create new nodes with the Mac trap NewPtr. I would usually use an array of nodes with a free list, and allocate them myself; I’m using NewPtr in this case for simplicity’s sake.
/* 2 */
// createnode
node *createnode(char *thekey,
                 char *thedata)
{
    node *thenode;
    int thelength;
    // allocate the node itself
    thenode = (node *)
        NewPtr(sizeof(node));
    if (thenode == nil)
        return(nil);
    // copy the key, including its terminator
    thelength = 1 + strlen(thekey);
    thenode->key = (char *)
        NewPtr(thelength);
    if (thenode->key == nil)
    {
        DisposPtr((Ptr)thenode);
        return(nil);
    }
    BlockMove((Ptr)thekey,
        (Ptr)thenode->key, thelength);
    thenode->left = nil;
    thenode->right = nil;
    // copy the data, including its terminator
    thelength = 1 + strlen(thedata);
    thenode->data = (char *)
        NewPtr(thelength);
    if (thenode->data == nil)
    {
        DisposPtr((Ptr)thenode->key);
        DisposPtr((Ptr)thenode);
        return(nil);
    }
    BlockMove((Ptr)thedata,
        (Ptr)thenode->data, thelength);
    return(thenode);
}
Finally, I’ll start off the table with a head node, a dummy node with no data that points to the true root of the tree. That way, I can create the tree before any insertions; I don’t have to special-case inserting data into a nil tree.
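The head-node idea can be sketched in portable C. This is only an illustration under stated assumptions: malloc stands in for NewPtr, the names hnode, hnode_new and hnode_insert are hypothetical, and keys are stored by pointer rather than copied. Because the head's key is the empty string, every non-empty key compares greater, so the real tree always hangs off the head's right pointer and insertion never sees a nil tree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down node: key only, no data field. */
typedef struct hnode {
    const char *key;            /* not copied in this sketch */
    struct hnode *left, *right;
} hnode;

/* Allocate a node with malloc (an assumption; the article uses NewPtr). */
hnode *hnode_new(const char *key)
{
    hnode *n = malloc(sizeof *n);
    if (n != NULL) {
        n->key = key;
        n->left = n->right = NULL;
    }
    return n;
}

/* Insert below `start`, which is never NULL because the head always
   exists. Returns the new node, or NULL for a duplicate key or a
   failed allocation. */
hnode *hnode_insert(hnode *start, const char *key)
{
    hnode *cur = start;
    int cmp;

    assert(start != NULL);
    while ((cmp = strcmp(key, cur->key)) != 0) {
        hnode **slot = (cmp < 0) ? &cur->left : &cur->right;
        if (*slot == NULL)
            return *slot = hnode_new(key);
        cur = *slot;
    }
    return NULL;                /* exact match: duplicate key */
}
```

Creating the table is then just hnode_new(""), and the first real insertion lands in the head's right pointer without any special handling.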
B(2). Finding a Node by Key
As I said above, looking up a node by key is easy. The routine below is passed a pointer to the root node of the tree, and the search key. It starts with the root and follows the algorithm, until either it finds the node it is looking for and falls out of the loop, or finds a nil pointer and returns it.
/* 3 */
// lookup
node *lookup(node *thetable,
             char *thekey)
{
    node *thenode;
    int compare;
    thenode = thetable;
    // loop until the keys match (strcmp returns 0)
    while ((compare = strcmp(
            thekey, thenode->key)) != 0)
    {
        if (compare < 0)
            thenode = thenode->left;
        else
            thenode = thenode->right;
        if (thenode == nil)
            return(nil);
    }
    return(thenode);
}
B(3). Inserting a Node
This routine inserts a new node into the tree. First, it performs a search not unlike that in the “lookup” routine, except that it returns nil (indicating failure) if the search is successful, i.e., a duplicate key exists. If, on the other hand, it finds a nil pointer, then that is the place for the new node, and the routine inserts it.
/* 4 */
// insert
node *insert(node *thetable,
             char *thekey, char *thedata)
{
    node *thenode;
    int compare;
    node *thechild;
    thenode = thetable;
    // loop until the keys match (duplicate key)
    while ((compare = strcmp(
            thekey, thenode->key)) != 0)
    {
        if (compare < 0)
        {
            thechild = thenode->left;
            if (thechild == nil)
            {
                thenode->left = createnode(
                    thekey, thedata);
                return(thenode->left);
            }
        }
        else
        {
            thechild = thenode->right;
            if (thechild == nil)
            {
                thenode->right = createnode(
                    thekey, thedata);
                return(thenode->right);
            }
        }
        thenode = thechild;
    }
    return(nil);
}
B(4). Dumping and Destroying
Tree structures are natural candidates for recursive routines, and here are two. The “dump” routine displays the entire tree, in key order. The “destroy” routine disposes of all the dynamic memory used by the tree.
/* 5 */
// dump
void dump(node *thetable)
{
    if (thetable == nil)
        return;
    // left subtree, then this node, then right subtree
    dump(thetable->left);
    printf("%10s  %10s  %10s\n",
        thetable->key,
        thetable->left ?
            thetable->left->key : "nil",
        thetable->right ?
            thetable->right->key : "nil");
    dump(thetable->right);
}
// destroy
void destroy(node *thetable)
{
    if (thetable == nil)
        return;
    // dispose of both subtrees before the node itself
    destroy(thetable->left);
    destroy(thetable->right);
    DisposPtr((Ptr)thetable->key);
    DisposPtr((Ptr)thetable->data);
    DisposPtr((Ptr)thetable);
}
B(5). A Test Shell
/* 6 */
// Binary.c
//
// Binary tree search and insertion
#pragma load "managers"
// Constants and Macros
#define nil 0
// Types
typedef enum {false, true} logical;
typedef struct node
{
    char        *key;
    struct node *left;
    struct node *right;
    char        *data;
} node;
// Prototypes
void initmac();
node *createnode(char *thekey,
                 char *thedata);
node *insert(node *thetable,
             char *thekey, char *thedata);
node *lookup(node *thetable,
             char *thekey);
void dump(node *thetable);
void destroy(node *thetable);
// main
main()
{
    node *thetable;
    char thestring[256];
    char command;
    char thekey[256];
    char thedata[256];
    node *thenode;
    initmac();
    // the head node: empty key, empty data
    thetable = createnode("", "");
    if (thetable == nil)
    {
        printf("\tUnable to "
            "create table\n");
        exit(2);
    }
    printf("? ");
    while (true)
    {
        gets(thestring);
        sscanf(&thestring[2],
            "%c %s %[^\n]",
            &command, thekey, thedata);
        switch (command)
        {
            case 'A':
            case 'a':
                if (insert(thetable, thekey,
                        thedata) == nil)
                    printf("\tKey "
                        "already exists "
                        "in table\n");
                break;
            case 'D':
            case 'd':
                // skip the head node
                dump(thetable->right);
                break;
            case 'F':
            case 'f':
                thenode = lookup(
                    thetable, thekey);
                if (thenode == nil)
                {
                    printf("\tKey not "
                        "found in table\n");
                    break;
                }
                printf("\tData is \"%s\"\n",
                    thenode->data);
                break;
            case 'Q':
            case 'q':
                destroy(thetable);
                exit(0);
        }
        printf("? ");
    }
}
// initmac
void initmac()
{
    InitGraf((Ptr)&qd.thePort);
    SetFScaleDisable(true);
    InitCursorCtl(nil);
}
C. Balanced Binary Trees
Simple binary trees are pretty and elegant but they aren’t perfect. Remember, I’ve promised a Canon tool will come out of all this, right? Aren’t most Canon dictionaries in alphabetical order? What happens to a binary tree when you insert already ordered data?
If you insert already ordered data into a simple binary tree, you get a degenerate tree. All the insertions go to the right; all the left pointers are nil. What is left is a linked list. Lookups in a linked list are O(N), instead of the O(log N) we would expect from a binary tree. (This means that the number of comparisons is proportional to the number of nodes, rather than to the log of the number of nodes. O(log N) is much faster than O(N).)
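To make the degenerate case concrete, here is a small sketch in portable C. It is not the article's code: it assumes integer keys instead of strings, uses malloc, and the names inode, plain_insert and plain_height are hypothetical. Height is measured as the number of steps to the most distant descendant, with an empty tree counted as -1 (an assumption chosen so that a lone node has height 0).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node with an integer key, for illustration only. */
typedef struct inode {
    int key;
    struct inode *left, *right;
} inode;

/* Plain, unbalanced insertion. Duplicates are ignored.
   Returns the (possibly new) root of the tree. */
inode *plain_insert(inode *root, int key)
{
    inode **slot = &root;

    while (*slot != NULL) {
        if (key == (*slot)->key)
            return root;
        slot = (key < (*slot)->key) ? &(*slot)->left
                                    : &(*slot)->right;
    }
    *slot = malloc(sizeof **slot);
    assert(*slot != NULL);      /* sketch: no error recovery */
    (*slot)->key = key;
    (*slot)->left = (*slot)->right = NULL;
    return root;
}

/* Steps from this node to its most distant descendant;
   an empty tree counts as -1. */
int plain_height(const inode *t)
{
    int l, r;

    if (t == NULL)
        return -1;
    l = plain_height(t->left);
    r = plain_height(t->right);
    return 1 + (l > r ? l : r);
}
```

Inserting the keys 1 through 100 in ascending order gives a tree of height 99, which is a linked list in all but name. Inserting the seven keys 4, 2, 6, 1, 3, 5, 7 in that order gives a tree of height 2.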
One solution to this problem is to use a balanced binary tree. This involves a little more work, and a more complicated insertion algorithm, but it restores the O(log N) running time.
C(1). What is a Balanced Tree?
The height of a node is the number of steps to its most distant descendant. The height of a tree (or a subtree) is the height of its root node. A binary tree is balanced if for every node in the tree, the height of the left subtree is within 1 of the height of the right subtree.
Here’s another way to say the same thing: each node in a binary tree has a balance factor, which is equal to the height of the left subtree minus the height of the right subtree. A binary tree is balanced if every balance factor is 0, +1 or -1.
It’s pretty easy to see that a balanced tree is not going to exhibit the sort of degenerate behavior that simple binary trees can fall into. It’s also easy to see that inserting new nodes into the tree can throw it out of balance, which is why insertion is a bit more complex than in the simple case.
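The definitions above translate almost word for word into code. The sketch below is an illustration only: the names bnode, bheight, bfactor and is_balanced are hypothetical, and it recomputes heights from scratch on every call, which is far too slow for real use (the stored "balance" field introduced later avoids this). An empty subtree is given height -1, an assumption that makes a childless node come out at height 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node carrying only the two links. */
typedef struct bnode {
    struct bnode *left, *right;
} bnode;

/* Height: steps to the most distant descendant; empty tree is -1. */
int bheight(const bnode *t)
{
    int l, r;

    if (t == NULL)
        return -1;
    l = bheight(t->left);
    r = bheight(t->right);
    return 1 + (l > r ? l : r);
}

/* Balance factor: left height minus right height. */
int bfactor(const bnode *t)
{
    assert(t != NULL);
    return bheight(t->left) - bheight(t->right);
}

/* A tree is balanced if every balance factor is 0, +1 or -1. */
int is_balanced(const bnode *t)
{
    int f;

    if (t == NULL)
        return 1;
    f = bfactor(t);
    return f >= -1 && f <= 1
        && is_balanced(t->left)
        && is_balanced(t->right);
}
```

A chain of three nodes linked only through their left pointers has a balance factor of +2 at the top, so is_balanced reports it as unbalanced, while the two-node chain beneath it passes.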
C(2). Searching, Dumping, Destroying
Balancing a binary tree requires a minor change to the node structure. We have to add a “balance” field, to keep track of the balance factor:
/* 7 */
typedef struct node
{
    char        *key;
    struct node *left;
    struct node *right;
    int         balance;
    char        *data;
} node;
The “createnode” routine initializes this new field to zero.
Searching, dumping and destroying the tree are identical to the routines for the simple binary tree, already shown above.
C(3). Balancing the Tree
We do insertions just the same way as for a simple binary tree, until we unbalance the tree. Then we have to rebalance it. We know the tree is out of balance if we get a node with a balance factor of ±2. Fortunately, balancing a tree turns out to be a localized operation.
Suppose that as we search the tree for the place to insert the new node, H is the last node we pass with a balance factor of ±1. Then the tree is imbalanced after the insertion if and only if H’s balance factor becomes ±2.
Proof: if H’s balance factor is ±2, then the tree is imbalanced by definition. On the other hand, if the tree is imbalanced, where is the ±2 node? It isn’t below H, since those nodes all had zero balance factors and can now be at most ±1 (in fact, they are all ±1; none remains at zero). If H’s balance factor isn’t ±2 after the insertion, then either it stayed ±1 or it went to zero. In the first case, the heights of H’s subtrees didn’t change. In the second, the height of the shorter tree increased. In both cases, the height of H itself remained the same, so the balance factors of all the nodes above H remained the same. Thus, if H isn’t the ±2 node, then the tree is balanced, contrary to assumption.
What’s more, the process of rebalancing H’s subtree will leave the height of the subtree the same as its height before the insertion. So after balancing the subtree, the entire tree will be balanced. This means that we need only concern ourselves with a handful of nodes around H: four nodes, at most.
C(3)(a). Left-left insertion
The insertion algorithm divides into four cases. After finding node H, we move either left or right to continue searching for the place to insert the new node, and after the next node, we again move left or right. So the “left-left” case is the one in which we took two left branches after finding H.
C(3)(a)(i). Before Inserting the New Node
Here’s the immediate situation before inserting the new node:
Figure 1: Left-left case before insertion.
The ovals represent nodes; the capital letters are node names, and the numbers after the colons are the balance factors. The rectangles represent subtrees; the lowercase letters are their names, and the expressions under the horizontal lines are tree heights. Node P is the parent of node H, the high node, which is in turn the parent of L, the low node. The new node will go into subtree a.
P’s balance factor is irrelevant. H’s is ±1 by definition (this is the same H as was discussed earlier, the last node with a nonzero balance factor), and L’s is zero for the same reason (definition of H). Since we’re adding the new node to the left subtree of H, which will increase H’s balance factor and unbalance the tree, H’s balance factor must be +1, not -1.
L’s balance factor is zero, so a and b have the same height, call it h. Then the height of L is h+1. H’s balance factor is +1, so c’s height is one less than L’s, or h.
Also, note that the height of H is h+1. And listing the parts of the subtree in order of keys, we get: a, L, b, H, c, since L, a and b are to the left of H and c to the right, and L is between a and b. These facts are important; I have already promised that inserting the new node and rebalancing the tree will not change the subtree’s height, and the order of the nodes must be preserved by any balancing transformation.
C(3)(a)(ii). After Inserting the New Node
Now we insert the new node in subtree a:
Figure 2: Left-left case after insertion.
The new node increases the height of a by 1, to h+1. The height of b remains h, so L’s balance factor is now +1. The height of L goes to h+2, which makes H’s balance factor +2, since the height of c is unchanged. The tree is now out of balance.
C(3)(a)(iii). After Rebalancing the Tree
At this point, we need to perform some transformation on the tree that will bring it back into balance. I’m going to just pull it out of the hat:
Figure 3: Left-left case after transformation.
What we did was move subtree b from L over to H, make H the right child of L, and so lift subtree a, the big one, higher. The heights of a, b and c haven’t changed, and it’s easy to show that the new balance factors are correct. The important question is, is this still a properly ordered binary tree?
Recall the order of nodes before the insertion: a, L, b, H, c. Has the order changed? Now a is to the left of L, and H, b and c to the right; H is between b and c: a, L, b, H, c. The order is the same.
Next, I promised that the height of the subtree would be the same after rebalancing as before insertion. The height before was h+1; it’s easy to see that it’s h+1 now. This fact allows us to ignore all the nodes above the subtree.
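The left-left transformation can be sketched on its own, apart from the search. This is an illustration under assumptions, not the article's routine: the names rnode, rotate_left_left, inorder and order_is are hypothetical, nodes carry a one-letter name instead of a key, and the function returns the new subtree root so that the caller can store it in P's link. The helper order_is checks that the key order a, L, b, H, c survives the change.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node: a one-letter name stands in for the key. */
typedef struct rnode {
    char name;
    struct rnode *left, *right;
    int balance;
} rnode;

/* Left-left case: H is the +2 node, L its left child.
   Returns L, the new root of the subtree. */
rnode *rotate_left_left(rnode *high)
{
    rnode *low;

    assert(high != NULL && high->left != NULL);
    low = high->left;
    high->left = low->right;    /* subtree b moves from L to H */
    low->right = high;          /* H becomes L's right child */
    high->balance = 0;
    low->balance = 0;
    return low;
}

/* Write node names in key order; returns the next free position. */
char *inorder(const rnode *t, char *out)
{
    if (t == NULL)
        return out;
    out = inorder(t->left, out);
    *out++ = t->name;
    return inorder(t->right, out);
}

/* 1 if the names in key order spell `expected`.
   Sketch limit: at most 15 nodes. */
int order_is(const rnode *t, const char *expected)
{
    char buf[16] = {0};

    inorder(t, buf);
    return strcmp(buf, expected) == 0;
}
```

Taking one-node subtrees for b and c, and a subtree a that has just gained a new left child n, the order reads n, a, L, b, H, c both before and after the call, and both L and H end with a balance factor of zero.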
C(3)(b). Left-right insertion
A left-right insertion looks like a left-left insertion, except that the new node goes in subtree b instead of subtree a:
Figure 4: Left-right case after insertion.
Unfortunately, we can’t employ the same trick. Shifting b to H and raising L will result in exactly the same picture as above, only reversed horizontally. We have to take a different view of the situation:
Figure 5: Left-right case before insertion.
C is the root of our former subtree b, which is now split into subtrees b and c; the former subtree c is now subtree d. The height of C is the same as the height of the old b, or h, so the heights of the new b and c are both h-1 (they’re equal because the balance factor of C is zero, by definition of H).
The new node goes into either b or c. It doesn’t matter much which; this will only affect balance factors after the rebalancing, and we can handle it then. So the tree after insertion looks like this:
Figure 6: Left-right case after insertion.
(Of course, the new node goes in only one of b and c, not in both! I had to draw it somehow...) The new heights and balance factors are easy to confirm.
And the balancing transformation is:
Figure 7: Left-right case after transformation.
L’s balance factor will be either 0 or +1, and H’s either -1 or 0, depending on whether the new node went into b or c. We have to check to make sure that neither the node order nor the height has changed (they haven’t).
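Here is a sketch of the left-right transformation by itself. It differs from the article's routine in one labeled respect: instead of consulting the stack of search directions, it reads C's own balance factor after the insertion (+1 if the new node went into b, -1 if it went into c, 0 if C is itself the new node) to decide the final factors of L and H. The names dnode and rotate_left_right are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node: a one-letter name stands in for the key. */
typedef struct dnode {
    char name;
    struct dnode *left, *right;
    int balance;
} dnode;

/* Left-right case: H is the +2 node, L its left child and
   C the right child of L. Returns C, the new subtree root. */
dnode *rotate_left_right(dnode *high)
{
    dnode *low, *child;

    assert(high != NULL && high->left != NULL
        && high->left->right != NULL);
    low = high->left;
    child = low->right;
    low->right = child->left;   /* subtree b goes to L */
    high->left = child->right;  /* subtree c goes to H */
    child->left = low;
    child->right = high;
    /* new node in c: L is left with a (h) over b (h-1) */
    low->balance = (child->balance < 0) ? 1 : 0;
    /* new node in b: H is left with c (h-1) against d (h) */
    high->balance = (child->balance > 0) ? -1 : 0;
    child->balance = 0;
    return child;
}
```

With one-node subtrees a and d, an empty c, and a new node n forming subtree b, the call returns C with L on its left and H on its right; L ends at 0 and H at -1, matching the first of the two outcomes described above.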
C(3)(c). Right-right insertion
The right-right insertion is just the reverse of the left-left, so I’ll content myself with showing the pictures, and refer you back to left-left insertion for explanation.
Figure 8: Right-right case before insertion.
Figure 9: Right-right case after insertion.
Figure 10: Right-right case after transformation.
C(3)(d). Right-left insertion
The right-left insertion is just the reverse of the left-right, so I’ll content myself with showing the pictures, and refer you back to left-right insertion for explanation.
Figure 11: Right-left case before insertion.
Figure 12: Right-left case after insertion.
Figure 13: Right-left case after transformation.
C(4). The Insertion Algorithm
The above exhausts all possible cases when an insertion unbalances the tree. What about insertions that leave the tree balanced? We can’t simply ignore these, since they change balance factors as well.
Suppose a node has a balance factor of zero (which means its two subtrees have the same height), and one subtree gets higher. There are two results: the node’s balance factor changes to +1 (if the left subtree grew) or -1 (if the right subtree grew), and the height of the node increases by 1. This second result will force a reevaluation of the node’s parent, and the process will continue back up the search path so long as balance factors are zero. So we can see that inserting the new node as the first child of its parent sets off a chain reaction back up the tree.
What if a node has a balance factor of ±1, and one subtree gets higher? If the subtree that already is higher is the one to grow, then we’ve unbalanced the tree. If the shorter subtree gets higher, then the node’s balance factor goes to zero. The first case we handled above; neither case changes the height of the node, so both kill the chain reaction.
So the high-level view of the insertion algorithm is this: First, insert the new node in the tree. Then, back up along the search path for as long as the nodes have zero balance factors, changing those balance factors to +1 or -1 according to whether the path went left or right when leaving the node. When a nonzero balance factor finally appears, increment or decrement it, according to the same test. If the result is zero, we’re done. If it’s ±2, the tree is unbalanced, so apply the appropriate balancing transformation.
C(5). The “insert” Routine
The following routine uses one cute trick: it keeps a stack of search directions in an int variable, and returns it as its result. This requires some explanation.
The integer has 32 bits. I divide it into eight 4-bit fields. The routine is called recursively going down the search path; it returns values going back up. The value it returns is the value it got, shifted left by one field, and with the direction from “parent” in the rightmost field: 1 for left, 2 for right.
The function result thus shows the direction from “parent” in the rightmost field, the direction from parent’s child in the next field, the direction from parent’s grandchild in the next field, and that’s as far as we need to go.
There are two special values: zero means “no further adjustment of balance factors is necessary”, and 4 means “error”, which in this context means a duplicate key (there are other possible errors but I cleaned out the error-detection code for simplicity’s sake).
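The packing of search directions into 4-bit fields can be tried out in isolation. The sketch below is only an illustration: the names push_dir and dir_at and the DIR_ constants are hypothetical, and it assumes, as the text does, an unsigned int of at least 32 bits.

```c
#include <assert.h>

/* Field values described in the text (names are hypothetical). */
enum { DIR_NONE = 0, DIR_LEFT = 1, DIR_RIGHT = 2, DIR_ERROR = 4 };

/* Shift the stack left by one 4-bit field and put the newest
   direction in the rightmost field. */
unsigned int push_dir(unsigned int stack, unsigned int dir)
{
    assert(dir == DIR_LEFT || dir == DIR_RIGHT);
    return (stack << 4) + dir;
}

/* Read field n, counting from the right; field 0 is the most
   recently pushed direction. Assumes at least 32 bits. */
unsigned int dir_at(unsigned int stack, unsigned int n)
{
    assert(n < 8);
    return (stack >> (4 * n)) & 0xFu;
}
```

Starting from DIR_LEFT (the new node is a left child) and pushing DIR_RIGHT gives 0x12: the rightmost field says the path turned right at the upper node and the next field says it then turned left, which is the right-left pattern. Pushing one more DIR_LEFT gives 0x121, whose third field from the right is what a mask of 0xF00 picks out.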
/* 8 */
// insert
unsigned int insert(node *parent,
                    char *thekey, char *thedata)
{
    int compare;
    node *high;
    unsigned int result;
    node *low;
    node *child;
    // if thekey matches this node,
    // then return an error
    // (duplicate keys)
    compare = strcmp(thekey, parent->key);
    if (compare == 0)
        return(4);
    // if there's a slot for the new node,
    // then create it and return
    // 1 if it is to the left of
    // parent and 2 if to the right
    // else determine high, the next step
    // in the search path
    if (compare < 0)
    {
        high = parent->left;
        if (high == nil)
        {
            parent->left = createnode(
                thekey, thedata);
            return(1);
        }
    }
    else
    {
        high = parent->right;
        if (high == nil)
        {
            parent->right = createnode(
                thekey, thedata);
            return(2);
        }
    }
    // now continue the search by
    // calling "insert" recursively
    // if the result is 0 (no balancing
    // needed at this level) or
    // the result is 4 (duplicate
    // key), return
    result = insert(high,
        thekey, thedata);
    if ((result == 0) || (result == 4))
        return(result);
    // the low 4 bits of "result" indicate
    // the direction of the search path
    // leaving "high";
    // increment high's balance if the path
    // goes left, and decrement it if the
    // path goes right
    if (result % 16 == 1)
        high->balance++;
    else
        high->balance--;
    // if the balance is now zero, no further
    // correction of balances is needed,
    // so return
    // if it's ±1, push the direction away
    // from "parent" onto the "result"
    // stack and return it
    // if it's ±2, continue to balancing
    // transformation
    switch (high->balance)
    {
        case 0:
            return(0);
        case 1:
        case -1:
            return((result << 4)
                + ((compare < 0) ? 1 : 2));
        case 2:
        case -2:
            break;
    }
    // use the direction away from "high" to
    // find "low", then switch on that
    // direction and the next one
    // the easiest way to follow these
    // transformations is to refer to the
    // pictures
    low = (result % 16 == 1) ?
        high->left : high->right;
    switch (result % 256)
    {
        case 0x11: // left-left case
            high->left = low->right;
            low->right = high;
            if (compare < 0)
                parent->left = low;
            else
                parent->right = low;
            high->balance = 0;
            low->balance = 0;
            return(0);
        case 0x12: // right-left case
            child = low->left;
            high->right = child->left;
            low->left = child->right;
            if (compare < 0)
                parent->left = child;
            else
                parent->right = child;
            child->left = high;
            child->right = low;
            child->balance = 0;
            low->balance = 0;
            high->balance = 0;
            // third field: direction taken from "child"
            result &= 0xF00;
            if (result == 0x100)
                low->balance = -1;
            else if (result == 0x200)
                high->balance = 1;
            return(0);
        case 0x21: // left-right case
            child = low->right;
            low->right = child->left;
            high->left = child->right;
            if (compare < 0)
                parent->left = child;
            else
                parent->right = child;
            child->left = low;
            child->right = high;
            child->balance = 0;
            low->balance = 0;
            high->balance = 0;
            // third field: direction taken from "child"
            result &= 0xF00;
            if (result == 0x100)
                high->balance = -1;
            else if (result == 0x200)
                low->balance = 1;
            return(0);
        case 0x22: // right-right case
            high->right = low->left;
            low->left = high;
            if (compare < 0)
                parent->left = low;
            else
                parent->right = low;
            high->balance = 0;
            low->balance = 0;
            return(0);
    }
}
D. Conclusion
Before I depart, let me mention the existence of B-trees, which are sort of generalized binary trees. B-trees have several keys at each node, and several plus one pointers. For example, a 2-3 tree has two keys and three pointers per node. But not all of these keys need be used; a node in a 2-3 tree may have one or two keys. This makes the tree structure flexible enough to add the requirement that all leaves (nodes with no children) be the same distance from the root.
I think there are also B*-trees, or am I confusing this with C*-algebras? Anyway, B-trees are a hot topic these days but I have no experience with them.
Next time, I’ll return to lexical analysis and do the new Canon tool. See you then.