Notes on Trees

Trees

Please note: There were a few semantic typos in the pseudocode. Please make sure you understand the current version and verify for yourself that it is correct. find() and remove() were modified from their original form.

Basics

A tree is defined as either an empty tree or a node (or vertex) with zero or more subtrees. The subtrees are themselves trees by this same definition. Note that this is a recursive definition of a tree. The base case is a tree that is empty (i.e. has no nodes).

Terminology

The top of a given tree is a distinguished node called the root of the tree. The connections from the root to its subtrees are called edges. Each of the root's subtrees may also have its own root (after all, it is its own tree). If the root has k subtrees (numbered [1..k]), the roots of each of these subtrees is said to be a child of the root of the tree. Similarly, the root is called the parent of these children. Children of the same parent node are called siblings. The analogy to geneology can be extended ad nauseum.

Any node in a tree may have children. A node with no children (that is, it has only empty subtrees or no subtrees at all) is called a leaf node. Non-leaf nodes are called internal nodes. The branch factor of a tree is the maximum number of children for a single node in the tree. That is, in a tree with a branch factor of three, no node has more than three children.

A path in a tree is a sequence of nodes (not edges) such that for each pair of nodes, n_i is the parent of n_i+1 in that sequence. That is, the first node in the path is the parent of the second, the second is the parent of the third, etc. For any two nodes n_i and n_j, n_i is an ancestor of n_j if and only if i < j on some path in the tree. Similarly, n_j is said to be a descendant of n_i. Note that any node in a general tree may be uniquely identified by the path from the root to that node. Note also that root is an ancestor of all of the nodes in its subtrees and that every node in one of root's subtrees is a descendant of the root.

The length of a path is the number of edges crossed in following that path. This is always one less than the number of nodes on the path. The depth of a node is the length of the path from the root to that node. The depth of a tree is the depth of its deepest leaf. Similarly the height of a node is the depth of the subtree rooted at that node. The height of a tree is the height of its root.

Binary Trees

Binary trees are the same general trees defined above with the additional restriction that their branch factor is two.

Binary Tree Traversals

There three basic ways to traverse binary trees. They are named for when the root is visited in relation to its children. preorder(t) visit(t) preorder(left(t)) preorder(right(t)) postorder(t) postorder(left(t)) postorder(right(t)) visit(t) inorder(t) inorder(left(t)) visit(t) inorder(right(t)) Note that preorder and postorder can be extended easily for trees with an arbitrarily high branch factor but that inorder traversals really only make sense on binary trees.

The call to visit(t) can perform whatever operation is necessary at that node (e.g. print the value).

Binary Search Trees

A binary search tree (or BST) is a binary tree with the additional restriction that for a given node with a value k, all of the values in k's left subtree are less than the value of k and all of the values in k's right subtree are greater than its value. Duplicates may either be ignored or stored in some sort of data structure at each node depending on your application.

Insertion

Insertion into a BST looks roughly like this: insert(v, t) if (t is empty) return new node(v, empty, empty) else if (v < value(t)) return node(value(t), insert(v, left(t)), right(t)) else return node(value(t), left(t), insert(v, right(t))) That is, look at the current node's value. If v is less than it, insert this value into the left subtree of t. If v is greater than it, insert this value into the right subtree of t. If t is empty, make a new tree of just this node and return it.

Find

BST find is almost identical. This version returns the subtree rooted at v or empty if the value is not found. find(v, t) if (t is empty) return empty else if (v == value(t)) return t else if (v < value(t)) return find(v, left(t)) else return find(v, right(t))

Deletion

In deleting a node we must be careful to maintain the BST properties. Deletion starts with a find on that node. When that node is found, we want to leave the structure of the tree intact while removing this value. That means we need to find a replacement for the value in the tree. In order to maintain the BST property, we need another value that will maintain that property. Where could we find such a value?

Successors and Predecessors

Note that if you print a BST using an inorder traversal, you will end up with a sorted list of values. Note that if we were to pick out the root from this list, all of the values to the left of it would be in its left subtree and all of the values to the right of it would be in its right subtree.

Now realize that if we remove a node from a BST, this traversal needs to remain in order, but the value at that node must be gone. What fills its place? In the traversal, it's replaced by either the next smallest value in the tree or the next largest. These nodes are called the predecessor and successor respectively. These are the candidates for replacing a given node. Why?

It turns out that if you replace a node with the next smallest value in its subtrees, all of the values in its smaller subtree are still smaller than that value. This is true because the predecessor is the largest value from the smaller subtree. This case is symmetric with choosing the next largest value. The successor will be the smallest value in the right subtree and the predecessor will be the largest value in the left subtree. Verify for yourself that this is true.

Given that, the algorithm for removal from a BST is relatively simple:

remove(v, t) if (t is empty) // Not in tree - do nothing return if (v < value(t)) remove(v, left(t)) if (v > value(t)) remove(v, right(t)) else if (left(t) is empty) if (right(t) is empty) delete t else value(t) = findMin(right(t)) remove(findMin(right(t)), right(t)) else value(t) = findMax(left(t)) remove(findMax(left(t)), left(t)) Note that at the end it's important to remove the replacement from its original subtree since a given value cannot be in two places in a BST at the same time.