Union find

Union find is an algorithm which uses a disjoint-set data structure to solve the following problem: Say we have some number of items. We are allowed to merge any two items to consider them equal (where equality here obeys all of the properties of an Equivalence Relation). At any point, we are allowed to ask whether two items are considered equal or not.

The most naive method involves using multiple lists. To merge two elements, we will find which lists they are in, and then append one of them to the other. To check for equality, we will again find which lists the two elements are in, and check that they are the same list. Each operation in this case takes $$O(n)$$ time in the worst case, for a total of $$O(n^2)$$ runtime.

Consider a different implementation of the naive method. Each element will be part of a node, which will contain the element and a pointer to the node's parent, if it exists. Thus the structure ends up looking like a forest of trees, where each node points to its parent. If two elements are in the same tree, then they are considered equal. To find out if they are in the same tree, we note that this can be uniquely determined by finding the root of the tree, which is considered the "representative element" of this tree. So to find the representative element for some arbitrary element, get the appropriate node and follow parent pointers until we can no longer. To merge two of these trees, take one of the representative elements and set its parent to be the other.

In the following, let a find operation be the operation of getting the representative element.

This implementation still takes $$O(n)$$ per operation, but allows for improvements.

The first optimization is known as union by rank. We augment the node structure by adding a new parameter rank, which is initialized to 0 for all elements. When we want to merge two elements, we will first find the representative elements. If they are different, then check their ranks. If one is greater than the other, then set the parent of the node with smaller rank to be the node with greater rank. Otherwise, arbitrarily pick one node. Set its parent to be the other, and increment that other node's rank by 1. This heuristic attempts to add smaller trees to larger trees, meaning that when we need to run a find operation, the depth of the tree is as small as possible. Using this heuristic brings the running time to $$O(log n)$$ per operation, amortized.

The second optimization is called path compression. It attempts to flatten the tree structure as much as possible whenever a find operation is performed. The basic idea is that as we traverse the parent pointers to the root, since all of the elements found along the way share the same root, then the parent of those elements might as well be the root itself. Thus we perform one traversal to find out what the root, is, and then we perform a second traversal, setting each node's parent pointer to the root. This has the effect of majorly flattening the structure, decreasing the time required for future operations. Using this heuristic alone has a running time to $$O(log n)$$ per operation, amortized.

Note the use of the word "alone" in the above paragraph. What happens if both heuristics are used simultaneously? In this case, the amortized running time per operation is $$O(\alpha (n))$$, where $$\alpha$$ denotes the inverse of the Ackermann function. This function grows so slowly that for all practical values of $$n$$, $$\alpha(n) \leq 4$$. Effectively, this term is a small constant. In fact, it has been shown that this running time is optimal.