Longest common subsequence

Longest Common Subsequence (LCS) is the problem of finding the longest subsequence that two sequences of items have in common. It is used, for example, in the "diff" file comparison utility.

The standard solution uses dynamic programming.

Overview
The problem is usually defined as:

Given two sequences of items, find the longest subsequence present in both of them. A subsequence is a sequence whose items appear in the same relative order as in the original, but are not necessarily contiguous. For example, in the string "abcdefg", "abc", "abg", "bdf", "aeg" are all subsequences.
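The definition can be checked mechanically. The following is a minimal sketch, where the helper name `is_subsequence` is an assumption made for illustration:

```python
def is_subsequence(candidate, text):
    """Return True if candidate can be obtained from text by deleting items."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until it finds ch, so each
    # match must come after the previous one, which enforces relative order.
    return all(ch in remaining for ch in candidate)


for candidate in ["abc", "abg", "bdf", "aeg", "gfe"]:
    print(candidate, is_subsequence(candidate, "abcdefg"))
```

The last candidate, "gfe", uses only characters of "abcdefg" but in the wrong order, so it is not a subsequence.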

A naive exponential-time algorithm follows from the observation that a string of length $$n$$ has up to $$2^n$$ different subsequences: we can take the shorter string and test each of its subsequences for presence in the other string, using a greedy left-to-right scan.
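A rough sketch of this brute-force approach is given below. The function names are hypothetical, and trying the longest candidates first is one possible choice that lets the search stop at the first hit:

```python
from itertools import combinations


def is_subsequence(candidate, text):
    """Greedy scan: match each item of candidate at its earliest position."""
    pos = 0
    for item in text:
        if pos < len(candidate) and candidate[pos] == item:
            pos += 1
    return pos == len(candidate)


def lcs_brute_force(x, y):
    """Try subsequences of the shorter string, longest first."""
    shorter, longer = (x, y) if len(x) <= len(y) else (y, x)
    for length in range(len(shorter), -1, -1):
        # combinations() preserves input order, so each tuple is a subsequence.
        for candidate in combinations(shorter, length):
            if is_subsequence(candidate, longer):
                return "".join(candidate)
    return ""


print(lcs_brute_force("AGGTAB", "GXTXAYB"))
```

This is only practical for very short inputs, since the number of candidates doubles with every extra character in the shorter string.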

Recursive solution
We can try to solve the problem in terms of smaller subproblems. We are given two strings x and y, of length n and m respectively. We solve the problem of finding the longest common subsequence of $$x=x_{1...n}$$ and $$y=y_{1...m}$$ by taking the best of the three possible cases:


 * 1) The longest common subsequence of the strings $$x_{1...n-1}$$ and $$y_{1...m}$$
 * 2) The longest common subsequence of the strings $$x_{1...n}$$ and $$y_{1...m-1}$$
 * 3) If $$x_n$$ is the same as $$y_m$$, the longest common subsequence of the strings $$x_{1...n-1}$$ and $$y_{1...m-1}$$, followed by the common last character.

The base case: when one of the sequences is empty, their only common subsequence is the empty sequence of length 0.
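Restricted to lengths, the three cases and the base case can be summarized as a recurrence. Here $$L(i,j)$$ is notation introduced only for this summary and denotes the length of the longest common subsequence of $$x_{1...i}$$ and $$y_{1...j}$$:

$$
L(i,j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
\max\bigl(L(i-1,j),\ L(i,j-1),\ L(i-1,j-1) + 1\bigr) & \text{if } x_i = y_j \\
\max\bigl(L(i-1,j),\ L(i,j-1)\bigr) & \text{otherwise}
\end{cases}
$$

The answer to the whole problem is $$L(n,m)$$.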

It is easy to construct a recursive solution from this (in Python):
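A minimal sketch of such a recursion, assuming the inputs are Python strings and using the hypothetical name `lcs_recursive`, could be:

```python
def lcs_recursive(x, y):
    """Return one longest common subsequence of x and y (exponential time)."""
    # Base case: an empty sequence shares only the empty subsequence.
    if not x or not y:
        return ""
    # Cases 1 and 2: drop the last character of x, or of y.
    best = max(lcs_recursive(x[:-1], y), lcs_recursive(x, y[:-1]), key=len)
    # Case 3: if the last characters match, extend the LCS of both prefixes.
    if x[-1] == y[-1]:
        candidate = lcs_recursive(x[:-1], y[:-1]) + x[-1]
        best = max(best, candidate, key=len)
    return best


print(lcs_recursive("AGGTAB", "GXTXAYB"))
```

When several subsequences share the maximal length, this sketch returns whichever one `max` encounters first.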

and this is in C++ :

and this is in C :

Dynamic programming
The recursive solution is still not very efficient: it takes exponential time because it solves the same subproblems over and over. Since there are only $$(n+1)(m+1)$$ distinct subproblems, we can use memoization to solve each of them once. A slightly more efficient way, which avoids the overhead of function calls, is to order the computation so that whenever the results of subproblems are needed, they have already been computed and can simply be looked up in a table. This is called dynamic programming.
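The memoized variant can be sketched as follows. This version assumes we only want the length, indexes prefixes by $$(i,j)$$ rather than slicing strings, and uses the standard `functools.lru_cache` decorator as the cache. The names `lcs_length_memo` and `solve` are illustrative:

```python
from functools import lru_cache


def lcs_length_memo(x, y):
    """Length of the LCS of x and y, computed top-down with a cache."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # solve(i, j) is the LCS length of the prefixes x[:i] and y[:j].
        if i == 0 or j == 0:
            return 0
        best = max(solve(i - 1, j), solve(i, j - 1))
        if x[i - 1] == y[j - 1]:
            best = max(best, solve(i - 1, j - 1) + 1)
        return best

    return solve(len(x), len(y))


print(lcs_length_memo("AGGTAB", "GXTXAYB"))
```

Because the recursion depth can reach about $$n + m$$, long inputs may hit Python's recursion limit. That is one practical reason to prefer the table-based ordering.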

In this case, we find $$lcs(x_{1...i},y_{1...j})$$ for every $$i$$ and $$j$$, starting from the smallest ones, storing the results in an array at index $$(i,j)$$ as we go along.
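One way to write this bottom-up computation is sketched below. It assumes the table stores lengths, with a separate backward walk to recover an actual subsequence; the names `lcs_table` and `lcs_dp` are hypothetical:

```python
def lcs_table(x, y):
    """Fill table[i][j] with the LCS length of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    # Row 0 and column 0 stay 0: the base case where one sequence is empty.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(table[i - 1][j], table[i][j - 1])
            if x[i - 1] == y[j - 1]:
                best = max(best, table[i - 1][j - 1] + 1)
            table[i][j] = best
    return table


def lcs_dp(x, y):
    """Return one longest common subsequence of x and y."""
    table = lcs_table(x, y)
    i, j = len(x), len(y)
    out = []
    # Walk back from (n, m), collecting matched characters in reverse.
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs_dp("AGGTAB", "GXTXAYB"))
```

Each cell depends only on cells with smaller indices, so the row-by-row loop order guarantees that every value is available when it is needed.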

Notice how closely the table-based computation parallels the recursive solution above, while entirely eliminating recursive calls. This "small" change makes the difference between exponential time and polynomial time: filling a table of lengths takes $$O(nm)$$ time.
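To get a feel for the gap, one can count the work each approach does. The sketch below uses two strings with no characters in common, chosen as an illustrative bad case for the plain recursion, and compares its number of calls with the number of table cells. Both function names are hypothetical:

```python
def count_naive_calls(x, y):
    """Number of calls the plain (unmemoized) recursion makes on x and y."""
    if not x or not y:
        return 1
    calls = 1 + count_naive_calls(x[:-1], y) + count_naive_calls(x, y[:-1])
    if x[-1] == y[-1]:
        calls += count_naive_calls(x[:-1], y[:-1])
    return calls


def count_table_cells(x, y):
    """Number of entries the dynamic programming table holds."""
    return (len(x) + 1) * (len(y) + 1)


for n in (4, 8, 10):
    x, y = "a" * n, "b" * n
    print(n, count_naive_calls(x, y), count_table_cells(x, y))
```

The call count grows roughly by a factor of four each time $$n$$ increases by one, while the table grows only quadratically.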