Classic DP: Edit Distance

Translator: Master-cai

Author: labuladong

Few days ago, I saw an interview paper of Tencent. In this paper, most of the algorithm problems are Dynamic programming. The last question is to write a function which calculates the shortest Editing distance. Today I wrote an article specifically to discuss this problem.

I personally like this problem because it looks very hard, yet the solution is surprisingly simple and beautiful. Though it's a rare algorithm, it's is very useful (yech, I recognize that many algorithm problems are not very useful). The problem is as follows:

Why did I say this problem is hard? Obviously, it's just hard, making people helpless and frightened.

And why did I say this problem is useful? Because a few days ago I used the algorithm in my daily life. I had an article in my Wechat Official Account and I wrote some words out of place by mistake. So I decided to modify this part to make the logic suitable. However, the Wechat Official Account article can only be modified 20 words at most, and it only supports addition, deletion and replacement (exactly same as the editing distance problem). So I used the algorithm to find the best way to solve the problem in just 16 steps.

Another advanced example is that the edit distance can be used to measure the similarity of two DNA sequences. The DNA sequence is a sequence consisting of A, G, C and T, which is similar to a string. The shorter the editing distance is, the more similar the two DNA are. Maybe the owners of these DNAs were ancient relatives.

Let's get to the point, I will explain you how to edit the distance in detail, and I hope you could obtain something fruitful.

1. train of thought

The editing distance is a problem where we are given two strings s1 and s2 with only three operations, and we have to transform s1 into s2 in fewest steps. The first thing to be sure of is that the results of s1 to s2 and s2 to s1 are the same. So we will use s1 to s2 as an example.

Mentioned in the early paper "The longest common subsequence", I said that to solve the dynamic programming problem of two strings, we normally use two pointers i, j to point to the ends of the two strings, and then go forward step by step to reduce the size of the problem.

Assuming that the two strings are "rad" and "apple", in order to change s1 tos2, the algorithm works like this:

Remember this gif in order to solve the editing distance problem. The key is how to make

the right operation which I will discuss later.

According to the above gif, we can figure out that there are not only three operations; in fact, there is a fourth operation which is skip. For example:

As the two characters are the same, obviously there should be no operation to minimize the distance. Just move i, j.

Another simple situation is when j has finished s2, if i has not finished s1, then you can only delete s1 to make them the same. For example:

Similarly, if i finished s1 and j has not finished s2, you can only insert all the remaining characters of s2 into s1. As you see, the two cases are the base case of the algorithm.

Let's look at how to change your ideas into code. Sit tight, it's time to go.

2. code in detail

First we sort out our ideas:

The base case is when i finished s1 or j finished s2, we can directly return the remaining length of the other string.

For each pair of characters, s1[i] and s2[j], there are four operations:

if s1[i] == s2[j]:
    skip
    i, j move forward
else:
    chose:
        insert
        delete
        replace

With this framework, the problem has been solved. Maybe you will ask, how to chose the "three choices"? It's very simple: try it all, and chose the smallest one. we need some recursive skills here. Look at the code:

def minDistance(s1, s2) -> int:

    def dp(i, j):
        # base case
        if i == -1: return j + 1
        if j == -1: return i + 1

        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)  # skip
        else:
            return min(
                dp(i, j - 1) + 1,    # insert
                dp(i - 1, j) + 1,    # delete
                dp(i - 1, j - 1) + 1 # replace
            )

    # i,j initialize to the last index
    return dp(len(s1) - 1, len(s2) - 1)

Let's explain this recursive code in detail. There is no need to explain the base case, so I'll mainly explain the recursive part.

It is said that recursive code is very interpretable. It does make sense. As long as you understand the definition of a function, you can clearly understand the logic of the algorithm. The function dp(i, j) is defined like this:

def dp(i, j) -> int
# return the least editing distance s1[0..i] and s2[0..j]

Remember this definition, let's look at the code:

if s1[i] == s2[j]:
    return dp(i - 1, j - 1)  # skip
# explanation:
# already the same, no need of any operations
# the least editing distance of s1[0..i] and s2[0..j] equals
# the least distance of s1[0..i-1] and s2[0..j-1]
# It means that dp(i, j) equals dp(i-1, j-1)

if s1[i] != s2[j], we should recurse the three operations which needs a bit of thing:

dp(i, j - 1) + 1,    # insert
# explanation:
# Directly insert the same character as s2[j] at s1[i]
# then s2[j] is matched,move forward j,and continue comparing with i
# Don't forget to add one to the number of operations
dp(i - 1, j) + 1,    # delete
# explanation:
# Directly delete s[i]
# move i forward,continue comparing with j
# add one to the number of operations
dp(i - 1, j - 1) + 1 # replace
# explanation:
# Directly replace s1[i] with s2[j], then they are matched
# move forward i,j and continue comparing
# add one to the number of operations

Now, you should fully understand this short and clever code. Another small problem is that this is a brute force solution. There are many overlapping subproblems, which should be optimized by dynamic programming techniques.

How can we see the overlapping subproblems at a glance? As mentioned in the previous article "Regular Expressions for Dynamic Programming", we need to abstract the recursive framework of the algorithm in this article:

def dp(i, j):
    dp(i - 1, j - 1) #1
    dp(i, j - 1)     #2
    dp(i - 1, j)     #3

For the subproblem dp(i-1, j-1), how can we get it from the original question dp(i, j)? Once we found a repetitive path, it means that there is a huge number of repetitive paths, which is the overlapping subproblem. For example: dp(i, j)-> #1 and dp(i, j)->#2->#3.

3. Optimized by Dynamic programming

For the overlapping subproblems, we introduced in the previous article "Detailed Explanation of Dynamic Programming" in detailed. The optimization is nothing more than a memo or a DP table.

The memo is easy to append, just modified the original code slightly.

def minDistance(s1, s2) -> int:

    memo = dict() # memo
    def dp(i, j):
        if (i, j) in memo: 
            return memo[(i, j)]
        ...

        if s1[i] == s2[j]:
            memo[(i, j)] = ...  
        else:
            memo[(i, j)] = ...
        return memo[(i, j)]

    return dp(len(s1) - 1, len(s2) - 1)

We mainly explain the DP table solution.

First, we declare the meaning of the dp array. The dp array is a two-dimensional array, which looks like this:

With the foundation of the previous recursive solution, it's easy to understand. dp [..][0] and dp [0][..] correspond to the base case. The meaning of dp [i][j] is similar to the previous dp function:

def dp(i, j) -> int
# return the least editing distance of s1[0..i] and s2[0..j]

dp[i-1][j-1]
# store the least editing distance of s1[0..i] and s2[0..j]

The base case of the dp function is that i, j is equal to -1. However, the array index is at least 0, so the dp array is offset by one position.

Since the dp array has the same meaning as the recursive dp function, you can directly apply the previous ideas to write code. The only difference is that the DP table is solved bottom-up, and the recursive solution is solved top-down:

int minDistance(String s1, String s2) {
    int m = s1.length(), n = s2.length();
    int[][] dp = new int[m + 1][n + 1];
    // base case 
    for (int i = 1; i <= m; i++)
        dp[i][0] = i;
    for (int j = 1; j <= n; j++)
        dp[0][j] = j;
    // from the bottom up
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            if (s1.charAt(i-1) == s2.charAt(j-1))
                dp[i][j] = dp[i - 1][j - 1];
            else               
                dp[i][j] = min(
                    dp[i - 1][j] + 1,
                    dp[i][j - 1] + 1,
                    dp[i-1][j-1] + 1
                );
    // store the least editing distance of s1 and s2
    return dp[m][n];
}

int min(int a, int b, int c) {
    return Math.min(a, Math.min(b, c));
}

4. Extension

Generally speaking, when dealing with the dynamic programming of two strings, we just follow the ideas of this article, making the DP table. Why? Because it's easy to find out the relationship of the state transitions, such as the DP table of the edit distance:

There is another detail: since every dp[i][j] is only related to the three status, the space complexity can be reduced to O(min(M, N)) (M, N is the length of the two strings). It's not very difficult but the code is harder to read. You can try to optimize it by yourself.

You may also ask, As we only found the minimum editing distance, how can we know the every step? In the example of modifying the article you mentioned earlier, only a editing distance is definitely not enough. You must know how to modify it.

Actually, it's very simple, just slightly modify the code and add additional information to the dp array:

// int[][] dp;
Node[][] dp;

class Node {
    int val;
    int choice;
    // 0 skip
    // 1 insert
    // 2 delete
    // 3 replace
}

The val attribute is the value of the previous dp array, and the choice attribute represents the operation. When making the best choice, record the operation and then infer the specific operation from the result.

Our final result is dp [m] [n], where val holds the minimum edit distance, and choice holds the last operation, such as the insert operation, then you can move one space to the left:

Repeating this process, you can return to the starting point dp [0] [0] step by step to form a path. Editing according to the operations on this path is the best solution.

The above is the entire content of the edit distance algorithm.

Last updated