regular-expression

::: info Prerequisite Knowledge

Before reading this article, you need to study:

:::

Regular expressions are a very powerful tool. In this article, we will look at the basic idea behind regular expressions. LeetCode problem 10: Regular Expression Matching asks us to implement a simple regular matching algorithm, including the "." wildcard and the "*" wildcard.

These two wildcards are used most often. The dot "." can match any single character. The star "*" means the character before it can repeat any number of times (including 0 times).

For example, the pattern ".a*b" can match the text "zaaab" or "cb". The pattern "a..b" matches "amnb". The pattern ".*" can match any text.

The problem gives us two strings s and p. s is the text, and p is the pattern. You need to check if the pattern p can match the text s. You can assume the pattern only contains lowercase letters and the two wildcards above. The pattern will always be valid; there won't be cases like *a or b**.

The function signature is as follows:

boolean isMatch(string s, string p);

What is the hard part of this regular expression we need to implement?

The dot wildcard is easy. Any character in s can match a . in the pattern. The tricky part is the star wildcard. When you see "*", the character before it can be used any number of times: repeated once, many times, or not at all. What should we do?

The answer is simple: try all possible cases. If any case matches, then p matches s. Whenever we need to try all cases for two strings, we should think about using dynamic programming.

1. Idea Analysis

Let’s think about how s and p match. We use two pointers i and j to move along s and p. If both pointers reach the end, it means the match is successful. Otherwise, the match fails.

If we ignore the "*" wildcard, when matching characters s[i] and p[j], all we need to do is check if they match:

boolean isMatch(String s, String p) {
    int i = 0, j = 0;
    while (i < s.length() && j < p.length()) {
        // the '.' wildcard is a versatile match
        if (s.charAt(i) == p.charAt(j) || p.charAt(j) == '.') {
            // match, continue to match s[i+1..] and p[j+1..]
            i++; j++;
        } else {
            // not a match
            return false;
        }
    }
    return i == j;
}

Now, if we add the "*" wildcard, things get a bit more complex, but we just need to consider each case separately.

When p[j + 1] is "*", let’s look at the cases:

If s[i] == p[j], there are two possibilities:

1.1 p[j] might match multiple characters. For example, s = "aaa", p = "a*", then p[0] matches all three "a".

1.2 p[j] might match 0 characters. For example, s = "aa", p = "a*aa". Since the rest of the pattern can match s, p[0] can match 0 times.

If s[i] != p[j], there is only one case:

p[j] can only match 0 times, and we see if the next part can match s[i]. For example, s = "aa", p = "b*aa", p[0] can only match 0 times.

So, we can update the code for the "*" wildcard as follows:

if (s.charAt(i) == p.charAt(j) || p.charAt(j) == '.') {
    // match
    if (j < p.length() - 1 && p.charAt(j + 1) == '*') {
        // there is a * wildcard, which can match 0 or more times
    } else {
        // no * wildcard, match exactly once
        i++; j++;
    }
} else {
    // do not match
    if (j < p.length() - 1 && p.charAt(j + 1) == '*') {
        // there is a * wildcard, can only match 0 times
    } else {
        // no * wildcard, the match cannot proceed
        return false;
    }
}

The idea is clear. But when we see the "*" wildcard, should we match 0 times or more? How many times?

This is a "choice" problem. We need to try all possible choices to get the answer. The core of dynamic programming is "state" and "choice". The "state" is the position of the two pointers i and j. The "choice" is how many times p[j] matches a character.

2. Dynamic Programming Solution

Based on the "state", we can define a dp function:

boolean dp(String s, int i, String p, int j);

The definition of the dp function is as follows:

If dp(s, i, p, j) = true, it means s[i..] can match p[j..]. If dp(s, i, p, j) = false, it means s[i..] cannot match p[j..].

Based on this definition, the answer we want is the result of dp when i = 0 and j = 0. So we use the dp function like this:

boolean isMatch(String s, String p) {
    // pointers i and j start moving from index 0
    return dp(s, 0, p, 0);
}

We can write the main logic of the dp function based on the previous code:

// core logic of the dp function (pseudocode)
// definition: the function returns whether s[i..] can match p[j..]
boolean dp(String s, int i, String p, int j) {
    if (s.charAt(i) == p.charAt(j) || p.charAt(j) == '.') {
        // match
        if (j < p.length() - 1 && p.charAt(j + 1) == '*') {
            // 1.1 wildcard matches 0 times or multiple times
            return dp(s, i, p, j + 2) || dp(s, i + 1, p, j);
        } else {
            // 1.2 regular match 1 time
            return dp(s, i + 1, p, j + 1);
        }
    } else {
        // no match
        if (j < p.length() - 1 && p.charAt(j + 1) == '*') {
            // 2.1 wildcard matches 0 times
            return dp(s, i, p, j + 2);
        } else {
            // 2.2 cannot continue matching
            return false;
        }
    }
}

According to the definition of the dp function, these cases are easy to explain:

1.1 Wildcard matches 0 or more times

Add 2 to j and keep i the same. This means we skip p[j] and the wildcard after it, which is the case where the wildcard matches 0 times.

Even if s[i] == p[j], this can still happen, as shown below:

Add 1 to i and keep j the same. This means p[j] matches s[i], but p[j] can match more, which is the case where the wildcard matches multiple times:

If either case leads to a match, we return true, so we use "or" to combine both.

1.2 Normal matching once

This is the case without a * wildcard. If s[i] == p[j], just add 1 to both i and j:

2.1 Wildcard matches 0 times

Similar to case 1.1, add 2 to j and keep i the same:

2.2 If there is no * wildcard and cannot match, it means match failed

Looking at the pictures makes it easy to understand. Now, let's think about the base cases of the dp function:

One base case is when j == p.length(). According to the definition of the dp function, it means the pattern string p has been matched. We need to see if the text string s has also been matched. If s is also matched, then it's a success:

if (j == p.length()) {
    return i == s.length();
}

Another base case is when i == s.length(). According to the dp definition, this means the text string s has been completely matched. Can we just check if p is also matched?

if (i == s.length()) {
    // Is this enough?
    return j == p.length();
}

This is not correct. At this time, we cannot just check if j is equal to p.length(). As long as p[j..] can match an empty string, it counts as a successful match. For example, for s = "a", p = "ab*c*", when i reaches the end of s, j hasn't reached the end of p, but p can still match s.

So we can write code like this:

int m = s.length(), n = p.length();

if (i == s.length()) {
    // if it can match an empty string, characters and * must appear in pairs
    if ((n - j) % 2 == 1) {
        return false;
    }
    // check if it is in the form of x*y*z*
    for (; j + 1 < p.length(); j += 2) {
        if (p.charAt(j + 1) != '*') {
            return false;
        }
    }
    return true;
}

With this idea, we can write the complete code:

class Solution {
    // memoization
    private int[][] memo;

    public boolean isMatch(String s, String p) {
        int m = s.length(), n = p.length();
        memo = new int[m][n];
        for (int[] row : memo) {
            Arrays.fill(row, -1);
        }
        // pointers i and j start moving from index 0
        return dp(s, 0, p, 0);
    }

    // calculate whether p[j..] matches s[i..]
    private boolean dp(String s, int i, String p, int j) {
        int m = s.length(), n = p.length();
        // base case
        if (j == n) {
            return i == m;
        }
        if (i == m) {
            if ((n - j) % 2 == 1) {
                return false;
            }
            for (; j + 1 < n; j += 2) {
                if (p.charAt(j + 1) != '*') {
                    return false;
                }
            }
            return true;
        }

        // check memo to avoid redundant calculations
        if (memo[i][j] != -1) {
            return memo[i][j] == 1;
        }

        boolean res = false;

        if (s.charAt(i) == p.charAt(j) || p.charAt(j) == '.') {
            if (j < n - 1 && p.charAt(j + 1) == '*') {
                res = dp(s, i, p, j + 2)
                        || dp(s, i + 1, p, j);
            } else {
                res = dp(s, i + 1, p, j + 1);
            }
        } else {
            if (j < n - 1 && p.charAt(j + 1) == '*') {
                res = dp(s, i, p, j + 2);
            } else {
                res = false;
            }
        }
        // store the current result in the memo
        memo[i][j] = res ? 1 : 0;
        return res;
    }
}

The code uses a hash table called memo to remove overlapping subproblems. How to spot overlapping subproblems? You can check Dynamic Programming Q&A for techniques. Let's look at the recursive framework of the regex algorithm:

boolean dp(String s, int i, String p, int j) {
    dp(s, i, p, j + 2);     // 1
    dp(s, i + 1, p, j);     // 2
    dp(s, i + 1, p, j + 1); // 3
    dp(s, i, p, j + 2);     // 4
}

If you want to get from dp(s, i, p, j) to dp(s, i+2, p, j+2), there are at least two paths: 1 -> 2 -> 2 and 3 -> 3. This means the state (i+2, j+2) will be calculated more than once. This shows there are overlapping subproblems, so we need a memo to remove them and make the code more efficient.

The time complexity of dynamic programming is "number of states" * "time per recursion". In this problem, the number of states is the combinations of i and j, which is M * N (M is the length of s, N is the length of p). The dp function has no loop (except for the base case, which doesn't happen often), so each recursion takes constant time. Multiply them, and the total time complexity is $O(MN)$.

The space complexity is just the size of the memo, which is $O(MN)$.

Previousoptimal-substructure Nextstate-compression

Last updated 3 days ago

hashtag1. Idea Analysis

hashtag2. Dynamic Programming Solution

1. Idea Analysis

2. Dynamic Programming Solution