regular-expression

::: info Prerequisite Knowledge

Before reading this article, you need to study:

:::

Regular expressions are a very powerful tool. In this article, we will look at the basic idea behind regular expressions. LeetCode problem 10: Regular Expression Matchingarrow-up-right asks us to implement a simple regular matching algorithm, including the "." wildcard and the "*" wildcard.

These two wildcards are used most often. The dot "." can match any single character. The star "*" means the character before it can repeat any number of times (including 0 times).

For example, the pattern ".a*b" can match the text "zaaab" or "cb". The pattern "a..b" matches "amnb". The pattern ".*" can match any text.

The problem gives us two strings s and p. s is the text, and p is the pattern. You need to check if the pattern p can match the text s. You can assume the pattern only contains lowercase letters and the two wildcards above. The pattern will always be valid; there won't be cases like *a or b**.

The function signature is as follows:

boolean isMatch(string s, string p);

What is the hard part of this regular expression we need to implement?

The dot wildcard is easy. Any character in s can match a . in the pattern. The tricky part is the star wildcard. When you see "*", the character before it can be used any number of times: repeated once, many times, or not at all. What should we do?

The answer is simple: try all possible cases. If any case matches, then p matches s. Whenever we need to try all cases for two strings, we should think about using dynamic programming.

1. Idea Analysis

Let’s think about how s and p match. We use two pointers i and j to move along s and p. If both pointers reach the end, it means the match is successful. Otherwise, the match fails.

If we ignore the "*" wildcard, when matching characters s[i] and p[j], all we need to do is check if they match:

Now, if we add the "*" wildcard, things get a bit more complex, but we just need to consider each case separately.

When p[j + 1] is "*", let’s look at the cases:

  1. If s[i] == p[j], there are two possibilities:

1.1 p[j] might match multiple characters. For example, s = "aaa", p = "a*", then p[0] matches all three "a".

1.2 p[j] might match 0 characters. For example, s = "aa", p = "a*aa". Since the rest of the pattern can match s, p[0] can match 0 times.

  1. If s[i] != p[j], there is only one case:

p[j] can only match 0 times, and we see if the next part can match s[i]. For example, s = "aa", p = "b*aa", p[0] can only match 0 times.

So, we can update the code for the "*" wildcard as follows:

The idea is clear. But when we see the "*" wildcard, should we match 0 times or more? How many times?

This is a "choice" problem. We need to try all possible choices to get the answer. The core of dynamic programming is "state" and "choice". The "state" is the position of the two pointers i and j. The "choice" is how many times p[j] matches a character.

2. Dynamic Programming Solution

Based on the "state", we can define a dp function:

The definition of the dp function is as follows:

If dp(s, i, p, j) = true, it means s[i..] can match p[j..]. If dp(s, i, p, j) = false, it means s[i..] cannot match p[j..].

Based on this definition, the answer we want is the result of dp when i = 0 and j = 0. So we use the dp function like this:

We can write the main logic of the dp function based on the previous code:

According to the definition of the dp function, these cases are easy to explain:

1.1 Wildcard matches 0 or more times

Add 2 to j and keep i the same. This means we skip p[j] and the wildcard after it, which is the case where the wildcard matches 0 times.

Even if s[i] == p[j], this can still happen, as shown below:

Add 1 to i and keep j the same. This means p[j] matches s[i], but p[j] can match more, which is the case where the wildcard matches multiple times:

If either case leads to a match, we return true, so we use "or" to combine both.

1.2 Normal matching once

This is the case without a * wildcard. If s[i] == p[j], just add 1 to both i and j:

2.1 Wildcard matches 0 times

Similar to case 1.1, add 2 to j and keep i the same:

2.2 If there is no * wildcard and cannot match, it means match failed

Looking at the pictures makes it easy to understand. Now, let's think about the base cases of the dp function:

One base case is when j == p.length(). According to the definition of the dp function, it means the pattern string p has been matched. We need to see if the text string s has also been matched. If s is also matched, then it's a success:

Another base case is when i == s.length(). According to the dp definition, this means the text string s has been completely matched. Can we just check if p is also matched?

This is not correct. At this time, we cannot just check if j is equal to p.length(). As long as p[j..] can match an empty string, it counts as a successful match. For example, for s = "a", p = "ab*c*", when i reaches the end of s, j hasn't reached the end of p, but p can still match s.

So we can write code like this:

With this idea, we can write the complete code:

The code uses a hash table called memo to remove overlapping subproblems. How to spot overlapping subproblems? You can check Dynamic Programming Q&Aarrow-up-right for techniques. Let's look at the recursive framework of the regex algorithm:

If you want to get from dp(s, i, p, j) to dp(s, i+2, p, j+2), there are at least two paths: 1 -> 2 -> 2 and 3 -> 3. This means the state (i+2, j+2) will be calculated more than once. This shows there are overlapping subproblems, so we need a memo to remove them and make the code more efficient.

The time complexity of dynamic programming is "number of states" * "time per recursion". In this problem, the number of states is the combinations of i and j, which is M * N (M is the length of s, N is the length of p). The dp function has no loop (except for the base case, which doesn't happen often), so each recursion takes constant time. Multiply them, and the total time complexity is $O(MN)$.

The space complexity is just the size of the memo, which is $O(MN)$.

Last updated