Efficient String Matching Algorithm with Use of Wildcard Characters

Zoran Horvat
13y
109.5k
0
2

Article

Introduction

In this article we shall cover one common problem in textual data processing and that is how to match strings against patterns that may contain wildcard characters. We will allow two kinds of wildcards: first one matching any single character in the string and another one matching zero or more characters in the string. This is the most common case of string patterns, even more common than regular expressions, for example.

As will be demonstrated in the text below, the problem can be solved using regular expressions and that solution is straight forward. But unfortunately, it is not very efficient. Reason for that will be explained later, and it boils down to the fact that regular expressions are targeting much more complex transformations than just wildcard characters which causes performance loss.

We will develop a more efficient solution strictly for wildcard matching problem and then compare its performance to regular expressions applied to same test cases.

Mapping String Against Pattern with Wildcard Characters

If we suppose that only single character wildcard is allowed, then task of matching a string against such pattern is quite simple. In that case, string matches the pattern if both string and the pattern have the same length and every character in the string is either equal to the corresponding character in the pattern, or the corresponding character in the pattern is the wildcard. If wildcard is a question mark, then function which matches string against such pattern would look something like this:

bool IsSimpleMatch(string input, string pattern)
{

bool match = false;

if (input.Length == pattern.Length)
{

        match = true;
        for (int i = 0; i < input.Length; i++)
            if (pattern[i] != '?' && input[i] != pattern[i])
            {
                match = false;
                break;
            }

}

return match;

}

From this simplified case we can conclude that real problems come from the use of general wildcard, which can substitute any number of consecutive characters from the string, including zero characters.

Now let's take a closer look at how this wildcard can be processed. If we try with a naive approach, which is to find all possible substitutions for every occurrence of the wildcard in the pattern, then we would face an exponential growth in number of matching possibilities. So we need a better idea. Solution presented in this article will break down the pattern string into blocks, each block starting with a general wildcard and ranging either until the end of the pattern is reached or until next general wildcard character is reached. Only the first block in the pattern is allowed not to start with wildcard character. For example, if we are matching strings against pattern BA*NA*S, then we can recognize three parts of this pattern: BA, *NA and *S.

What we want to achieve now is to find all positions in the input string that can be reached at the end of every block in the pattern as shown on the following picture.

Algorithm with Use of Wildcard Characters

When algorithm completes execution, we will have to see if it succeeded to match end of the string with the end of the last block in the pattern. If so, then matching is successful; otherwise matching fails. Let's take a look at the following picture which shows matching the string BANANAS against the mentioned pattern.

Algorithm with Use of Wildcard Characters

Here we are gradually applying every part of the pattern and we are moving along the input string until the end of the input string is finally reached. In fact, this example has two solutions, as shown in the picture, but the algorithm would terminate when the first solution is found This is because we only need the information whether the string has been matched or not, rather than exact matching between wildcard characters and parts of the input string.

Implementation

In this section we will present implementation of the algorithm which operates as shown in the example on the previous picture. Function receives two strings: the first one being the input string and the second one being the pattern. It also receives two characters, one representing the single-character wildcard and the second representing the multiple-character wildcard. In this way we can cover different pattern systems, e.g. DOS-like which uses question mark and asterisk (? and *), or SQL-like which uses underscore and percent (_ and %) for the same purpose. We are not concerned what actual wildcard characters are as long as user who initiates the pattern matching knows which characters to use as wildcards.

The function which follows first removes the heading block of the pattern, the one not starting with general wildcard, because that block must exactly match beginning of the input string (which in turn might include one or more occurrences of single-character wildcard, but as we have already shown, that does not complicate the comparison like the general wildcards do). After that, we have a remainder of the input string and the remainder of the pattern string which must be matched. Pattern is now strictly made of blocks where each block contains only one wildcard as its first letter.

Further on, the function keeps record of pairs of integers, one integer indicating position in the pattern string and another integer indicating the matching position in the input string. Position in pattern string will always mark the ending of the block and position in the input string will be the corresponding position in the input string. Each such pair determines part from the beginning of the input string until the specified position, which has been successfully matched against part of the pattern string from the beginning until the specified position in the pattern string. So these pairs of integers keep record of partial solutions to the problem - expanding them further leads us towards the end of the input string, where we expecting to find the complete matching.

Now the finest part of the algorithm comes, and that is a matrix of Boolean values, which is used to keep record of all pairs of positions that have ever been visited. This ensures that the same pair will not be processed twice as long as the function runs. Final structure used in this function is the stack of position pairs, which simply keeps record of pairs of positions that have been reached and should be expanded further. Function terminates with either this stack becoming empty, which indicates failure, or when pair connecting end of the input string with end of the pattern string is pushed to the stack, which indicates a success.

In every iteration of the algorithm, a pair of positions is popped off the stack and then the following block from the pattern is matched against the rest of the input string, starting from the position indicated in that pair. This situation may produce multiple partial matches to be pushed to stack for further examination.

Here is the fully commented source code of the matching function:

/// <summary>
/// Tests whether specified string can be matched agains provided pattern string. Pattern may contain single- and multiple-replacing
/// wildcard characters.
/// </summary>
/// <param name="input">String which is matched against the pattern.</param>
/// <param name="pattern">Pattern against which string is matched.</param>
/// <param name="singleWildcard">Character which can be used to replace any single character in input string.</param>
/// <param name="multipleWildcard">Character which can be used to replace zero or more characters in input string.</param>
/// <returns>true if <paramref name="pat"/> matches the string <paramref name="str"/>; otherwise false.</returns>
bool IsMatch(string input, string pattern, char singleWildcard, char multipleWildcard)
{

    int[] inputPosStack = new int[(input.Length + 1) * (pattern.Length + 1)];   // Stack containing input positions that should be tested for further matching
    int[] patternPosStack = new int[inputPosStack.Length];                      // Stack containing pattern positions that should be tested for further matching
    int stackPos = -1;                                                          // Points to last occupied entry in stack; -1 indicates that stack is empty
    bool[,] pointTested = new bool[input.Length + 1, pattern.Length + 1];       // Each true value indicates that input position vs. pattern position has been tested

int inputPos = 0; // Position in input matched up to the first multiple wildcard in pattern
int patternPos = 0; // Position in pattern matched up to the first multiple wildcard in pattern

    // Match beginning of the string until first multiple wildcard in pattern
    while (inputPos < input.Length && patternPos < pattern.Length && pattern[patternPos] != multipleWildcard && (input[inputPos] == pattern[patternPos] || pattern[patternPos] == singleWildcard))
    {
        inputPos++;
        patternPos++;
    }

    // Push this position to stack if it points to end of pattern or to a general wildcard
    if (patternPos == pattern.Length || pattern[patternPos] == multipleWildcard)
    {
        pointTested[inputPos, patternPos] = true;
        inputPosStack[++stackPos] = inputPos;
        patternPosStack[stackPos] = patternPos;
    }
    bool matched = false;

    // Repeat matching until either string is matched against the pattern or no more parts remain on stack to test
    while (stackPos >= 0 && !matched)
    {

inputPos = inputPosStack[stackPos]; // Pop input and pattern positions from stack
patternPos = patternPosStack[stackPos--]; // Matching will succeed if rest of the input string matches rest of the pattern

        if (inputPos == input.Length && patternPos == pattern.Length)
            matched = true;     // Reached end of both pattern and input string, hence matching is successful
        else
        {
            // First character in next pattern block is guaranteed to be multiple wildcard
            // So skip it and search for all matches in value string until next multiple wildcard character is reached in pattern

for (int curInputStart = inputPos; curInputStart < input.Length; curInputStart++)
{

int curInputPos = curInputStart;
int curPatternPos = patternPos + 1;

                if (curPatternPos == pattern.Length)
                {   // Pattern ends with multiple wildcard, hence rest of the input string is matched with that character
                    curInputPos = input.Length;
                }
                else
                {

                    while (curInputPos < input.Length && curPatternPos < pattern.Length && pattern[curPatternPos] != multipleWildcard &&
                        (input[curInputPos] == pattern[curPatternPos] || pattern[curPatternPos] == singleWildcard))
                    {
                        curInputPos++;
                        curPatternPos++;
                    }

}

                // If we have reached next multiple wildcard character in pattern without breaking the matching sequence, then we have another candidate for full match
                // This candidate should be pushed to stack for further processing
                // At the same time, pair (input position, pattern position) will be marked as tested, so that it will not be pushed to stack later again
                if (((curPatternPos == pattern.Length && curInputPos == input.Length) || (curPatternPos < pattern.Length && pattern[curPatternPos] == multipleWildcard))
                    && !pointTested[curInputPos, curPatternPos])
                {
                    pointTested[curInputPos, curPatternPos] = true;
                    inputPosStack[++stackPos] = curInputPos;
                    patternPosStack[stackPos] = curPatternPos;
                }

}
}

}

return matched;

}

Comparison with Regular Expressions

In this section we will compare the matching function presented above with the function which performs the same task using regular expressions. Here is the function with which we will compare the solution:

bool IsMatchRegex(string str, string pat, char singleWildcard, char multipleWildcard)
{

string escapedSingle = Regex.Escape(new string(singleWildcard, 1));
string escapedMultiple = Regex.Escape(new string(multipleWildcard, 1));

    pat = Regex.Escape(pat);
    pat = pat.Replace(escapedSingle, ".");
    pat = "^" + pat.Replace(escapedMultiple, ".*") + "$";

Regex reg = new Regex(pat);

return reg.IsMatch(str);

}

We have run both of the functions on same examples multiple times and measured average time spent to match the patterns. Here is the output of the program which has performed the comparison using the two functions:

Input string	Pattern	Executions count	IsMatch time (microsec/call)	IsMatchRegex time (microsec/call)
Something	S*eth??g	1,000,000	0.511	10.8
Something	*	1,000,000	0.375	6.8
A very long long long stringggggggg	A ?string	1,000,000	1.36	12.7
Reg: Performance issue when using WebSphere MQ 7.1 ,Window server 2008 R2 and java 1.6.0_21	Reg: Performance issue when using ,Window server ???? R? and java .._*	100,000	8.63	28.5
Reg: Performance issue when using WebSphere MQ 7.1 ,Window server 2008 R2 and java 1.6.0_21	Reg: Performance* and java 1.6.0_21	100,000	6.07	16.0
Reg: Performance issue when using WebSphere MQ 7.1 ,Window server 2008 R2 and java 1.6.0_21	Reg: Performance issue when using ,Window server ???? R? and java .._	100,000	7.67	29.1

Last row in the table shows example with unsuccessful matching, and all other examples are successfully matched. This table shows that custom made function operates from 2.5 up to 20 times faster than the function which relies on regular expressions.

This huge difference can be explained by pointing out that regular expressions are much more general solution, with complex object model underneath, so when applied to a simple wildcard matching problem their operation seems to be needlessly complex. All that power is not used when only simple wildcards are allowed in the pattern, but the result of applying such a complex solution is a serious loss of performance.

Please note that custom function performance depends only on number of general wildcard characters that appear in the pattern. As it can be seen from the test data, the smaller the strings are and the smaller the number of general wildcard characters is, the difference between performance of regular expressions and custom solution becomes greater and greater.

Conclusion

In this article we have shown that strings can be matched against patterns with wildcards using a relatively simple function which is much more efficient than regular expressions. Using this method can be justified in applications that heavily rely on string pattern matching. For example, application which iterates through a gigabyte-sized text files (e.g. log files), matching every line against number of patterns, might spend around a minute per pass, depending on patterns applied, size of the file and disk speed. Such application would certainly benefit from custom pattern matching solution if that would reduce the execution time below ten seconds rather than a minute.