Introduction
In this article we shall cover one common problem in textual data processing and
that is how to match strings against patterns that may contain wildcard
characters. We will allow two kinds of wildcards: first one matching any single
character in the string and another one matching zero or more characters in the
string. This is the most common case of string patterns, even more common than
regular expressions, for example.
As will be demonstrated in the text below, the problem can be solved using
regular expressions and that solution is straight forward. But unfortunately, it
is not very efficient. Reason for that will be explained later, and it boils
down to the fact that regular expressions are targeting much more complex
transformations than just wildcard characters which causes performance loss.
We will develop a more efficient solution strictly for wildcard matching problem
and then compare its performance to regular expressions applied to same test
cases.
Mapping String Against Pattern with Wildcard Characters
If we suppose that only single character wildcard is allowed, then task of
matching a string against such pattern is quite simple. In that case, string
matches the pattern if both string and the pattern have the same length and
every character in the string is either equal to the corresponding character in
the pattern, or the corresponding character in the pattern is the wildcard. If
wildcard is a question mark, then function which matches string against such
pattern would look something like this:
bool
IsSimpleMatch(string input,
string pattern)
{
bool match = false;
if (input.Length == pattern.Length)
{
match =
true;
for (int
i = 0; i < input.Length; i++)
if (pattern[i] !=
'?' && input[i] != pattern[i])
{
match = false;
break;
}
}
return match;
}
From this simplified case we can conclude that real problems come from the use
of general wildcard, which can substitute any number of consecutive characters
from the string, including zero characters.
Now let's take a closer look at how this wildcard can be processed. If we try
with a naive approach, which is to find all possible substitutions for every
occurrence of the wildcard in the pattern, then we would face an exponential
growth in number of matching possibilities. So we need a better idea. Solution
presented in this article will break down the pattern string into blocks, each
block starting with a general wildcard and ranging either until the end of the
pattern is reached or until next general wildcard character is reached. Only the
first block in the pattern is allowed not to start with wildcard character. For
example, if we are matching strings against pattern BA*NA*S, then we can
recognize three parts of this pattern: BA, *NA and *S.
What we want to achieve now is to find all positions in the input string that
can be reached at the end of every block in the pattern as shown on the
following picture.
When algorithm completes execution, we will have to see if it succeeded to match
end of the string with the end of the last block in the pattern. If so, then
matching is successful; otherwise matching fails. Let's take a look at the
following picture which shows matching the string BANANAS against the mentioned
pattern.
Here we are gradually applying every part of the pattern and we are moving along
the input string until the end of the input string is finally reached. In fact,
this example has two solutions, as shown in the picture, but the algorithm would
terminate when the first solution is found This is because we only need the
information whether the string has been matched or not, rather than exact
matching between wildcard characters and parts of the input string.
Implementation
In this section we will present implementation of the algorithm which operates
as shown in the example on the previous picture. Function receives two strings:
the first one being the input string and the second one being the pattern. It
also receives two characters, one representing the single-character wildcard and
the second representing the multiple-character wildcard. In this way we can
cover different pattern systems, e.g. DOS-like which uses question mark and
asterisk (? and *), or SQL-like which uses underscore and percent (_ and %) for
the same purpose. We are not concerned what actual wildcard characters are as
long as user who initiates the pattern matching knows which characters to use as
wildcards.
The function which follows first removes the heading block of the pattern, the
one not starting with general wildcard, because that block must exactly match
beginning of the input string (which in turn might include one or more
occurrences of single-character wildcard, but as we have already shown, that
does not complicate the comparison like the general wildcards do). After that,
we have a remainder of the input string and the remainder of the pattern string
which must be matched. Pattern is now strictly made of blocks where each block
contains only one wildcard as its first letter.
Further on, the function keeps record of pairs of integers, one integer
indicating position in the pattern string and another integer indicating the
matching position in the input string. Position in pattern string will always
mark the ending of the block and position in the input string will be the
corresponding position in the input string. Each such pair determines part from
the beginning of the input string until the specified position, which has been
successfully matched against part of the pattern string from the beginning until
the specified position in the pattern string. So these pairs of integers keep
record of partial solutions to the problem - expanding them further leads us
towards the end of the input string, where we expecting to find the complete
matching.
Now the finest part of the algorithm comes, and that is a matrix of Boolean
values, which is used to keep record of all pairs of positions that have ever
been visited. This ensures that the same pair will not be processed twice as
long as the function runs. Final structure used in this function is the stack of
position pairs, which simply keeps record of pairs of positions that have been
reached and should be expanded further. Function terminates with either this
stack becoming empty, which indicates failure, or when pair connecting end of
the input string with end of the pattern string is pushed to the stack, which
indicates a success.
In every iteration of the algorithm, a pair of positions is popped off the stack
and then the following block from the pattern is matched against the rest of the
input string, starting from the position indicated in that pair. This situation
may produce multiple partial matches to be pushed to stack for further
examination.
Here is the fully commented source code of the matching function:
///
<summary>
///
Tests whether specified string can be matched agains provided pattern string.
Pattern may contain single- and multiple-replacing
///
wildcard characters.
///
</summary>
///
<param
name="input">String
which is matched against the pattern.</param>
///
<param
name="pattern">Pattern
against which string is matched.</param>
///
<param
name="singleWildcard">Character
which can be used to replace any single character in input string.</param>
///
<param
name="multipleWildcard">Character
which can be used to replace zero or more characters in input string.</param>
///
<returns>true
if
<paramref name="pat"/>
matches the string
<paramref name="str"/>;
otherwise false.</returns>
bool IsMatch(string
input, string pattern,
char singleWildcard, char
multipleWildcard)
{
int[] inputPosStack = new
int[(input.Length + 1) * (pattern.Length +
1)]; // Stack containing input positions that should
be tested for further matching
int[] patternPosStack =
new int[inputPosStack.Length];
// Stack containing pattern positions that should be
tested for further matching
int stackPos =
-1;
// Points to last occupied entry in stack; -1
indicates that stack is empty
bool[,] pointTested =
new bool[input.Length
+ 1, pattern.Length + 1];
// Each true value indicates that input position vs. pattern position has been
tested
int inputPos = 0; // Position in input
matched up to the first multiple wildcard in pattern
int patternPos = 0;
// Position in pattern matched up to
the first multiple wildcard in pattern
// Match beginning of the string until first multiple wildcard in pattern
while (inputPos < input.Length &&
patternPos < pattern.Length && pattern[patternPos] != multipleWildcard && (input[inputPos]
== pattern[patternPos] || pattern[patternPos] == singleWildcard))
{
inputPos++;
patternPos++;
}
// Push this position to stack if it points to end of pattern or to a general
wildcard
if (patternPos == pattern.Length ||
pattern[patternPos] == multipleWildcard)
{
pointTested[inputPos, patternPos] = true;
inputPosStack[++stackPos] = inputPos;
patternPosStack[stackPos] = patternPos;
}
bool matched =
false;
// Repeat matching until either string is matched against the pattern or no more
parts remain on stack to test
while (stackPos >= 0 && !matched)
{
inputPos =
inputPosStack[stackPos]; // Pop input and
pattern positions from stack
patternPos = patternPosStack[stackPos--];
// Matching will succeed if rest of the
input string matches rest of the pattern
if (inputPos == input.Length && patternPos ==
pattern.Length)
matched = true;
// Reached end of both pattern and input string, hence
matching is successful
else
{
// First character in next pattern block
is guaranteed to be multiple wildcard
// So skip it
and search for all matches in value string until next multiple wildcard
character is reached in pattern
for (int
curInputStart = inputPos; curInputStart < input.Length; curInputStart++)
{
int curInputPos = curInputStart;
int curPatternPos = patternPos +
1;
if (curPatternPos == pattern.Length)
{ // Pattern ends with multiple
wildcard, hence rest of the input string is matched with that character
curInputPos = input.Length;
}
else
{
while (curInputPos < input.Length &&
curPatternPos < pattern.Length && pattern[curPatternPos] != multipleWildcard &&
(input[curInputPos] == pattern[curPatternPos] ||
pattern[curPatternPos] == singleWildcard))
{
curInputPos++;
curPatternPos++;
}
}
// If we have reached next multiple wildcard character
in pattern without breaking the matching sequence, then we have another
candidate for full match
// This candidate should be
pushed to stack for further processing
// At the same time, pair
(input position, pattern position) will be marked as tested, so that it will not
be pushed to stack later again
if (((curPatternPos ==
pattern.Length && curInputPos == input.Length) || (curPatternPos <
pattern.Length && pattern[curPatternPos] == multipleWildcard))
&& !pointTested[curInputPos, curPatternPos])
{
pointTested[curInputPos, curPatternPos] =
true;
inputPosStack[++stackPos] = curInputPos;
patternPosStack[stackPos] = curPatternPos;
}
}
}
}
return matched;
}
Comparison with Regular
Expressions
In this section we will compare the matching function presented above with the
function which performs the same task using regular expressions. Here is the
function with which we will compare the solution:
bool
IsMatchRegex(string str,
string pat, char
singleWildcard, char multipleWildcard)
{
string escapedSingle = Regex.Escape(new
string(singleWildcard, 1));
string escapedMultiple =
Regex.Escape(new
string(multipleWildcard, 1));
pat =
Regex.Escape(pat);
pat = pat.Replace(escapedSingle, ".");
pat = "^" + pat.Replace(escapedMultiple,
".*") + "$";
Regex reg = new
Regex(pat);
return reg.IsMatch(str);
}
We have run both of the functions on same examples multiple times and measured
average time spent to match the patterns. Here is the output of the program
which has performed the comparison using the two functions:
Input
string |
Pattern |
Executions
count |
IsMatch
time (microsec/call) |
IsMatchRegex time (microsec/call) |
Something |
S*eth??g |
1,000,000 |
0.511 |
10.8 |
Something |
* |
1,000,000 |
0.375 |
6.8 |
A very long long long stringggggggg |
A *?string* |
1,000,000 |
1.36 |
12.7 |
Reg: Performance issue when using
WebSphere MQ 7.1 ,Window server 2008 R2 and java 1.6.0_21 |
Reg: Performance issue when using
*,Window server ???? R? and java *.*.*_* |
100,000 |
8.63 |
28.5 |
Reg: Performance issue when using
WebSphere MQ 7.1 ,Window server 2008 R2 and java 1.6.0_21 |
Reg: Performance* and java 1.6.0_21 |
100,000 |
6.07 |
16.0 |
Reg: Performance issue when using
WebSphere MQ 7.1 ,Window server 2008 R2 and java 1.6.0_21 |
Reg: Performance issue when using
*,Window server ???? R? and java *.*.*_ |
100,000 |
7.67 |
29.1 |
Last row in the table shows example with
unsuccessful matching, and all other examples are successfully matched. This
table shows that custom made function operates from 2.5 up to 20 times faster
than the function which relies on regular expressions.
This huge difference can be explained by pointing out that regular expressions
are much more general solution, with complex object model underneath, so when
applied to a simple wildcard matching problem their operation seems to be
needlessly complex. All that power is not used when only simple wildcards are
allowed in the pattern, but the result of applying such a complex solution is a
serious loss of performance.
Please note that custom function performance depends only on number of general
wildcard characters that appear in the pattern. As it can be seen from the test
data, the smaller the strings are and the smaller the number of general wildcard
characters is, the difference between performance of regular expressions and
custom solution becomes greater and greater.
Conclusion
In this article we have shown that strings can be matched against patterns with
wildcards using a relatively simple function which is much more efficient than
regular expressions. Using this method can be justified in applications that
heavily rely on string pattern matching. For example, application which iterates
through a gigabyte-sized text files (e.g. log files), matching every line
against number of patterns, might spend around a minute per pass, depending on
patterns applied, size of the file and disk speed. Such application would
certainly benefit from custom pattern matching solution if that would reduce the
execution time below ten seconds rather than a minute.