Introduction
In previous article titled "Efficient String Matching Algorithm with Use of
Wildcard Characters" (http://www.c-sharpcorner.com/UploadFile/b81385/8832/)
we have presented one method to compare strings against patterns which contain
wildcard characters. In this article we are presenting classes which can be used
to formalize the string comparison. Applications can offer several comparison
methods and then let the caller decide which one to use in every call to a
function. Classes shown in this article can help build such functionality almost
without effort.
Please note that latest documentation and source code for this library is available at address http://www.sysexpand.com/?path=net/sysexpand/text/strmatch/manual/intro. All subsequent changes and enhancements will be performed at this address rather than in the article.
Problem Statement
It is a common requirement in application programming that application should at
some point test whether a string matches certain pattern or to extract members
of a set of strings which match the pattern. In most of the cases in practice
pattern matching can be defined as one of the three types:
- Exact - string must exactly match the pattern,
- Matching with pattern which contains wildcards - pattern may contain wildcards that replace one or multiple characters from the string, or
- Matching the regular expression - string is required to match the given regular expression.
What often arises as the problem when
developing such applications is that matching method is not determined in
advance, but it becomes known only at run time, typically being decided by the
application user.
This article presents ready to use classes that implement these three string
matching methods and their combinations, so that matching method can be changed
dynamically at run time without changing the rest of the application.
In previous article named Efficient Algorithm for String Matching with Use of
Wildcard Characters we have already explained a method applied to match strings
against patterns that contain wildcards. Function presented in that article will
be used in implementation of the corresponding class in this article.
Design
Classes that will be used in string matching will all implement interface named
IStringMatcher, which is declared like this:
public
interface
IStringMatcher
{
string Pattern {
get; set; }
bool IsMatch(string
value);
}
All types that implement this interface will expose a read-write property
Pattern. This property is writeable so that one pattern matching object can be
used on multiple patterns and multiple input strings during its lifetime. Making
this property writeable allows application to create only one instance of
appropriate class which implements IStringMatcher interface and then to use it
for all matching. In other words, every class implementing IStringMatcher
interface only captures the matching algorithm, rather than a particular pattern
against which strings are matched using that algorithm. The second member of the
interface is the IsMatch method, which is used to test whether the contained
pattern matches the string passed as the parameter.
We are providing three basic implementations of this interface, named
ExactMatcher, WildcardMatcher and RegexMatcher. As their names imply, these
classes implement the three matching methods listed in the introduction. In
addition to members defined by the IStringMatcher interface, WildcardMatcher
class also provides properties that can be used to set wildcard characters used
in the pattern, because these characters may vary across different systems.
There is one more class provided, named CombinedStringMatcher, which also
implements IStringMatcher interface. Instances of this class can contain
multiple instances of IStringMatcher and its IsMatch method returns true if at
least one contained matcher's IsMatch method returns true. Each of the contained
matchers can generally be of different class and can have its own pattern
different than the others. In this way we can ask whether a given string can be
matched against any of the patterns given.
Implementation
In this section we are providing full source code of all utility classes that
implement IStringMatcher interface. For explanations on the way in which
WildcardMatcher implements IsMatch function, please refer to article
Efficient
Algorithm for String Matching with Use of Wildcard Characters.
using
System.Text.RegularExpressions;
using
System.Collections.Generic;
using
System.Text;
namespace
SysExpand.Text.StringMatching
{
public class
ExactMatcher: IStringMatcher
{
public ExactMatcher() { }
public ExactMatcher(string
pattern) { _pattern = pattern; }
public string
Pattern
{
Get {
return _pattern ?? string.Empty; }
Set { _pattern =
value; }
}
public bool
IsMatch(string value)
{
return Pattern == (value ??
string.Empty);
}
private string
_pattern;
}
public class
WildcardMatcher: IStringMatcher
{
public WildcardMatcher():
this(null, DefaultSingleWildcard, DefaultMultipleWildcard) { }
public WildcardMatcher(string
pattern): this(pattern, DefaultSingleWildcard,
DefaultMultipleWildcard) { }
public WildcardMatcher(string
pattern, char singleWildcard,
char multipleWildcard)
{
_pattern = pattern;
_singleWildcard = singleWildcard;
_multipleWildcard = multipleWildcard;
}
public string
Pattern
{
Get {
return _pattern ?? string.Empty; }
Set { _pattern =
value; }
}
public bool
IsMatch(string value)
{
int[] inputPosStack =
new int[(value.Length + 1) * (Pattern.Length
+ 1)]; // Stack containing input positions that
should be tested for further matching
int[] patternPosStack =
new int[inputPosStack.Length];
// Stack containing pattern positions that should be
tested for further matching
int stackPos =
-1;
// Points to last occupied entry in stack; -1
indicates that stack is empty
bool[,] pointTested =
new bool[value.Length
+ 1, Pattern.Length + 1]; // Each true value
indicates that input position vs. pattern position has been tested
int inputPos = 0;
// Position in input matched up to the first multiple
wildcard in pattern
int patternPos = 0;
// Position in pattern matched up to
the first multiple wildcard in pattern
if (_pattern == null)
_pattern = string.Empty;
// Match beginning of the string until first multiple
wildcard in pattern
while (inputPos <
value.Length && patternPos < Pattern.Length && Pattern[patternPos] !=
MultipleWildcard && (value[inputPos] == Pattern[patternPos] ||
Pattern[patternPos] == SingleWildcard))
{
inputPos++;
patternPos++;
}
// Push this position to stack if it points to end of
pattern or to a general wildcard character
if (patternPos == _pattern.Length
|| _pattern[patternPos] == _multipleWildcard)
{
pointTested[inputPos, patternPos] =
true;
inputPosStack[++stackPos] = inputPos;
patternPosStack[stackPos] = patternPos;
}
bool matched = false;
// Repeat matching until either string is matched
against the pattern or no more parts remain on stack to test
while (stackPos >= 0 &&
!matched)
{
inputPos =
inputPosStack[stackPos]; // Pop input and
pattern positions from stack
patternPos = patternPosStack[stackPos--];
// Matching will succeed if rest of the input string
matches rest of the pattern
if (inputPos == value.Length &&
patternPos == Pattern.Length)
matched = true;
// Reached end of both pattern and input string, hence
matching is successful
else
{
// First character in next pattern
block is guaranteed to be multiple wildcard
// So
skip it and search for all matches in value string until next multiple wildcard
character is reached in pattern
for (int
curInputStart = inputPos; curInputStart < value.Length; curInputStart++)
{
int curInputPos = curInputStart;
int curPatternPos = patternPos + 1;
if (curPatternPos == Pattern.Length)
{ // Pattern ends with
multiple wildcard, hence rest of the input string is matched with that character
curInputPos = value.Length;
}
else
{
while (curInputPos < value.Length &&
curPatternPos < Pattern.Length && Pattern[curPatternPos] != MultipleWildcard &&
(value[curInputPos] == Pattern[curPatternPos] ||
Pattern[curPatternPos] == SingleWildcard))
{
curInputPos++;
curPatternPos++;
}
}
// If we have reached next multiple wildcard character in pattern without
breaking the matching sequence, then we have another candidate for full match
// This candidate should be pushed to stack for further processing
// At the same time, pair (input position, pattern position) will be
marked as tested, so that it will not be pushed to stack later again
if (((curPatternPos == Pattern.Length && curInputPos ==
value.Length) || (curPatternPos < Pattern.Length && Pattern[curPatternPos] ==
MultipleWildcard))
&& !pointTested[curInputPos, curPatternPos])
{
pointTested[curInputPos, curPatternPos] =
true;
inputPosStack[++stackPos] = curInputPos;
patternPosStack[stackPos] = curPatternPos;
}
}
}
}
return matched;
}
public char
SingleWildcard
{
Get {
return _singleWildcard; }
Set { _singleWildcard =
value; }
}
public char
MultipleWildcard
{
Get {
return _multipleWildcard; }
Set { _multipleWildcard =
value; }
}
///
</summary>
private
string _pattern;
private char
_singleWildcard;
private char
_multipleWildcard;
public const
char DefaultSingleWildcard =
'?';
public const
char DefaultMultipleWildcard =
'*';
}
public class
RegexMatcher: IStringMatcher
{
public RegexMatcher() { }
public RegexMatcher(string
pattern) { _pattern = pattern; }
public string
Pattern
{
Get {
return _pattern ?? string.Empty; }
Set { _pattern =
value; }
}
public bool
IsMatch(string value)
{
if (string.IsNullOrEmpty(_pattern))
throw
new System.InvalidOperationException("Cannot
match strings before regular expression pattern is set.");
Regex reg =
new Regex(_pattern);
return reg.IsMatch(value);
}
private string
_pattern;
}
public class
CombinedStringMatcher: IStringMatcher
{
public CombinedStringMatcher() {
_matchers = new List<IStringMatcher>();
}
public CombinedStringMatcher(IStringMatcher[]
matchers): this() { AddRange(matchers); }
public void Add(IStringMatcher
matcher)
{
if (matcher == null)
throw
new System.ArgumentNullException("Cannot
add null reference string matcher.");
_matchers.Add(matcher);
}
public void
AddRange(IStringMatcher[] matchers)
{
if (matchers == null)
throw
new System.ArgumentNullException("Array
of matchers cannot be null.");
for (int i = 0;
i < matchers.Length; i++)
Add(matchers[i]); // May throw new
System.ArgumentNullException
}
public string
Pattern
{
get
{
StringBuilder sb =
new StringBuilder();
bool first =
true;
foreach (IStringMatcher
matcher in _matchers)
{
sb.AppendFormat("{0}({1})",
first ? string.Empty :
" OR ", matcher.Pattern);
first = false;
}
return sb.ToString();
}
set
{
throw
new System.InvalidOperationException("Cannot
set pattern on CombinedStringMatcher.");
}
}
public bool
IsMatch(string value)
{
bool
matched = false;
foreach (IStringMatcher
matcher in _matchers)
if (matcher.IsMatch(value))
{
matched = true;
break;
}
return matched;
}
List<IStringMatcher>
_matchers;
}
}
How to use
In this section we will give a short example which shows how to use string
matching classes. Code in the example lists all files from a given directory
that match one of several name patterns.
using
System;
using
SysExpand.Text.StringMatching;
using System.IO;
namespace
ConsoleApplication
{
class Program
{
static
void ListFiles(string
directory, IStringMatcher matcher)
{
DirectoryInfo dir =
new DirectoryInfo(directory);
FileInfo[] files = dir.GetFiles();
int count = 0;
long size = 0;
Console.WriteLine("{0}
{1}", matcher.GetType().Name, matcher.Pattern);
Console.WriteLine("----------------------------------");
for (int i = 0;
i < files.Length; i++)
if (matcher.IsMatch(files[i].Name.ToLower()))
{
Console.WriteLine(files[i].Name);
count++;
size += files[i].Length;
}
Console.WriteLine();
Console.WriteLine("{0}
file(s) ({1} total)", count, files.Length);
Console.WriteLine("{0}
byte(s)", size);
Console.WriteLine("----------------------------------");
Console.WriteLine();
}
static void
Main(string[] args)
{
string path =
Environment.GetFolderPath(Environment.SpecialFolder.System);
IStringMatcher matcher =
new
CombinedStringMatcher(
new IStringMatcher[]
{
new
ExactMatcher("notepad.exe"),
new
WildcardMatcher("??ta*.*"),
new
RegexMatcher(".*\\.(com|ico)")
}
);
ListFiles(path,
new WildcardMatcher("sys*.exe"));
ListFiles(path, matcher);
Console.ReadLine();
}
}
}
This example relies on the ListFiles function which receives directory path and
string matcher. Function lists all files from the directory that are matched
with specified matcher. Note that matching algorithm is not known to this
function due to encapsulation provided by IStringMatcher interface.
The fact that ListFiles operates on any string matching algorithm is then used
to list files in two distinct ways: first all executables having name starting
with "sys" are listed; in second path complex pattern is created so that
notepad.exe, files having "ta" on positions 2 and 3 in the name and all *.com,
*.ico files are listed.
Output of the program may look like this:
WildcardMatcher sys*.exe
----------------------------------
syskey.exe
systeminfo.exe
SystemPropertiesAdvanced.exe
SystemPropertiesComputerName.exe
SystemPropertiesDataExecutionPrevention.exe
SystemPropertiesHardware.exe
SystemPropertiesPerformance.exe
SystemPropertiesProtection.exe
SystemPropertiesRemote.exe
systray.exe
10 file(s) (2281 total)
686080 byte(s)
----------------------------------
CombinedStringMatcher (notepad.exe) OR (??ta*.*) OR (.*\.(com|ico))
----------------------------------
chcp.com
dataclen.dll
diskcomp.com
diskcopy.com
format.com
mode.com
more.com
mstask.dll
netapi32.dll
notepad.exe
PerfCenterCpl.ico
tree.com
12 file(s) (2281 total)
715328 byte(s)
----------------------------------
Conclusion
In this article we have provided full solution to strings matching problem when
application must be able to change the string matching rules at run time. Using
these classes is quite simple. They allow application to be coded in terms of
IStringMatcher interface on all places where string matching functionality is
required. Particular implementation of string matcher can then be selected at
run time without affecting the code which relies on string matching outcome.
Once again, please refer to the library home page http://www.sysexpand.com/?path=net/sysexpand/text/strmatch/manual/intro at which latest version of code and documentation for this library are available.