NotePad.NET II - Find and Replace Inside a Document Using Regular Expressions

Introduction

In my fervent quest to improve upon Notepad with Notepad.NET, I've added a feature that lets you search using regular expressions. Regular expressions are very useful for matching patterns in your text that go beyond exact sequences of letters. The example in Figure 1 matches all words beginning with "acc". If you have never worked with regular expressions before, Table 1 below shows some of the regular expression symbols and their meanings.


Figure 1 - Notepad.NET's FindDialog using Regular Expressions

| Expression | Description | Example |
|---|---|---|
| . | Matches any single character except the newline character (wildcard) | c.t matches cat, cut, cit, cpt, c&t, etc.; c..t matches curt, cant |
| [ ] | Matches a single character from a set of characters | c[au]t matches only cat and cut |
| [ ]+ | Matches one or more characters from a set of characters | c[a]+t matches cat, caat, caaat, etc.; c[a-z]+t matches cat, cut, clout, cakajhknmht; [0-9]+ matches integer values; [0-9a-z]+ matches lowercase alphanumeric strings (note that for a set of one character, c[a]+t can also be written ca+t) |
| [ ]* | Matches zero or more characters from a set of characters | c[a]*t matches ct, cat, caat, caaat, etc.; c[a-z]*t matches ct, cat, cut, clout, creawtt (note that for a set of one character, c[a]*t can also be written ca*t) |
| [ ]? | Matches zero or one character from a set of characters; used for optional characters | c[a]?t matches ct, cat; c[a-z]?t matches ct, cat, cut, cot, crt, etc. (note that for a set of one character, c[a]?t can also be written ca?t) |
| \ | Escape character. Introduces special characters and turns regular expression symbols into literals | \n line feed; \d digit, for ASCII text the same as [0-9]; \w word character, for ASCII text the same as [0-9A-Za-z_]; \s whitespace character such as a space, tab, or line feed, for ASCII text the same as [ \t\n\r\f\v]; \. literal period, which must be escaped since otherwise it means the wildcard character; \- literal dash, which must be escaped inside a set since otherwise it means a range of characters |
| {n} | The preceding character or set must appear exactly n times | ca{3}t matches only caaat; c[a-z]{3}t matches cabct, caaat, coggt, cxyzt |
| {n,m} | The preceding character or set must appear between n and m times | ca{0,3}t matches only ct, cat, caat, caaat |
| ( ) | Groups consecutive characters so that regular expression operations apply to the whole group | c(at)+ matches cat, catat, catatatatat, etc.; c(at){2} matches only catat |
| ^ | Matches at the beginning of the string, or at the beginning of every line when RegexOptions.Multiline is set | ^T.+ matches every line beginning with the letter T, e.g. the line Thanks for the coffee! (as long as Thanks is at the beginning of a line) |
| $ | Matches at the end of the string, or at the end of every line when RegexOptions.Multiline is set | ^T.+P$ matches all lines beginning with T and ending with P |

Table 1 - Regular Expression Symbols, and how to use them.
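To try a few of the rows in Table 1, a small console sketch along the following lines may help. The class name Table1Demo and the helper MatchesWholeWord are made up for illustration; the helper wraps each pattern in ^ and $ so that only whole-word matches count.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Table1Demo
{
    // Hypothetical helper: true only when the entire word matches the pattern.
    // The pattern is wrapped in ^(?: ... )$ so partial matches do not count.
    public static bool MatchesWholeWord(string pattern, string word)
    {
        return Regex.IsMatch(word, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesWholeWord("c.t", "cut"));         // True
        Console.WriteLine(MatchesWholeWord("c[au]t", "cot"));      // False: o is not in the set
        Console.WriteLine(MatchesWholeWord("c[a-z]+t", "clout"));  // True
        Console.WriteLine(MatchesWholeWord("ca{0,3}t", "caaaat")); // False: four a's
        Console.WriteLine(MatchesWholeWord("c(at){2}", "catat"));  // True
    }
}
```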

So let's look at a few useful regular expressions.  Say we wanted to find all social security numbers in a text document matching the form

ddd-dd-dddd

We need a regular expression that only allows numeric digits  in groups of 3-2-4 including the hyphen.  Using table 1, we come up with the following expression:

[0-9]{3}\-[0-9]{2}\-[0-9]{4}

This expression matches only social security numbers written as three digits, a hyphen, two digits, a hyphen, and four digits (e.g. 111-11-1111).  To make the hyphens optional, we can place a question mark (?) after each hyphen:

[0-9]{3}\-?[0-9]{2}\-?[0-9]{4}

Now we can match either 111-11-1111 or 111111111 (and also forms that contain only one of the two hyphens).
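As a rough illustration, the optional-hyphen pattern could be run over a string like this. The class SsnDemo, the method FindAll, and the sample text are invented for the example and are not part of Notepad.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SsnDemo
{
    // The pattern from the article, with both hyphens optional.
    public const string Pattern = @"[0-9]{3}\-?[0-9]{2}\-?[0-9]{4}";

    // Hypothetical helper: collects every match found in the text.
    public static List<string> FindAll(string text)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        string text = "Old: 111-11-1111, new: 222334444, not one: 12-345.";
        foreach (string s in FindAll(text))
        {
            Console.WriteLine(s); // prints 111-11-1111, then 222334444
        }
    }
}
```

Because the pattern is not anchored, it would also match nine digits sitting inside a longer number; adding \b at each end is one way to tighten it.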

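Perhaps we want to look for all email addresses.  We know e-mail addresses take the form name@domain, with certain allowable characters on each side (including a period) and a short alphabetic suffix at the end.  Since the sets below list only uppercase letters, the expression is meant to be used with a case-insensitive search.  A regular expression for this form would be: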

[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}
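The following sketch shows how the e-mail pattern behaves with and without case-insensitive matching. The names EmailDemo and FindFirst, and the sample address, are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailDemo
{
    // The e-mail pattern from the article; its sets list uppercase letters only.
    public const string Pattern = @"[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}";

    // Hypothetical helper: returns the first address found, or null if none.
    public static string FindFirst(string text, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        Match m = Regex.Match(text, Pattern, options);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string text = "Write to jane.doe@example.com today.";
        Console.WriteLine(FindFirst(text, true));          // jane.doe@example.com
        Console.WriteLine(FindFirst(text, false) == null); // True: lowercase letters are not in the sets
    }
}
```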

In the next section we will explore the code that allows us to search a text box for regular expressions using the rich .NET framework.

The Design

The Find Dialog with regular expressions is implemented inside the FindDialog class shown in the UML diagram in Figure 2.  When Find Next is clicked inside the dialog (shown in Figure 1) and the regular expression option is checked, the regular expression code kicks in by calling the FindNext method.


Figure 2 - NotePad.NET reverse engineered from C# using the WithClass UML Tool

The Code

The FindNext method inside the FindDialog takes 4 parameters:  the search string we are looking for, whether or not the search string is case sensitive, the multiline text box control containing the entire text, and a boolean to indicate if the search string is a regular expression or not.  The code is broken down into an if-else statement for handling whether or not the string we are looking for is case sensitive.  Inside the if-else blocks we handle whether or not the search string is a regular expression or not.  For searching regular expressions we use the System.Text.RegularExpressions namespace.  This namespace contains the class Regex (which I guess stands for regular expression). Table 2 shows some of the members of this class to help you implement regular expression matching:

| Regex Member | Description |
|---|---|
| Regex(string pattern, RegexOptions options) | Constructor that creates a Regex object. Takes the regular expression to match and options that control how the pattern matching engine behaves (shown in Table 3). |
| Match(string input, int startLocation) | Finds the first occurrence of the pattern (the one passed to the constructor of the Regex object) in the input string, starting the search at startLocation. Returns a Match object that indicates the success, position, and length of the match. |
| IsMatch(string input) | Determines whether the regular expression matches anywhere in the input string. |

Table 2 - Regex Members
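A short console sketch of the three members in Table 2 might look like the following. It reuses the "acc" idea from Figure 1; the class RegexMembersDemo, the helper IndexOfMatch, and the sample sentence are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexMembersDemo
{
    // Hypothetical helper: index of the first match at or after startLocation, or -1.
    public static int IndexOfMatch(string pattern, string input, int startLocation)
    {
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
        Match match = regex.Match(input, startLocation);
        return match.Success ? match.Index : -1;
    }

    public static void Main()
    {
        string input = "An account of the accident.";
        Regex regex = new Regex("acc[a-z]*", RegexOptions.IgnoreCase);

        // IsMatch only answers yes or no.
        Console.WriteLine(regex.IsMatch(input)); // True

        // Match reports where the match is and how long it is.
        Match first = regex.Match(input, 0);
        Console.WriteLine(first.Value + " at " + first.Index + ", length " + first.Length);

        // Starting after the first match finds the next one.
        Console.WriteLine(IndexOfMatch("acc[a-z]*", input, first.Index + first.Length));
    }
}
```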

With the Regex class, you construct an instance by passing it the regular expression pattern you wish to match.  You can also pass a set of options (RegexOptions) into the constructor, such as whether or not to ignore case in your search.  Table 3 lists some of the useful RegexOptions:

| Option | Description |
|---|---|
| IgnoreCase | Ignores the case of the text being searched |
| Multiline | Changes the meaning of ^ and $ so that they match at the beginning and end of each line (lines being separated by a line feed), rather than only at the beginning and end of the whole string |
| Singleline | Changes the meaning of the dot (.) so that it matches every character, including the line feed. It does not change the meaning of ^ and $ |
| RightToLeft | Strings are searched from right to left instead of left to right |
| IgnorePatternWhitespace | Ignores unescaped white space in the pattern, and allows comments in the pattern starting with the pound sign (#) |

Table 3 - Some of the RegexOptions
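The effect of a few of these options can be checked with a sketch like the one below. The class RegexOptionsDemo, the helper CountMatches, and the sample text are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Hypothetical helper: number of matches of the pattern under the given options.
    public static int CountMatches(string pattern, string input, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    public static void Main()
    {
        string text = "Tea first\nThanks for the coffee!\ntoo late";

        // Without Multiline, ^ matches only at the start of the whole string.
        Console.WriteLine(CountMatches("^T.+", text, RegexOptions.None));      // 1
        // With Multiline, ^ matches at the start of every line.
        Console.WriteLine(CountMatches("^T.+", text, RegexOptions.Multiline)); // 2
        // Adding IgnoreCase also picks up the line starting with a lowercase t.
        Console.WriteLine(CountMatches("^T.+", text,
            RegexOptions.Multiline | RegexOptions.IgnoreCase));                // 3

        // Singleline lets the dot match a line feed.
        Console.WriteLine(CountMatches("a.+b", "a\nb", RegexOptions.None));       // 0
        Console.WriteLine(CountMatches("a.+b", "a\nb", RegexOptions.Singleline)); // 1
    }
}
```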

Once you've constructed the Regex object, it's time to use it to search inside the Notepad.NET document string.  By calling the Match method of the Regex object on the document text, starting at the current search position, we can find the next occurrence of the regular expression we entered into the find dialog.  If the match is a success, we extract the position and length of the matching string so we can select it inside the text box.
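Stripped of the user interface, the find-next idea can be sketched as a small class that remembers where the last match ended. RegexFinder is a hypothetical helper written for this illustration, not a class from Notepad.NET; unlike the listing, it also steps one character past an empty match so that repeated calls cannot get stuck.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps the search position between calls,
// the way _currentIndex does in the article's FindNext method.
public class RegexFinder
{
    private readonly Regex _regex;
    private int _currentIndex;

    public RegexFinder(string pattern)
    {
        _regex = new Regex(pattern, RegexOptions.Multiline);
    }

    // Finds the next match. Returns false when none is left and
    // resets the position to the start of the text.
    public bool FindNext(string text, out int index, out int length)
    {
        Match match = _currentIndex <= text.Length
            ? _regex.Match(text, _currentIndex)
            : Match.Empty;

        if (match.Success)
        {
            index = match.Index;
            length = match.Length;
            // Advance past the match; move at least one character for empty matches.
            _currentIndex = match.Index + Math.Max(match.Length, 1);
            return true;
        }

        index = -1;
        length = 0;
        _currentIndex = 0;
        return false;
    }
}

public static class RegexFinderDemo
{
    public static void Main()
    {
        var finder = new RegexFinder("c[au]t");
        int index, length;
        while (finder.FindNext("cat cut cot", out index, out length))
        {
            Console.WriteLine("match at " + index + ", length " + length);
        }
    }
}
```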

Note:  When useRegularExpression = false is passed into our FindNext method, we search for the literal string using the IndexOf method of the string class (for case-sensitive searches) and the IndexOf method of the System.Globalization.CompareInfo class (for case-insensitive searches).
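The two literal-search paths can be tried outside the dialog with a sketch such as this one. The class LiteralSearchDemo and the method FindLiteral are invented for the example, and it swaps in an ordinal comparison and the invariant culture where the listing uses the default string.IndexOf overload and the "en-us" culture.

```csharp
using System;
using System.Globalization;

public static class LiteralSearchDemo
{
    // Hypothetical helper: index of searchString at or after startIndex, or -1.
    public static int FindLiteral(string text, string searchString,
        int startIndex, bool caseSensitive)
    {
        if (caseSensitive)
        {
            // Assumption: ordinal comparison, chosen here for predictable results.
            return text.IndexOf(searchString, startIndex, StringComparison.Ordinal);
        }

        // Case-insensitive search through CompareInfo, as in the article,
        // but with the invariant culture instead of "en-us".
        return CultureInfo.InvariantCulture.CompareInfo.IndexOf(
            text, searchString, startIndex, CompareOptions.IgnoreCase);
    }

    public static void Main()
    {
        string text = "Hello World, hello again";
        Console.WriteLine(FindLiteral(text, "hello", 0, true));  // 13
        Console.WriteLine(FindLiteral(text, "hello", 0, false)); // 0
        Console.WriteLine(FindLiteral(text, "hello", 1, false)); // 13
    }
}
```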

Listing 1 - Finds the Next Search String inside the textbox

            // Requires: using System; using System.Globalization;
            //           using System.Text.RegularExpressions; using System.Windows.Forms;

            /// <summary>
            /// Find the next location of the string to search inside the text box
            /// </summary>
            /// <param name="searchString">the string to search for</param>
            /// <param name="caseSensitive">whether or not the search is case sensitive</param>
            /// <param name="txtControl">text box control to search</param>
            /// <param name="useRegularExpression">whether or not we are using regular expressions</param>
            /// <returns>true if a match was found and selected; otherwise false</returns>
            public static bool FindNext(string searchString, bool caseSensitive,
                  TextBox txtControl, bool useRegularExpression)
            {

                  // track the current search string
                  CurrentSearchString = searchString;
                  Regex regularExpression = null;

                  // get the length of the search string
                  int searchLength = searchString.Length;

                  // handle case sensitive strings separately
                  if (caseSensitive)
                  {
                        if (useRegularExpression)
                        {
                              // we are using regular expressions, create a Regex object
                              try
                              {
                                    // set the Multiline option so that ^ and $ behave the same
                                    // way as in the case-insensitive branch
                                    regularExpression = new Regex(searchString, RegexOptions.Multiline);
                              }
                              catch (ArgumentException)
                              {
                                    // the Regex constructor rejects a malformed pattern
                                    MessageBox.Show("Invalid Regular expression");
                                    return false;
                              }

                              // Now match the regular expression
                              Match match = regularExpression.Match(txtControl.Text, _currentIndex);

                              // if we successfully matched, get the index location of the match inside
                              // the textbox control and the length of the match
                              if (match.Success)
                              {
                                    _currentIndex = match.Index;
                                    searchLength = match.Length;
                              }
                              else
                              {
                                    // no match
                                    _currentIndex = -1;
                              }
                        }
                        else
                        {
                              // not a regular expression search, just match the literal string
                              _currentIndex = txtControl.Text.IndexOf(searchString, _currentIndex);
                        }
                  }

                  else
                  {
                        // this section is for case-insensitive searches
                        if (useRegularExpression)
                        {
                              try
                              {
                                    // set the ignore case option and Multiline option for regular expressions
                                    regularExpression = new Regex(searchString,
                                        RegexOptions.IgnoreCase | RegexOptions.Multiline);
                              }
                              catch (ArgumentException)
                              {
                                    // the Regex constructor rejects a malformed pattern
                                    MessageBox.Show("Invalid Regular expression");
                                    return false;
                              }

                              // Now match the regular expression
                              Match match = regularExpression.Match(txtControl.Text, _currentIndex);

                              // if we successfully matched, get the index location of the match inside
                              // the textbox control and the length of the match
                              if (match.Success)
                              {
                                    _currentIndex = match.Index;
                                    searchLength = match.Length;
                              }
                              else
                              {
                                    // no match
                                    _currentIndex = -1;
                              }
                        }
                        else
                        {
                              // this search is for non-regular expressions case-insensitive
                              CultureInfo culture = new CultureInfo("en-us");

                              _currentIndex = culture.CompareInfo.IndexOf(txtControl.Text, searchString,
                                  _currentIndex, System.Globalization.CompareOptions.IgnoreCase);
                        }
                  }

                  // if we found a match, select it in the multiline text box
                  if (_currentIndex >= 0)
                  {
                        // (note: this should be refactored, but is shown in one place for the sake of the
                        // article.)

                        // select the matching text
                        txtControl.SelectionStart = txtControl.Text.IndexOf("\n", _currentIndex) + 2;
                        txtControl.SelectionLength = 0;
                        txtControl.SelectionStart = _currentIndex;
                        txtControl.SelectionLength = searchLength;
                        _currentIndex += searchLength;   // advance past selection
                        txtControl.ScrollToCaret();      // scroll to selection
                  }
                  else
                  {
                        // no match, reached the end of the document
                        MessageBox.Show("Reached the end of the document.");
                        _currentIndex = 0;
                        return false;
                  }

                  return true;
            }

Conclusions

Originally commonplace on the UNIX platform, regular expressions are a useful tool for searching for text patterns in an editing tool.  Using Notepad.NET you can now perform the same intricate searches as in other, more complex ASCII editors.  Stay tuned for the next advanced feature in NotePad.NET...