.NET Regular Expressions Demystified - Part One

Definition
 
In very simple terms, a regular expression is a group of characters that defines a pattern, and we use that pattern to find the specific information we need in a piece of text.
 
Some of those characters have special meanings to the regular expression engine, which is built into the .NET Framework and exposed through the System.Text.RegularExpressions.Regex class.
 
Goal of Article
 
My goal in this article is to give you a basic understanding of regular expressions in a very short span of time. I will guide you so that you can create and use your own regular expressions in your .NET applications. By the end of the series, you should be able to write regular expressions that match a standard phone number, social security number, email address, postal code, and so on.
 
Prerequisites
 
You should have a basic understanding of the C# language and some basic concepts of OOP (object-oriented programming).
 
Let’s Jump into Regular Expressions
 
Before I start discussing the syntax and the regular expression engine, let me answer some questions that you might have in your mind.
 
 
 
If you are confused and think regular expressions are hard to learn, trust me: they are not that bad. In fact, they are quite easy to learn and use. Once you are up and running, I believe you will do more cool and fun things with Regex and make full use of the .NET regular expression engine.
 
What Regular Expressions Are
 
As I told you earlier, a regular expression is nothing but a string of characters, some of them with special meanings, that defines a pattern. We then use that pattern in our C# program to extract specific information from a large block of text.
 
Why We Need Regular Expressions
 
As .NET developers, we work on different types of applications, such as web, mobile, and desktop apps. One thing all of them have in common is that they take input from users, and users might intentionally or unintentionally enter the wrong input. It is our duty as developers to validate that input before we process it and store it in the database, because bad input might crash our applications. Regular expressions let us do that validation. This is only one reason; there are many others, because regular expressions give us a powerful and fast way to manipulate and parse text. As a bonus, the regular expression syntax is the same for all types of .NET applications.
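
As a minimal sketch of the validation idea above, assume a hypothetical rule that a postal code must be exactly five digits. The class name, the method name, and the rule itself are made up for illustration; the special characters used in the pattern are explained later in this article.

```csharp
using System;
using System.Text.RegularExpressions;

class InputValidationSketch
{
    // Hypothetical rule for illustration: a postal code is exactly five digits.
    // ^ and $ anchor the pattern to the start and end of the input.
    public static bool IsFiveDigitCode(string input)
    {
        return Regex.IsMatch(input, @"^\d\d\d\d\d$");
    }

    static void Main()
    {
        Console.WriteLine(IsFiveDigitCode("90210"));   // True
        Console.WriteLine(IsFiveDigitCode("9021a"));   // False: contains a letter
        Console.WriteLine(IsFiveDigitCode("902101"));  // False: six digits
    }
}
```

Rejecting the input before it reaches the database is the whole point here; a real application would likely need stricter or locale-specific rules.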
 
 
Uses of Regular Expressions
 
There are many practical uses of regular expressions, but let me tell you some common ones, which are:
  • Regular expressions can be used to validate and manipulate user input.
  • Regular expressions can be used to replace, remove, and pull out values from text input.
  • You can use regular expressions to pull specific data out of HTML documents and store it in the database.
  • Regular expressions can find specific words or sentences in a large document, so you don't have to read the whole document.
The following example will give you a basic visual understanding of regular expressions. This actually happens when your Regex exactly returns what you want.
 
 
Now Let’s Start Learning and Practicing
 
The best way to learn anything in the world is to start practicing and getting your hands dirty with it before you completely learn it, and in the end, there is always something more to learn.
 
How Regular Expressions Work
 
Regular expressions process text-based data through the regular expression engine, which is built into the .NET Framework and represented by System.Text.RegularExpressions.Regex.
 
The regular expression engine needs only two things to process the text.
  1. The regular expression pattern that you defined to find text. (Don’t worry, later in this article, we will learn the syntax of the regular expression).
     
  2. The second thing is the input text that we need to parse.
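
The two inputs listed above can be sketched in a few lines. This is only an illustration: the class name, the helper method, and the sample text are assumptions, and the static Regex.IsMatch method is used here as a shortcut for handing both inputs to the engine.

```csharp
using System;
using System.Text.RegularExpressions;

class EngineInputsSketch
{
    // Hands the engine its two inputs: the text to search and the pattern to look for.
    public static bool ContainsPattern(string inputText, string pattern)
    {
        return Regex.IsMatch(inputText, pattern);
    }

    static void Main()
    {
        string pattern = "cat";                        // 1. the regular expression pattern
        string inputText = "The cat sat on the mat.";  // 2. the input text to parse

        Console.WriteLine(ContainsPattern(inputText, pattern));  // True
        Console.WriteLine(ContainsPattern(inputText, "dog"));    // False
    }
}
```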
     
Basic Syntax
 
Now, it’s time to learn the basic syntax of .NET regular expressions, so that we can create and use them in our C# programs.
 
Special Characters
 
As I told you earlier, regular expressions are built from characters, some of which have special meanings.
 
The table below lists some of the most commonly used special characters, based on the MSDN documentation.
 
Character Description
\b Matches a word boundary, that is, the position at the beginning or end of a word.
\d Matches any digit character.
\t Matches a tab character.
\n Matches a newline character.
\s Matches any white-space character.
. Matches any single character except the newline character.
\w Matches any word character: a letter, a digit, or the underscore.
^ Matches the position at the beginning of the whole string.
$ Matches the position at the end of the whole string (or just before a final newline).
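
To get a feel for these characters, the following sketch counts how many times each one matches in a short sample string. The class name, the helper method, and the sample string are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class SpecialCharactersSketch
{
    // Counts how many times the pattern matches in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    static void Main()
    {
        // The \t inside this ordinary C# string literal is a real tab character.
        string input = "Room 42\tis open";

        Console.WriteLine(CountMatches(input, @"\d"));      // 2  (the digits 4 and 2)
        Console.WriteLine(CountMatches(input, @"\t"));      // 1  (the tab)
        Console.WriteLine(CountMatches(input, @"\s"));      // 3  (two spaces and the tab)
        Console.WriteLine(CountMatches(input, @"\w"));      // 12 (every letter and digit)
        Console.WriteLine(CountMatches(input, @"."));       // 15 (every character; there is no newline)
        Console.WriteLine(CountMatches(input, @"^Room"));   // 1  (Room at the start of the string)
        Console.WriteLine(CountMatches(input, @"open$"));   // 1  (open at the end of the string)
        Console.WriteLine(CountMatches(input, @"\bis\b"));  // 1  (is as a whole word)
    }
}
```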
 
Before we see some more special characters, let me explain some simple examples where we will use the above characters so that you can feel more comfortable. Before proceeding, you need a basic understanding of the Regex class and its methods.
 
Basic Understanding of Regex Class
 
Regex is a standard .NET class that represents a regular expression. More precisely, a Regex object represents an immutable regular expression: the pattern is passed to the constructor as a string, and once the object has been created, its pattern cannot be changed. The String class in .NET is immutable in the same sense: once a value has been assigned to a string object, that value cannot be modified.
 
Now we know that the Regex class represents a regular expression. To use it in our program, we need to create an instance of it so that we can find matches and do more with our text inputs. To create an instance, we use one of its constructors, which takes the regular expression pattern string as an argument.
    Regex regex = new Regex(@"\bimportant\b");
Methods of Regex
 
Here, I will explain some of the most commonly used Regex class methods.
 
Method Description
IsMatch(String) Returns a Boolean value that indicates whether the regular expression specified in the Regex instance finds a match in the input text: true if a match is found, false otherwise.
Matches(String) Finds all the matches of the regular expression in the input text and returns them as a MatchCollection object.
Replace(String, String) Replaces every match of the regular expression in the input text with the specified replacement string and returns the resulting string.
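
Here is a short sketch that calls the three methods on the same Regex instance. The class name, the field name, and the sample text are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexMethodsSketch
{
    // One Regex instance, reused by all three calls; \d matches a single digit.
    public static readonly Regex Digit = new Regex(@"\d");

    static void Main()
    {
        string input = "Order 7 ships in 3 days";

        // IsMatch: is there at least one digit in the text?
        Console.WriteLine(Digit.IsMatch(input));          // True

        // Matches: collect every digit that was found.
        MatchCollection matches = Digit.Matches(input);
        Console.WriteLine(matches.Count);                 // 2

        // Replace: swap every digit for a '#' character.
        Console.WriteLine(Digit.Replace(input, "#"));     // Order # ships in # days
    }
}
```

Because a Regex object is immutable, sharing one instance across calls like this is safe.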
 
If you want to learn more about the Regex class and explore all of its constructors, properties, and methods, see its documentation on MSDN.
 
Explaining Some Simple Expressions with Examples
  1. “important”: taken literally, it will find ‘important’ exactly as written
     
    This pattern, which is the simplest form of a regular expression, matches the 9 characters ‘i’, ‘m’, ‘p’, ‘o’, ‘r’, ‘t’, ‘a’, ‘n’, ‘t’ in exactly that sequence. It does not care what comes before or after the sequence, so it also matches inside longer words such as unimportant and very-important, which is not what we want.
     
     
    The example above shows the weak point of our expression. Now, let’s improve it so that we get what we actually want.
     
  2. “\bimportant\b”: now it will find ‘important’ only as a whole word
     
    We have improved our expression by adding ‘\b’ before and after it. As you already know from the table above, ‘\b’ is a special character that matches the position at the beginning or end of a word. It tells the regular expression engine that the match must start at the beginning of a word and stop at the end of a word.
    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class JustFind
    {
        static void Main(string[] args)
        {
            // \b on both sides restricts the match to the whole word "important".
            string pattern = @"\bimportant\b";
            string inputString = "Some important text to find unimportant stuff";
            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
            MatchCollection matches = regex.Matches(inputString);
            Console.WriteLine("\tAll Matches");
            foreach (Match match in matches)
            {
                Console.WriteLine(match.Value);
            }
            Console.ReadLine(); // keep the console window open
        }
    }
    ```
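    Result
     
    The program prints the heading All Matches followed by a single match, important. The word unimportant is skipped because the sequence ‘important’ inside it does not start at a word boundary.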
     
  3. Example of the ‘\s’ character
     
    This example shows the purpose and use of the ‘\s’ character. As mentioned in the special characters table, ‘\s’ matches any white-space character in the text. Using ‘\s’, we will create a regular expression that replaces the spaces between words with the ‘_’ character.
    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class JustFind
    {
        static void Main(string[] args)
        {
            // \s matches a single white-space character.
            string pattern = @"\s";
            string inputString = "Some important text to find unimportant stuff";
            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
            // Replace every white-space character with an underscore.
            string result = regex.Replace(inputString, "_");
            Console.WriteLine(result);
            Console.ReadLine(); // keep the console window open
        }
    }
    ```
    Result: the program prints Some_important_text_to_find_unimportant_stuff
     
     
  4. Finding the number of words in a string ("\b\w+\b")
     
    In this example, we will write an expression that helps us count the words in a text input. In most text, words are separated by spaces, but we do not need to match the spaces themselves. The expression above skips everything between words and picks up every run of one or more word characters (letters, digits, or underscores) that begins and ends at a word boundary.
     
    Expression Description
    \b matches the position at the beginning of a word
    \b\w matches a word character at the beginning of a word
    \b\w+ matches a word character at the beginning of a word and repeats the previous match 1 or more times (in simple terms, the word we are going to match must contain at least one character)
    \b\w+\b the final \b means the match must also stop at the end of the word
    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class JustFind
    {
        static void Main(string[] args)
        {
            // \b\w+\b matches each whole word in the input.
            string pattern = @"\b\w+\b";
            string inputString = "Some important text to find unimportant stuff";
            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
            // The number of matches is the number of words.
            int totalWords = regex.Matches(inputString).Count;
            Console.WriteLine($"Words Count: {totalWords}");
            Console.ReadLine(); // keep the console window open
        }
    }
    ```
    Result: the program prints Words Count: 7, one for each of the seven words in the input.
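
As a recap of the expressions in this section, the following sketch runs the first, second, and fourth patterns against the same input and prints how many matches each one finds. The class name and the helper method are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class PatternComparisonSketch
{
    // Counts the case-insensitive matches of a pattern in the input.
    public static int CountMatches(string pattern, string input)
    {
        return new Regex(pattern, RegexOptions.IgnoreCase).Matches(input).Count;
    }

    static void Main()
    {
        string input = "Some important text to find unimportant stuff";

        // The plain sequence also matches inside "unimportant".
        Console.WriteLine(CountMatches("important", input));       // 2

        // With word boundaries, only the whole word matches.
        Console.WriteLine(CountMatches(@"\bimportant\b", input));  // 1

        // Every whole word counts as one match.
        Console.WriteLine(CountMatches(@"\b\w+\b", input));        // 7
    }
}
```

The drop from 2 matches to 1 is the effect of adding ‘\b’ on both sides of the pattern.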
     

Summary

 
This part was a basic introduction to regular expressions in .NET. Next in this series: .NET Regular Expressions Demystified Part 2