Regular expressions are a powerful tool for searching, parsing, and manipulating text. They can also be a source of frustration, because the syntax is dense and the patterns are hard to remember. In this article, we will look at three ways to use regular expressions in .NET, with practical examples to help demystify the topic, and compare how each one performs. If you want to improve your skills in this area, check out my book "Rock Your Code: Code & App Performance for Microsoft .NET", available on Amazon.com.
If you’ve never used a regular expression, here is a short description:
A regular expression, often abbreviated as "regex" or "regexp", is a sequence of characters that defines a search pattern. Regular expressions are used to find patterns in text and to manipulate text based on those patterns. They are supported by many programming languages, tools, and applications for tasks such as validation, data extraction, and text manipulation. A pattern is built from literal characters, metacharacters, and quantifiers, which together define the rules for what will match.
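To make those three building blocks concrete, here is a minimal sketch. The class name, the method names, and the "ID-" pattern are hypothetical and only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBasics
{
    // Hypothetical pattern for illustration:
    //   "ID-"  is literal text that must appear exactly as written
    //   \d     is a metacharacter that matches any digit
    //   {3,5}  is a quantifier: three to five of the preceding item
    private const string Pattern = @"ID-\d{3,5}";

    // True when the input contains text that fits the pattern.
    public static bool HasId(string input) => Regex.IsMatch(input, Pattern);

    // Returns the first matching text, or an empty string when there is none.
    public static string FirstId(string input) => Regex.Match(input, Pattern).Value;
}
```

Calling RegexBasics.FirstId("Order ID-1234 shipped") would pull out the "ID-1234" portion, while an input such as "Order ID-12" has too few digits to match.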
The concept of regular expressions can be traced back to the 1950s, when mathematician Stephen Kleene formalized regular sets and regular languages. In the 1960s, Ken Thompson, a computer scientist at Bell Labs, built one of the first implementations of regular expressions as part of the QED text editor. The syntax was later standardized and popularized in the Unix world by tools such as grep, sed, and awk.
In the Beginning
Regular expressions have been part of .NET since the first version. The first step is to come up with the pattern that will be used for matching or replacing strings. For example, this is the pattern used throughout this article. Note that although the method in the examples is named ContainsWord, the pattern actually matches text shaped like an email address (characters, an @ sign, then a dotted domain).
\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
This is how we use that pattern in code.
public static bool ContainsWord(string input)
{
    var expression = new Regex(
        pattern: @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        options: RegexOptions.CultureInvariant);
    return expression.IsMatch(input);
}
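A quick way to see what this pattern accepts is to run it against a few inputs. The sketch below is only an illustration: the class name, the method name, and the sample strings are hypothetical and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Same pattern as above, parsed on every call just like the first example.
    public static bool Matches(string input)
    {
        var expression = new Regex(
            pattern: @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
            options: RegexOptions.CultureInvariant);

        // IsMatch is not anchored, so a match anywhere in the input counts.
        return expression.IsMatch(input);
    }
}
```

With this sketch, "someone@example.com" and "Contact: someone@example.com today" both match, while "hello world" (no @ sign) and "a@b" (no dotted domain) do not.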
Then Came a Better Way
Regular expressions are fast, but creating a new Regex object on every call means the pattern is parsed again each time. The “magic” for better performance is to create the Regex once and reuse it from a static field in a class. First, we move the construction to a field like this.
private static readonly Regex _containsWordRegEx =
    new(pattern: @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        options: RegexOptions.CultureInvariant);
Then we use the field instead of creating a new Regex object whenever needed.
public static bool ContainsWord(string input) =>
    _containsWordRegEx.IsMatch(input);
As described in my code performance book, this dramatically increases performance (see the benchmark results below).
Using the Regex Source Generator in .NET 7!
Source generators were introduced in .NET 5. With the release of .NET 7, the team added a source generator for regular expressions to increase performance even more. To use it, declare a partial method that returns Regex inside a partial class and decorate it with the GeneratedRegex attribute.
public static partial class RegexExamples
{
    [GeneratedRegex(@"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.CultureInvariant)]
    private static partial Regex ContainsWordRegex();
}
This pattern, along with the generator, produces almost 1,000 lines of code! I can’t show it here, but it’s all code you could have written manually. I doubt any manager would give you the time to write and test it! Since I like good documentation, I appreciate that it even adds helpful info for IntelliSense.
Then I created a method to use the source generator code.
public static bool ContainsWord(string input) =>
    ContainsWordRegex().IsMatch(input);
Benchmark Tests
But is using a source generator faster? Well, let’s look at the performance for all three of these ways to use a regular expression.
In my benchmark results, using the source generator is 7 times faster than using a field and over 35 times faster than the normal way of coding regular expressions! Also, using the generator or field allocates zero bytes in memory, while the normal way allocates 6,696 bytes.
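If you want a rough feel for the difference on your own machine, a simple Stopwatch loop is enough to compare the first two approaches. This is only a sketch with hypothetical class and method names, not the benchmark code behind the numbers above. For real measurements, a dedicated tool such as BenchmarkDotNet handles warm-up and memory statistics far better than a hand-rolled loop.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexTimingSketch
{
    private const string Pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Created once and reused, as in the field-based approach.
    private static readonly Regex Cached = new(Pattern, RegexOptions.CultureInvariant);

    // Builds a new Regex on every call, as in the first approach.
    public static bool MatchWithNewInstance(string input) =>
        new Regex(Pattern, RegexOptions.CultureInvariant).IsMatch(input);

    // Reuses the cached instance.
    public static bool MatchWithField(string input) => Cached.IsMatch(input);

    // Runs the matcher the requested number of times and returns the elapsed time.
    public static TimeSpan Time(Func<string, bool> matcher, string input, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            matcher(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For example, calling RegexTimingSketch.Time(RegexTimingSketch.MatchWithNewInstance, "someone@example.com", 10_000) and then the same with MatchWithField gives two timings to compare. The exact numbers will vary by machine and runtime.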
Now let’s look at the performance of a regular expression that finds whitespace in a string, using "\s+" as the pattern, so that each run of whitespace can be replaced.
This shows that using the source generator is 1.63 times faster than the field and 1.85 times faster than using the normal way. The generator and field allocate 1,960 bytes in memory while the normal way allocates 4,536 bytes.
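For reference, the whitespace scenario might look something like the sketch below when written with the field-based approach. The class name, the method name, and the choice to replace each run of whitespace with a single space are assumptions for illustration; the benchmark's actual replacement code is not shown in this article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCleaner
{
    // \s+ matches one or more whitespace characters (spaces, tabs, line breaks).
    private static readonly Regex WhitespaceRegex =
        new(@"\s+", RegexOptions.CultureInvariant);

    // Replaces every run of whitespace with a single space (an assumed choice).
    public static string CollapseWhitespace(string input) =>
        WhitespaceRegex.Replace(input, " ");
}
```

Because Replace has to build a new string whenever it finds a match, some allocation is expected here even when the Regex itself is reused.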
After reading this, are you going to refactor all your code that uses regular expressions?
Caution
I'd like to share a few quirks I ran into when using this generator. First, I have found that adding the RegexOptions.Compiled option wipes out almost all the performance gain from the generator. Judging by the code it generates, it already does the equivalent of that option, so there is no need to specify it.
Second, while working on the source generator method, I kept seeing messages like this one: “Partial method must have an implementation part because it has accessibility modifiers.”
I kept thinking something was wrong, but the code just needed to be regenerated. Choose Clean from the Build menu, then build again, and that will clear it up. I'm not sure if all generators do this, but this one does.
Summary
To summarize, optimizing regular expressions in .NET 7 can significantly improve the performance of your code. By following the tips mentioned in this article, you can ensure that your regular expressions are processed efficiently and avoid potential performance bottlenecks. Remember to benchmark your code to ensure you are getting the most out of your optimizations.
Do you have any experiences or questions related to optimizing performance in .NET? Please share them in the comments below. I'd love to hear from you.