I made a comment yesterday on an answer where someone had used `[0123456789]` in a regular expression rather than `[0-9]` or `\d`. I said it was probably more efficient to use a range or the digit specifier than a character set.
I decided to test that out today and found, to my surprise, that (in the C# regex engine, at least) `\d` appears to be less efficient than either of the other two, which don't seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters, 5077 of which actually contain a digit:
```
Regular expression \d           took 00:00:00.2141226 result: 5077/10000
Regular expression [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first
```
It's a surprise to me for two reasons:

- I would have thought the range would be implemented much more efficiently than the set.
- I can't understand why `\d` is worse than `[0-9]`. Is there more to `\d` than simply being shorthand for `[0-9]`? (A quick way to probe that is sketched below.)
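One way to probe the second point is to feed each pattern a non-ASCII digit. This is a minimal sketch; `\u0663` (Arabic-Indic digit three) is just an example input, and the commented results reflect my understanding that .NET's `\d` matches any Unicode decimal digit by default:

```csharp
using System;
using System.Text.RegularExpressions;

class DigitProbe
{
    static void Main()
    {
        // U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside the ASCII range.
        string arabicThree = "\u0663";

        // If \d were pure shorthand for [0-9], both lines would print the same value.
        Console.WriteLine(Regex.IsMatch(arabicThree, @"\d"));   // True  (Unicode decimal digits)
        Console.WriteLine(Regex.IsMatch(arabicThree, "[0-9]")); // False (ASCII digits only)
    }
}
```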
Here is the test code:
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);

            var strings = new List<string>();
            // 10K random strings of 1000 lowercase letters each
            for (var i = 0; i < 10000; i++)
            {
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    sb.Append((char)('a' + rand.Next(26)));
                }
                // in roughly half of the strings, overwrite one character with a digit
                if (rand.Next(2) == 0)
                {
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = MeasureRegex(strings, @"\d");
            Console.WriteLine();

            var testTime = MeasureRegex(strings, "[0-9]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);

            testTime = MeasureRegex(strings, "[0123456789]");
            Console.WriteLine(" {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan MeasureRegex(List<string> strings, string regex)
        {
            var sw = new Stopwatch();
            int successes = 0;
            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}
```
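A variant that might isolate the cause: `RegexOptions.ECMAScript` restricts `\d` to `[0-9]`, so timing `\d` under that option should show whether Unicode digit handling accounts for the gap. A minimal standalone sketch (the input string mirrors the shape of the strings in the main test; I haven't claimed these timings, it's just the measurement idea):

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

class EcmaScriptTiming
{
    static void Main()
    {
        // One random 1000-char lowercase string with a digit dropped in.
        var rand = new Random(1234);
        var sb = new StringBuilder();
        for (var c = 0; c < 1000; c++)
        {
            sb.Append((char)('a' + rand.Next(26)));
        }
        sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
        var input = sb.ToString();

        // Default semantics: \d matches any Unicode decimal digit.
        Time("default", new Regex(@"\d"), input);
        // ECMAScript semantics restrict \d to [0-9].
        Time("ECMAScript", new Regex(@"\d", RegexOptions.ECMAScript), input);
    }

    static void Time(string label, Regex rex, string input)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < 10000; i++)
        {
            rex.Match(input);
        }
        sw.Stop();
        Console.WriteLine("{0,-10} took {1}", label, sw.Elapsed);
    }
}
```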