How does String.Contains work? [duplicate]

匿名 (未验证) 提交于 2019-12-03 08:44:33

问题:

Possible Duplicate:
What algorithm .Net use for searching a pattern in a string?

I have a loop in my program that gets a line from a file. Then there is a check to whether the line contains a string

if(line.Contains("String")) {     //Do other stuff }

There are over 2 million rows in the file so if I can quicken the speed by even 1/10th millisecond then this would save me over 3 minutes on each run.

So... Say a line is 1000 chars long, is it quicker to look for a short or long string, or does it not make a difference?

line.Contains("ABCDEFGHIJKLMNOPQRSTUVWXYZ");

or

line.Contains("ABCDEFG")

Thank you in advance.

回答1:

String.Contains() follows a torturous route through System.Globalization.CompareInfo into the CLR and the NLS support sub-system where I thoroughly got lost. This contains highly optimized code with impressive perf. The only way to do it faster is through pinvoking the standard CRT function wcsstr, available in msvcrt.dll

    [DllImport("msvcrt.dll", CharSet = CharSet.Unicode)]     private static extern IntPtr wcsstr(string toSearch, string toFind);

It returns IntPtr.Zero if the string wasn't found. I made some measurements, using String.IndexOf() instead of Contains() to test the various string comparison options. All times are in nanoseconds, searching for a string of 7 characters in a string of 60 characters. With the string not present so worst case is measured. The lowest time out of a sample of 20 was used:

StringComparison.Ordinal (same as Contains) : 245 nanoseconds StringComparison.OrdinalIgnoreCase : 327 StringComparison.InvariantCulture : 251 StringComparison.InvariantCultureIgnoreCase : 327 StringComparison.CurrentCulture : 275 StringComparison.CurrentCultureIgnoreCase : 340 wcsstr : 213

Very impressive numbers and on par with what you'd expect these functions to require. The wcsstr() function does the same kind of ordinal comparison as String.Compare(). It is only 13% faster, a statistically insignificant improvement given that real perf is not likely to approach these measurements due to the effects of CPU cache locality. I can only conclude that you're going about as fast as you can expect. Whether the slight improvement from wcsstr is worth it is up to you.



回答2:

"Generally the algorithm gets faster as the key being searched for becomes longer"

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm



回答3:

If .NET String.Contains uses the Boyer-Moore algorithm it is faster to search a longer string.

May I suggest that if you are reading a delimited file so for example each line is a series of text fields separated by commas that you could save some searching time by not searching from position 0 in the line if you know it can't be before character 40 for example.

It is quite common to split a line using the String Spilt function on a delimiter character this returns an array of Strings. Then you would search only within the field where the value could appear. This will be quicker too.



回答4:

When it comes to increasing performance, you want to make sure that what you intend to improve is actually the cause of the bottleneck. I would start by profiling the performance of the various pieces in my code and finding which piece introduces the greatest bottleneck.

At first blush, I see three main components of this process that are worth profiling:

  • Loop through the entire contents of the file (this may or may not include File IO considerations, which you may wish to test separately).
  • Check String.Contains() for each line in the file.
  • "Do Stuff" when there is a match

While there are great commercial profilers out there, you can start by using a simple timer (like System.Diagnostics.Stopwatch) in your code, or, if the process is really long, simply use watch to measure time.

Measure the time of each of the following.

  1. Measure the time it takes simply to look through the entire file without doing anything else. This isolates your looping and IO code so you can see how well it performs.
  2. Next, add only the string comparison (String.Contains) and measure the total time it takes with this added.
  3. Lastly, add in the "do stuff" code and measure the total time with this added.

You can now compare the cost of adding in each component to figure out how expensive that component is and whether or not you need to improve it. For example, let's say you recorded the following times:

Test           Total Time   Cost (Difference) ============================================= Do Nothing     0s           0s Loop Only      100s         100s Add Comparison 105s         5s Add Do Stuff   130s         25s

Looking at those (fake) numbers, the most expensive part of the process is the loop and IO code, so that's where I would begin to try to increase performance. Since the "Do Stuff" part added 25 seconds to the overall execution time, I would take a look at that code next to see if there was anything I could improve. Lastly, I would look at string comparison.

Of course, the time measurements I provided are fictitious, but you can apply the same steps to your performance profiling in your application. The key is to identify the greatest costs and try to reduce those first.



回答5:

To my surprise, a compiled Regex is actually much, much faster than Contains. Of course, the cost of compiling is only worth it if you search for the same string multiple times. But if so, it's more than two times faster. Try it yourself:

static void Test() {   var random = new Random(10);    var alphabet = "abcdefghijklmnopqrstuvwxyz";   var content = new String((from x in Enumerable.Range(0, 10000000)                             select a[random.Next(0, a.Length)]).ToArray());    var searchString = content.Substring(5000000, 4096);    var regex = new Regex(searchString);    var sw = Stopwatch.StartNew();   for (int i = 0; i < 1000; i++)   {     if (!regex.IsMatch(content))     {       throw new Exception();     }   }    sw.Stop();   Console.WriteLine("Regex: " + sw.Elapsed);   sw.Restart();    for (int i = 0; i < 1000; i++)   {     if (!content.Contains(searchString))     {       throw new Exception();     }   }   sw.Stop();   Console.WriteLine("String.Contains: " + sw.Elapsed);  }

I still haven't figured out how it can be this fast, looking at the compiled assembly, it's an obfuscated mess of switch statements. But it's fast, really fast.



回答6:

The String.Contains method returns true if and only if this string contains the specified sequence of char values.

It is basically used to check sub string within the string.

http://csharp.net-informations.com/string/csharp-string-contains.htm



回答7:

Contains, like all the string searching methods (I think), eventually leads to the external method InternalFindNLSStringEx - whatever that does, it's fairly ungoogleable.



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!