Question:
This is the code:
    StringBuilder sb = new StringBuilder();
    Regex rgx = new Regex("[^a-zA-Z0-9 -]");
    var words = Regex.Split(textBox1.Text, @"(?=(?<=[^\s])\s+\w)");
    for (int i = 0; i < words.Length; i++)
    {
        words[i] = rgx.Replace(words[i], "");
    }
When I call Regex.Split(), the resulting words also contain strings with extra characters in them, for example:
Daniel>
or
Hello:
or
\r\nNew
or
hello---------------------------
I need to get only the words, without all the signs.
So I tried the loop above, but I end up with many entries in words that are just "",
and some entries that contain only ------------------------.
I can't use these as strings later in my code.
Answer 1:
You don't need a regex to strip non-letters. This removes every character that is not a Unicode letter:
    public string RemoveNonUnicodeLetters(string input)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            if (Char.IsLetter(c))
                sb.Append(c);
        }
        return sb.ToString();
    }
Alternatively, if you only want to allow Latin letters, you can use this:
    public string RemoveNonLatinLetters(string input)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            // Keep only the ASCII letters a-z and A-Z.
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                sb.Append(c);
        }
        return sb.ToString();
    }
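To connect this back to the question, here is a minimal sketch (not part of the original answer) that splits the text on whitespace, strips everything except letters and digits from each piece, and skips pieces that end up empty. That way inputs like Daniel>, Hello: or a run of hyphens no longer leave "" or dash-only entries behind. The names WordCleaner and ExtractWords are hypothetical, and keeping digits is an assumption based on the character class in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WordCleaner
{
    // Hypothetical helper: returns only the non-empty, cleaned words.
    public static List<string> ExtractWords(string text)
    {
        var words = new List<string>();

        // A null separator array splits on any whitespace, including \r and \n.
        string[] pieces = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        foreach (string piece in pieces)
        {
            var sb = new StringBuilder();
            foreach (char c in piece)
            {
                // Assumption: digits are kept, as in the question's [^a-zA-Z0-9 -].
                if (Char.IsLetterOrDigit(c))
                    sb.Append(c);
            }

            // Pieces made up only of signs (e.g. "-----") become empty; drop them.
            if (sb.Length > 0)
                words.Add(sb.ToString());
        }
        return words;
    }
}
```

With the input "Daniel> Hello:\r\nNew hello--------- ---" this should yield the four entries Daniel, Hello, New and hello.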
Benchmark vs Regex
    public static string RemoveNonUnicodeLetters(string input)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            if (Char.IsLetter(c))
                sb.Append(c);
        }
        return sb.ToString();
    }

    static readonly Regex nonUnicodeRx = new Regex("\\P{L}");

    public static string RemoveNonUnicodeLetters2(string input)
    {
        return nonUnicodeRx.Replace(input, "");
    }

    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();
        StringBuilder sb = new StringBuilder();

        //generate guids as input
        for (int j = 0; j < 1000; j++)
        {
            sb.Append(Guid.NewGuid().ToString());
        }
        string input = sb.ToString();

        sw.Start();
        for (int i = 0; i < 1000; i++)
        {
            RemoveNonUnicodeLetters(input);
        }
        sw.Stop();
        Console.WriteLine("SM: " + sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < 1000; i++)
        {
            RemoveNonUnicodeLetters2(input);
        }
        sw.Stop();
        Console.WriteLine("RX: " + sw.ElapsedMilliseconds);
    }
Output in milliseconds, three runs (SM = String Manipulation, RX = Regex):
    SM: 581 RX: 9882
    SM: 545 RX: 9557
    SM: 664 RX: 10196
Answer 2:
do consider it. But as I’ve argued in the comments, regular expressions are actually the correct tool for the job, you’re just making it unnecessarily complicated. The actual solution is a one-liner:
var result = Regex.Replace(input, "\\P{L}", "");
\P{…} specifies a Unicode character class we do not want to match (the opposite of \p{…}). L is the Unicode character class for letters.
Of course it makes sense to encapsulate this in a method, as keyboardP did. To avoid constructing the regular expression over and over again, you should also consider pulling the regex creation out of the method body (although this probably won't have a big impact on performance):
    static readonly Regex nonUnicodeRx = new Regex("\\P{L}");

    public static string RemoveNonUnicodeLetters(string input)
    {
        return nonUnicodeRx.Replace(input, "");
    }
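As a small illustration of the two character classes described above, the sketch below uses \P{L} to delete non-letters and, as an alternative that is not in the original answer, \p{L}+ with Regex.Matches to pull out each run of letters as a separate word. The class name LetterRegexDemo and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LetterRegexDemo
{
    // \P{L}: any single character that is NOT a Unicode letter.
    static readonly Regex NonLetters = new Regex(@"\P{L}");

    // \p{L}+: one or more consecutive Unicode letters.
    static readonly Regex LetterRuns = new Regex(@"\p{L}+");

    // Deletes every non-letter, gluing the remaining letters together.
    public static string StripNonLetters(string input)
    {
        return NonLetters.Replace(input, "");
    }

    // Returns each run of letters as its own entry, so no empty strings appear.
    public static List<string> LetterWords(string input)
    {
        var words = new List<string>();
        foreach (Match m in LetterRuns.Matches(input))
            words.Add(m.Value);
        return words;
    }
}
```

For example, StripNonLetters("Hello: world-42") gives "Helloworld", while LetterWords("Daniel> Hello:\r\nNew") gives Daniel, Hello and New as three entries. Note that both variants drop digits, unlike the character class used in the question.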
Answer 3:
To help Konrad and keyboardP resolve their differences, I ran a benchmark test using their code. It turns out that keyboardP's code is 10x faster than Konrad's code:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = "asdf234!@#*advfk234098awfdasdfq9823fna943";

                DateTime start = DateTime.Now;
                for (int i = 0; i < 100000; i++)
                {
                    RemoveNonUnicodeLetters(input);
                }
                Console.WriteLine(DateTime.Now.Subtract(start).TotalSeconds);

                start = DateTime.Now;
                for (int i = 0; i < 100000; i++)
                {
                    RemoveNonUnicodeLetters2(input);
                }
                Console.WriteLine(DateTime.Now.Subtract(start).TotalSeconds);
            }

            public static string RemoveNonUnicodeLetters(string input)
            {
                StringBuilder sb = new StringBuilder();
                foreach (char c in input)
                {
                    if (Char.IsLetter(c))
                        sb.Append(c);
                }
                return sb.ToString();
            }

            public static string RemoveNonUnicodeLetters2(string input)
            {
                var result = Regex.Replace(input, "\\P{L}", "");
                return result;
            }
        }
    }
I got the following output (in seconds):
    0.12
    1.2
UPDATE:
To see if it is the Regex compilation that is slowing down the Regex method, I put the regex in a static variable that is only constructed once.
    static Regex rex = new Regex("\\P{L}");

    public static string RemoveNonUnicodeLetters2(string input)
    {
        var result = rex.Replace(input, m => "");
        return result;
    }
But this had no effect on the runtime.
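One further variant that could be measured is RegexOptions.Compiled, which asks .NET to compile the pattern instead of interpreting it. Stopwatch is also generally a better fit than DateTime.Now for timing short intervals. The sketch below is only a harness under those assumptions and makes no claim about what the numbers will be. The names LetterBenchmark, WithLoop, WithCompiledRegex, TimeMs and Run are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class LetterBenchmark
{
    // Assumption: RegexOptions.Compiled is the variant worth comparing here.
    static readonly Regex CompiledRx = new Regex(@"\P{L}", RegexOptions.Compiled);

    // The string-manipulation approach from the answers above.
    public static string WithLoop(string input)
    {
        var sb = new StringBuilder();
        foreach (char c in input)
        {
            if (Char.IsLetter(c))
                sb.Append(c);
        }
        return sb.ToString();
    }

    // The regex approach, using a regex that is built once with RegexOptions.Compiled.
    public static string WithCompiledRegex(string input)
    {
        return CompiledRx.Replace(input, "");
    }

    // Times 'iterations' calls of f after one untimed warm-up call.
    public static long TimeMs(Func<string, string> f, string input, int iterations)
    {
        f(input); // warm-up, so JIT and regex setup costs are not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            f(input);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Prints one timing per approach; call LetterBenchmark.Run() from Main.
    public static void Run()
    {
        string input = "asdf234!@#*advfk234098awfdasdfq9823fna943";
        Console.WriteLine("Loop:  " + TimeMs(WithLoop, input, 100000) + " ms");
        Console.WriteLine("Regex: " + TimeMs(WithCompiledRegex, input, 100000) + " ms");
    }
}
```

Both methods should return the same string for the same input, which is worth checking before comparing timings.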