Regular expression to find and remove duplicate words

前端未结


Using regular expressions in C#, is there any way to find and remove duplicate words or symbols in a string containing a variety of words and symbols?

Ex.


9 Answers

轻奢々

2020-11-30 09:52

You won't be able to use regular expressions for this problem, because regex only matches regular languages. The pattern you are trying to match is context-sensitive, and therefore not "regular."

Fortunately, it is easy enough to write a parser. Have a look at Per Erik Stendahl's code.


旧时难觅i

2020-11-30 09:54

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

See When not to use Regex in C# (or Java, C++ etc)

Of course, using a regex to split the string into words may be a useful first step; however, String.Split() is clear and is likely to do everything you need.
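A minimal sketch of the String.Split() route, assuming the goal is to keep the first occurrence of each word and compare case-insensitively. The class name, method name, and separator list are hypothetical and not from the answer above; note that this version throws the punctuation away.

```csharp
using System;
using System.Collections.Generic;

public static class SplitDedupe
{
    // Hypothetical helper: keeps the first occurrence of each word.
    public static string RemoveDuplicateWords(string text)
    {
        // Assumed separator set; extend it to suit your input.
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':' };

        // Case-insensitive set, so "The" counts as a repeat of "the".
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var kept = new List<string>();

        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // HashSet.Add returns false when the word was already seen.
            if (seen.Add(word))
            {
                kept.Add(word);
            }
        }

        // Punctuation is lost: the surviving words are rejoined with single spaces.
        return string.Join(" ", kept);
    }
}
```

Calling SplitDedupe.RemoveDuplicateWords("I like the environment. The environment is good.") returns "I like the environment is good". If the punctuation has to survive, a replace-in-place approach is a better fit than splitting.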


孤街浪徒

2020-11-30 09:56

As said by others, you need more than a regex to keep track of words:

    var words = new HashSet<string>();
    string text = "I like the environment. The environment is good.";
    text = Regex.Replace(text, "\\w+", m =>
        words.Add(m.Value.ToUpperInvariant()) ? m.Value : String.Empty);
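One thing to watch with the snippet above: each removed word leaves its surrounding spaces behind, so the result contains runs of whitespace. Here is a sketch that wraps the same idea in a method and collapses those runs afterwards. The class and method names are hypothetical, and the whitespace cleanup is my own assumption about what the output should look like.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CallbackDedupe
{
    // Hypothetical wrapper around the HashSet + Regex.Replace idea.
    public static string RemoveRepeatedWords(string text)
    {
        var seen = new HashSet<string>();

        // Keep a word the first time its upper-cased form is added; blank out repeats.
        string stripped = Regex.Replace(text, @"\w+",
            m => seen.Add(m.Value.ToUpperInvariant()) ? m.Value : string.Empty);

        // Assumed cleanup step: collapse the whitespace runs left by removed words.
        return Regex.Replace(stripped, @"\s{2,}", " ").Trim();
    }
}
```

With the sample sentence this gives "I like the environment. is good." Punctuation is preserved, but only the duplicate words themselves are dropped, so the remaining text may not read as a sentence.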


一个人的身影

2020-11-30 10:00

Well, Jeff has shown me how to use the magic of in-expression backreferences and the global modifier to make this one happen, so my original answer is inoperative. You should all go vote for Jeff's answer. However, for posterity I'll note that there's a tricky little regex engine sensitivity issue in this one, and if you were using Perl-flavored regex, you would need to do this:

\b(\S+)\b(?=.*\b\1\b.*)

instead of Jeff's answer, because C#'s regex will effectively capture \b in \1 but PCRE will not.
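Jeff's answer is not reproduced on this page, so the following is only a sketch of how a lookahead-with-backreference pattern like the one above could be applied from C#. It swaps \S+ for \w+ so that punctuation is not captured as part of a word; that change, and the class and method names, are my assumptions. Because the lookahead asks "does this word appear again later?", it removes the earlier occurrences and keeps the last one.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadDedupe
{
    // Hypothetical helper: deletes every word that occurs again later in the same line.
    public static string RemoveEarlierDuplicates(string text)
    {
        // \b(\w+)\b      captures one whole word
        // (?=.*\b\1\b)   succeeds only if the same word appears again further on
        // '.' does not match newlines by default, so duplicates are only found within a line.
        return Regex.Replace(
            text,
            @"\b(\w+)\b(?=.*\b\1\b)",
            string.Empty,
            RegexOptions.IgnoreCase);
    }
}
```

For example, LookaheadDedupe.RemoveEarlierDuplicates("a b a c b") returns "  a c b": the first "a" and the first "b" are removed and the spaces around them stay where they were. The lookahead rescans the rest of the line for every word, so this can get slow on long inputs.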


轮回少年

2020-11-30 10:00

Regex is not suited for everything, and your problem is one of the things it is not suited for. I would advise you to use a parser instead.


清酒与你

2020-11-30 10:01

Regular expressions would be a poor choice of "tools" to solve this problem. Perhaps the following could work:

    HashSet<string> corpus = new HashSet<string>();
    char[] split = new char[] { ' ', '\t', '\r', '\n', '.', ';', ',', ':', ... };
    foreach (string line in inputLines)
    {
        string[] parts = line.Split(split, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            corpus.Add(part.ToUpperInvariant());
        }
    }
    // 'corpus' now contains all of the unique tokens

EDIT: This is me making a big assumption that you're "lexing" for some sort of analysis like searching.
