Using regular expressions in C#, is there any way to find and remove duplicate words or symbols in a string containing a variety of words and symbols?
Ex.
You won't be able to use regular expressions for this problem, because regex only matches regular languages. The pattern you are trying to match is context-sensitive, and therefore not "regular."
Fortunately, it is easy enough to write a parser. Have a look at Per Erik Stendahl's code.
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
See When not to use Regex in C# (or Java, C++ etc)
Of course, using a regex to split the string into words may be a useful first step; however, String.Split() is clearer and is likely to do everything you need.
As said by others, you need more than a regex to keep track of words:
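    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Words seen so far, upper-cased so that "The" and "the" count as the same word.
    var words = new HashSet<string>();
    string text = "I like the environment. The environment is good.";
    // HashSet.Add returns false for a repeat, so a repeated word is replaced with nothing.
    text = Regex.Replace(text, "\\w+", m =>
        words.Add(m.Value.ToUpperInvariant())
            ? m.Value
            : String.Empty);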
Well, Jeff has shown me how to use the magic of in-expression backreferences and the global modifier to make this one happen, so my original answer is inoperative. You should all go vote for Jeff's answer. However, for posterity I'll note that there's a tricky little regex engine sensitivity issue in this one, and if you were using Perl-flavored regex, you would need to do this:
\b(\S+)\b(?=.*\b\1\b.*)
instead of Jeff's answer, because C#'s regex will effectively capture \b in \1 but PCRE will not.
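For anyone who wants to try the backreference idea in C#, here is a minimal sketch that feeds the pattern quoted above to Regex.Replace. The helper name RemoveEarlierDuplicates and the choice of RegexOptions.IgnoreCase are my own assumptions, not part of Jeff's answer, and the snippet assumes a compiler that supports top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name is an assumption): deletes every word that
// occurs again later in the same line, so only the LAST occurrence survives.
static string RemoveEarlierDuplicates(string text)
{
    // (\S+) captures a token; the lookahead (?=...) only succeeds when the
    // same token (\1) appears again as a whole word further along.
    const string pattern = @"\b(\S+)\b(?=.*\b\1\b.*)";
    // IgnoreCase is an assumption so that "the" and "The" count as duplicates.
    return Regex.Replace(text, pattern, string.Empty, RegexOptions.IgnoreCase);
}

Console.WriteLine(RemoveEarlierDuplicates("I like the environment. The environment is good."));
```

Two things to keep in mind: this keeps the last occurrence of each word rather than the first, and the removed words leave their surrounding spaces behind. Because . does not match a newline by default, duplicates are only detected within a single line unless you add RegexOptions.Singleline.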
Regex is not suited for everything, and your problem is one of the things it is not suited for. I would advise you to use a parser instead.
Regular expressions would be a poor choice of "tools" to solve this problem. Perhaps the following could work:
    HashSet<string> corpus = new HashSet<string>();
    char[] split = new char[] { ' ', '\t', '\r', '\n', '.', ';', ',', ':', ... };
    foreach (string line in inputLines)
    {
        string[] parts = line.Split(split, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            corpus.Add(part.ToUpperInvariant());
        }
    }
    // 'corpus' now contains all of the unique tokens
EDIT: This is me making a big assumption that you're "lexing" for some sort of analysis like searching.
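If the goal is instead to remove the repeats from the text itself, the same HashSet idea can be turned into a filter. This is only a sketch under a few assumptions of mine: tokens are separated by whitespace, comparison is case-insensitive, the first occurrence is the one to keep, and the helper name RemoveDuplicateTokens is made up for illustration. It also assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (name is an assumption): keeps the first occurrence of
// each whitespace-separated token and drops any later repeat.
static string RemoveDuplicateTokens(string text)
{
    // OrdinalIgnoreCase makes "b" and "B" count as the same token.
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    var kept = new List<string>();
    char[] separators = { ' ', '\t', '\r', '\n' };

    foreach (string token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    {
        // Add returns true only the first time a token is seen.
        if (seen.Add(token))
        {
            kept.Add(token);
        }
    }

    // Rejoining with single spaces discards the original spacing.
    return string.Join(" ", kept);
}

Console.WriteLine(RemoveDuplicateTokens("a + b + a - B")); // prints "a + b -"
```

Because it splits on whitespace only, this treats symbols such as + as ordinary tokens, but punctuation attached to a word ("environment.") stays part of that token. Add more separator characters if that is not what you want.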