Using regular expressions in C#, is there any way to find and remove duplicate words or symbols in a string containing a variety of words and symbols?
Ex.
Have a look at backreferences:
http://msdn.microsoft.com/en-us/library/thwdfzxy(VS.71).aspx
This is a regex that will find doubled words. It only finds one doubled word per match, though, so you have to apply it repeatedly until it no longer matches.
new Regex( @"(.*)\b(\w+)\b(.*)(\2)(.*)", RegexOptions.IgnoreCase );
Of course this is not the best solution (see other answers, which propose not to use a regex at all). But you asked for a regex - here is one. Maybe just the idea helps you ...
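To make the "use it more than once" part concrete, here is a minimal sketch of the loop. The class and method names are made up for illustration. I have also added \b around the backreference and anchored the pattern with ^ and $, which the pattern above does not do; that is my assumption, so that "the" is not treated as a repeat inside "other". Leftover whitespace is collapsed at the end.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoubledWords
{
    // Groups: 1 = text before the word, 2 = the word,
    // 3 = text between the two occurrences, 4 = text after the repeat.
    // The \b around \2 is an added assumption (whole-word repeats only).
    private static readonly Regex Doubled = new Regex(
        @"^(.*)\b(\w+)\b(.*)\b\2\b(.*)$",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string RemoveOneAtATime(string input)
    {
        string current = input;

        // Each pass drops exactly one repeated word, so the string
        // gets strictly shorter and the loop terminates.
        while (Doubled.IsMatch(current))
        {
            current = Doubled.Replace(current, "${1}${2}${3}${4}", 1);
        }

        // Removing a word leaves its surrounding spaces behind; tidy them up.
        return Regex.Replace(current, @"\s+", " ").Trim();
    }
}
```

Calling DoubledWords.RemoveOneAtATime("apple apple orange orange red") should give "apple orange red". Because of the leading greedy (.*), this backtracks a lot on long inputs, so treat it as a demonstration rather than something to run on large text.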
This seems to work for me
(\b\S+\b)(?=.*\1)
Matches like so
apple apple orange orange red blue green orange green blue pirates ninjas cowboys ninjas pirates
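If it helps, here is a rough sketch of how a lookahead pattern like this could be plugged into Regex.Replace. It is a variation rather than the exact pattern above: I am assuming \w+ instead of \S+, word boundaries around the backreference, and a trailing \s* so the removed word takes its following space with it. The class and method names are hypothetical. Note that a word is removed whenever the same word appears later, so it is the last occurrence of each word that survives.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDedup
{
    public static string KeepLastOccurrence(string input)
    {
        // Match a whole word plus any whitespace after it, but only when
        // the same whole word occurs again somewhere later in the input.
        string stripped = Regex.Replace(
            input,
            @"\b(\w+)\b\s*(?=.*\b\1\b)",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return stripped.Trim();
    }
}
```

For the sample line above, this should leave "apple red orange green blue cowboys ninjas pirates".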
As others have pointed out, this is doable with backreferences. See http://msdn.microsoft.com/nb-no/library/thwdfzxy(en-us).aspx for the details on how to use backreferences in .NET.
Your requirement to remove punctuation as well makes it a bit more complicated, but I think a pattern along these lines (whitespace is not significant in that regex) should do the trick:
(\b\w+(?:\s+\w+)*)\s+\1
I've not tested the regex at all, but it should match one or more whitespace-separated words that are immediately repeated. You'll have to add some more logic to allow for punctuation and so on.
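For what it's worth, a small sketch of how that pattern might be used to collapse immediately repeated words or phrases. The names are hypothetical, and I have added a \b after the backreference (an assumption on my part) so that "the then" is not seen as "the" repeated. Punctuation is still not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AdjacentRepeats
{
    public static string Collapse(string input)
    {
        // Group 1 is one or more whitespace-separated words; the pattern
        // matches when that same run of words follows again immediately.
        // Replacing the match with group 1 keeps a single copy.
        return Regex.Replace(
            input,
            @"\b(\w+(?:\s+\w+)*)\s+\1\b",
            "${1}",
            RegexOptions.IgnoreCase);
    }
}
```

With this, AdjacentRepeats.Collapse("this is this is a test test") should return "this is a test". Repeats that are not adjacent, such as "red blue red", are left alone.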