Regular expression to find and remove duplicate words

孤城傲影 2020-11-30 09:46

Using regular expressions in C#, is there any way to find and remove duplicate words or symbols in a string containing a variety of words and symbols?

Ex.

9 answers
  • 2020-11-30 10:07

    Have a look at backreferences:
    http://msdn.microsoft.com/en-us/library/thwdfzxy(VS.71).aspx

    This is a regex that will find doubled words. It only picks up one repeated word per match, though, so you have to apply it more than once.

    new Regex( @"(.*)\b(\w+)\b(.*)(\2)(.*)", RegexOptions.IgnoreCase );
    

    Of course this is not the best solution (see other answers, which propose not to use a regex at all). But you asked for a regex - here is one. Maybe just the idea helps you ...
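
The "use it more than once" idea could be sketched as a loop that keeps replacing until the string stops changing. This is only a sketch under assumptions: it swaps the pattern above for a shorter variant with `\b` around the backreference (so that `the` is not found inside `other`), it drops the later copy of each word, and the names `RepeatedPassDedup` and `RemoveDuplicateWords` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedPassDedup
{
    // A whole word, a lazily matched gap, then the same whole word again.
    // Assumption: a shortened variant of the answer's pattern, not the original.
    private static readonly Regex LaterRepeat = new Regex(
        @"\b(\w+)\b(.*?)\s+\b\1\b", RegexOptions.IgnoreCase);

    // Hypothetical helper: repeats the replacement until a pass changes nothing.
    // Every successful pass shortens the string, so the loop terminates.
    public static string RemoveDuplicateWords(string input)
    {
        string current = input;
        while (true)
        {
            // Keep the first word and the gap, drop the later copy.
            string next = LaterRepeat.Replace(current, "$1$2");
            if (next == current)
            {
                return current;
            }
            current = next;
        }
    }

    public static void Main()
    {
        // prints: orange red blue green
        Console.WriteLine(RemoveDuplicateWords("orange red blue green orange green blue"));
    }
}
```

A single `Replace` call only handles non-overlapping matches, which is why the sample input needs several passes before every repeat is gone.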

  • 2020-11-30 10:15

    This seems to work for me

    (\b\S+\b)(?=.*\1)
    

    It matches every word that appears again later on the same line, in inputs like these:

    apple apple orange  
    orange red blue green orange green blue  
    pirates ninjas cowboys ninjas pirates  
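
A rough sketch of how a lookahead like this could be used to strip the duplicates in C#. Assumptions to note: the pattern is a variant of the one above that uses `\w+`, puts `\b` around the backreference, and swallows the whitespace after each match; the names `LookaheadDedup` and `KeepLastOccurrence` are hypothetical. Because the lookahead flags a word only when a copy follows it, the last occurrence of each word is the one that survives.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDedup
{
    // A whole word plus trailing whitespace, but only if the same whole word
    // occurs again later on the line. Assumption: variant of the answer's pattern.
    private static readonly Regex EarlierCopy = new Regex(
        @"\b(\w+)\b\s*(?=.*\b\1\b)");

    // Hypothetical helper: deletes every earlier copy, keeping the last occurrence.
    public static string KeepLastOccurrence(string input)
    {
        return EarlierCopy.Replace(input, "");
    }

    public static void Main()
    {
        // prints: red orange green blue
        Console.WriteLine(KeepLastOccurrence("orange red blue green orange green blue"));
    }
}
```

The lookahead always inspects the original text, so one `Replace` call is enough here, at the cost of reordering the words that remain.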
    
  • As others have pointed out, this is doable with backreferences. See http://msdn.microsoft.com/nb-no/library/thwdfzxy(en-us).aspx for the details on how to use backreferences in .NET.

    Your additional requirement of removing punctuation as well makes it a bit more complicated, but I think a regex along these lines (whitespace is not significant in that regex) should do the trick:

    (\b\w+(?:\s+\w+)*)\s+\1
    

    I've not tested the regex at all, but it should match one or more whitespace-separated words that are immediately repeated. You'll have to add some more logic to allow for punctuation and so on.
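
As a hedged illustration, the pattern could be plugged into `Regex.Replace` roughly as below. This sketch makes a few assumptions that are not in the answer: the phrase group is made lazy, a `\b` is added after the backreference (so `the theory` is left alone), the repeat is wrapped in `(?: ... )+` so that runs of three or more collapse at once, and the names `AdjacentRepeatDedup` and `CollapseRepeats` are invented. Punctuation is still not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AdjacentRepeatDedup
{
    // A word or short phrase, followed one or more times by whitespace and
    // the same phrase again. Assumption: adapted from the answer's pattern.
    private static readonly Regex ImmediateRepeat = new Regex(
        @"\b(\w+(?:\s+\w+)*?)(?:\s+\1\b)+");

    // Hypothetical helper: replaces each run of repeats with a single copy.
    public static string CollapseRepeats(string input)
    {
        return ImmediateRepeat.Replace(input, "$1");
    }

    public static void Main()
    {
        // prints: apple orange
        Console.WriteLine(CollapseRepeats("apple apple apple orange"));
    }
}
```

Only immediate repeats are collapsed, and a repeat nested inside another repeat (for example `a a b a a b`) would need a second pass.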
