Regex in java for finding duplicate consecutive words

后端未结

关注

 6  1871

I saw this as an answer for finding repeated words in a string. But when I use it, it thinks This and is are the same and deletes the is


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  無奈伤痛        
                
              
                            
                2020-12-14 19:47
              
            
            
                                                                       
\b(\w+)(\b\W+\1\b)*


Explanation:

\b : Any word boundary <br/>(\w+) : Select any word character (letter, number, underscore)


Once all the words are selected, now it's time to select the common words.

( : Grouping starts<br/>
\b : Any word boundary<br/>
\W+ : Any non-word character<br/>
\1 : Select repeated words<br/>
\b : Un select if it repeated word is joined with another word<br/>
) : Grouping ends


Reference : Example
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  粉色の甜心        
                
              
                            
                2020-12-14 19:48
              
            
            
                                                                       
Try this one:

String pattern = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);

String input = "your string";
Matcher m = r.matcher(input);
while (m.find()) {
    input = input.replaceAll(m.group(), m.group(1));
}
System.out.println(input);


The Java regular expressions are explained very well in the API documentation of the Pattern class. After adding some spaces to indicate the different parts of the regular expression:

"(?i) \\b ([a-z]+) \\b (?: \\s+ \\1 \\b )+"

\b       match a word boundary
[a-z]+   match a word with one or more characters;
         the parentheses capture the word as a group    
\b       match a word boundary
(?:      indicates a non-capturing group (which starts here)
\s+      match one or more white space characters
\1       is a back reference to the first (captured) group;
         so the word is repeated here
\b       match a word boundary
)+       indicates the end of the non-capturing group and
         allows it to occur one or more times

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  心在旅途        
                
              
                            
                2020-12-14 19:51
              
            
            
                                                                       
The below pattern will match duplicate words even with any number of occurrences. 

Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); 


For e-g, "This is is my my my pal pal pal pal pal pal pal pal"
will output "This is my pal"

Also, Only one iteration with "while (m.find())" is enough with this pattern.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情歌与酒        
                
              
                            
                2020-12-14 19:51
              
            
            
                                                                       
I believe this is the regular expression you should be using to detect 2 consecutive words separated by any number of non-word characters:

Pattern p = Pattern.compile("\\b(\\w+)\\b\\W+\\b\\1\\b", Pattern.CASE_INSENSITIVE);

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  时光取名叫无心        
                
              
                            
                2020-12-14 19:55
              
            
            
                                                                       
you should have used \b(\w+)\b\s+\b\1\b, click here to see the result...

Hope this is what you want...

Update 1

Well well well, the output that you have is 

the final string after removing duplicates

import java.util.regex.*;

public class MyDup {
    public static void main (String args[]) {
    String input="This This is text text another another";
    String originalText = input;
    String output = "";
    Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(input);
    System.out.println(m);
    if (!m.find())
        output = "No duplicates found, no changes made to data";
    else
    {
        while (m.find())
        {
            if (output == "") {
                output = input.replaceFirst(m.group(), m.group(1));
            } else {
                output = output.replaceAll(m.group(), m.group(1));
            }
        }
        input = output;
        m = p.matcher(input);
        while (m.find())
        {
            output = "";
            if (output == "") {
                output = input.replaceAll(m.group(), m.group(1));
            } else {
                output = output.replaceAll(m.group(), m.group(1));
            }
        }
    }
    System.out.println("After removing duplicate the final string is " + output);
}


Run this code and see what you get as output... Your queries will be solved...

Note

In output you are replacing duplicate by single word... Isn't it??

When I put System.out.println(m.group() + " : " + m.group(1)); in first if condition I get output as text text : text i.e. duplicates are replacing by single word.

else
    {
        while (m.find())
        {
            if (output == "") {
                System.out.println(m.group() + " : " + m.group(1));
                output = input.replaceFirst(m.group(), m.group(1));
            } else {


Hope you got now what is going on... :)

Good Luck!!! Cheers!!!
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情话喂你        
                
              
                            
                2020-12-14 19:56
              
            
            
                                                                       
if unicodes are important than you should use this: 

 Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
        Pattern.MULTILINE + Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CHARACTER_CLASS)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复