Regex in Java for finding duplicate consecutive words

时光说笑 2020-12-14 19:26

I saw this as an answer for finding repeated words in a string. But when I use it, it treats "This" and "is" as the same word and deletes the "is".
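The pattern the asker used is not shown, so the following is only a guess at how this symptom can arise. It is a minimal sketch that assumes a hypothetical pattern without word boundaries, `(\w+)\s+\1`, and compares it with a variant anchored by `\b`. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical pattern without word boundaries (an assumption, not taken from the question).
    static final Pattern NO_BOUNDARY = Pattern.compile("(\\w+)\\s+\\1");
    // Same idea, but \b forces the capture and the repetition to be whole words.
    static final Pattern WITH_BOUNDARY = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Replace every match with its first captured word.
    static String collapse(Pattern p, String text) {
        return p.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "This is a test test";
        // Without \b, the "is" at the end of "This" pairs up with the word "is".
        System.out.println(collapse(NO_BOUNDARY, text));   // This a test
        System.out.println(collapse(WITH_BOUNDARY, text)); // This is a test
    }
}
```

Without the boundaries, the capture is free to start in the middle of "This", which is why the standalone "is" disappears.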

6 answers
  • 2020-12-14 19:47
    \b(\w+)(\b\W+\1\b)*
    

    Explanation:

    \b : Any word boundary
    (\w+) : One or more word characters (letters, digits, underscore), captured as group 1
    

    Once a word has been captured, the rest of the pattern looks for repetitions of it.

    ( : Grouping starts
    \b : Any word boundary
    \W+ : One or more non-word characters
    \1 : The word captured by group 1, i.e. the repeated word
    \b : Word boundary; rejects the match if the repeated word is only the start of a longer word
    )* : Grouping ends; the group may repeat zero or more times
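As a rough illustration of the breakdown above, here is a small Java sketch that applies this pattern and replaces each match with group 1. The class name `RepeatedWords` and the method `collapse` are made up for this example. Note that the backslashes have to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class RepeatedWords {
    // \b(\w+)(\b\W+\1\b)* written as a Java string literal.
    static final Pattern REPEATS = Pattern.compile("\\b(\\w+)(\\b\\W+\\1\\b)*");

    // Replace each word plus its repetitions with the first occurrence.
    static String collapse(String text) {
        return REPEATS.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // \W+ also swallows punctuation between the repeated words.
        System.out.println(collapse("hello, hello world world")); // hello world
        // The trailing \b keeps "the" from matching the start of "theme".
        System.out.println(collapse("the theme"));                // the theme
    }
}
```

Because the group is quantified with `*`, the pattern also matches every non-repeated word on its own; replacing such a match with `$1` leaves it unchanged.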
    


  • 2020-12-14 19:48

    Try this one:

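    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // (?i) already makes the match case-insensitive, so no extra flag is needed.
    String pattern = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
    Pattern r = Pattern.compile(pattern);

    String input = "your string";
    Matcher m = r.matcher(input);
    // Replace every run of repeated words with its first word (group 1).
    // Matcher.replaceAll avoids re-interpreting the matched text as a new regex,
    // which String.replaceAll(m.group(), ...) would do.
    input = m.replaceAll("$1");
    System.out.println(input);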
    

    The Java regular expressions are explained very well in the API documentation of the Pattern class. After adding some spaces to indicate the different parts of the regular expression:

    "(?i) \\b ([a-z]+) \\b (?: \\s+ \\1 \\b )+"
    
    (?i)     enable case-insensitive matching
    \b       match a word boundary
    [a-z]+   match a word of one or more letters;
             the parentheses capture the word as a group
    \b       match a word boundary
    (?:      indicates a non-capturing group (which starts here)
    \s+      match one or more white space characters
    \1       is a back reference to the first (captured) group;
             so the word is repeated here
    \b       match a word boundary
    )+       indicates the end of the non-capturing group and
             allows it to occur one or more times
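To see what the whole match and the captured group contain, the breakdown above can be checked with a short sketch like the following. The class `GroupInspector` and the method `describeMatches` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupInspector {
    static final Pattern DUPLICATES =
            Pattern.compile("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+");

    // Collect "whole match -> captured word" for every run of repeated words.
    static List<String> describeMatches(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DUPLICATES.matcher(text);
        while (m.find()) {
            // group() is the entire run; group(1) is the first word of the run.
            found.add(m.group() + " -> " + m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("Hello hello world world world again")) {
            System.out.println(line);
        }
        // Hello hello -> Hello
        // world world world -> world
    }
}
```

Since the group is non-capturing, `group(1)` always holds the first word of the run, including its original capitalization.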
    
  • 2020-12-14 19:51

    The pattern below matches a word together with any number of consecutive repetitions of it.

    Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
    

    For example, "This is is my my my pal pal pal pal pal pal pal pal" becomes "This is my pal".

    Also, with this pattern a single "while (m.find())" loop is enough.
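The example sentence can be checked with a sketch along these lines. It uses `Matcher.replaceAll("$1")` instead of a `find()` loop, which is a swapped-in simplification rather than the answerer's own code. The names `SinglePass` and `collapse` are made up for the illustration.

```java
import java.util.regex.Pattern;

class SinglePass {
    static final Pattern REPEATS = Pattern.compile(
            "\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // One pass: every word plus its repetitions becomes the first occurrence.
    static String collapse(String text) {
        return REPEATS.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "This is is my my my pal pal pal pal pal pal pal pal";
        System.out.println(collapse(text)); // This is my pal
    }
}
```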

  • 2020-12-14 19:51

    I believe this is the regular expression you should use to detect two consecutive identical words separated by one or more non-word characters:

    Pattern p = Pattern.compile("\\b(\\w+)\\b\\W+\\b\\1\\b", Pattern.CASE_INSENSITIVE);
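One thing worth knowing about a pair-only pattern: a single replacement pass shortens a run of three words to two, not to one. The sketch below shows this and one possible workaround, repeating the replacement until nothing changes. The names `PairOnly`, `collapseOnce` and `collapseFully` are hypothetical.

```java
import java.util.regex.Pattern;

class PairOnly {
    static final Pattern PAIR =
            Pattern.compile("\\b(\\w+)\\b\\W+\\b\\1\\b", Pattern.CASE_INSENSITIVE);

    // One pass: each matched pair becomes a single word.
    static String collapseOnce(String text) {
        return PAIR.matcher(text).replaceAll("$1");
    }

    // Repeat the pass until the text stops changing.
    static String collapseFully(String text) {
        String previous;
        do {
            previous = text;
            text = collapseOnce(text);
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        System.out.println(collapseOnce("my my my pal"));  // my my pal
        System.out.println(collapseFully("my my my pal")); // my pal
    }
}
```

Quantifying the repeated part of the pattern, as some of the other answers do, avoids the extra passes.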
    
  • 2020-12-14 19:55

    You should have used \b(\w+)\b\s+\b\1\b.

    Hope this is what you want...

    Update 1

    Well, the output you get is the final string after removing duplicates:

    import java.util.regex.*;

    public class MyDup {
        public static void main(String[] args) {
            String input = "This This is text text another another";
            String output = "";
            Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(input);
            // This find() consumes the first match ("This This"), so the loop
            // below starts at the second match; the second pass handles the first.
            if (!m.find()) {
                output = "No duplicates found, no changes made to data";
            } else {
                while (m.find()) {
                    if (output.isEmpty()) {  // compare by content, not with ==
                        output = input.replaceFirst(m.group(), m.group(1));
                    } else {
                        output = output.replaceAll(m.group(), m.group(1));
                    }
                }
                // Second pass: continue from the first-pass result, or from the
                // unchanged input if the loop above never ran.
                if (!output.isEmpty()) {
                    input = output;
                }
                output = input;
                m = p.matcher(input);
                while (m.find()) {
                    output = output.replaceAll(m.group(), m.group(1));
                }
            }
            System.out.println("After removing duplicates, the final string is " + output);
        }
    }


    Run this code and check its output; it should answer your question.

    Note

    In the output, each duplicate is replaced by a single word, isn't it?

    When I put System.out.println(m.group() + " : " + m.group(1)); in the first if condition, I get text text : text as output, i.e. the duplicate is replaced by a single word.

            } else {
                while (m.find()) {
                    if (output.isEmpty()) {
                        System.out.println(m.group() + " : " + m.group(1));
                        output = input.replaceFirst(m.group(), m.group(1));
                    } else {

    Hope you got now what is going on... :)
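For comparison, a more compact variant is sketched here. It is not the code from this answer: it assumes a slightly different pattern in which the repeated part is quantified with `+`, so a single `Matcher.replaceAll` call covers runs of any length. The class name `MyDupCompact` and the method `removeDuplicates` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MyDupCompact {
    // The "+" lets one match cover a whole run such as "text text text".
    static final Pattern RUN =
            Pattern.compile("\\b(\\w+)\\b(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    static String removeDuplicates(String input) {
        Matcher m = RUN.matcher(input);
        if (!m.find()) {
            return "No duplicates found, no changes made to data";
        }
        // replaceAll resets the matcher first, so the find() above loses nothing.
        return m.replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("This This is text text another another"));
        // This is text another
        System.out.println(removeDuplicates("This is text"));
        // No duplicates found, no changes made to data
    }
}
```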

    Good Luck!!! Cheers!!!

  • 2020-12-14 19:56

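    If Unicode text matters, use this instead; UNICODE_CHARACTER_CLASS makes \w and \W follow the Unicode definition of a word character rather than ASCII only:

     Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS)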
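The effect of the extra flag can be seen with non-ASCII input. The sketch below is only an illustration with made-up names (`UnicodeRepeats`, `collapse`). It uses the Cyrillic text "привет Привет мир", written with `\u` escapes so the source file stays ASCII.

```java
import java.util.regex.Pattern;

class UnicodeRepeats {
    static final String REGEX = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";

    // Default behaviour: \w only matches [a-zA-Z_0-9].
    static final Pattern ASCII_ONLY =
            Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
    // With the flag, \w matches Unicode letters and digits; the flag also
    // implies UNICODE_CASE, so case-insensitive matching is Unicode-aware.
    static final Pattern UNICODE = Pattern.compile(
            REGEX, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

    static String collapse(Pattern p, String text) {
        return p.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // "privet Privet mir" in Cyrillic letters, differing only in case.
        String text = "\u043f\u0440\u0438\u0432\u0435\u0442 "
                + "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440";
        System.out.println(collapse(ASCII_ONLY, text).equals(text)); // true: nothing matched
        System.out.println(collapse(UNICODE, text));                 // duplicate removed
    }
}
```

According to the `Pattern` documentation, this flag may carry a performance penalty, so it is probably best enabled only when the input can contain non-ASCII words.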
    