All overlapping substrings matching a Java regex

巧了我就是萌 submitted on 2019-11-29 11:02:29

I faced a similar situation and tried the answers above, but in my case repeatedly setting the start and end index of the matcher took too much time. I think I've found a better solution, so I'm posting it here for others. My code snippet is below.

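if (textToParse != null) {
    Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
    // Keep searching until the matcher reports that it hit the end of the input.
    while (!matcher.hitEnd()) {
        boolean result = matcher.find();
        // groupCount() is the number of capturing groups in the pattern,
        // so it is the same for every match.
        int count = matcher.groupCount();
        System.out.println("Result " + result + " count " + count);
        if (result && count == 1) {
            // Group 1 holds the merge field name inside the placeholder.
            mergeFieldName = matcher.group(1);
            mergeFieldNames.add(mergeFieldName);
        }
    }
}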

I used the matcher.hitEnd() method to check whether I have reached the end of the text.
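
For comparison, here is a minimal self-contained sketch of the same extraction written with the more common while (matcher.find()) loop. The definition of PLACEHOLDER_PATTERN is not shown in this answer, so the ${name} placeholder syntax, the class name MergeFieldExtractor, and the method extractMergeFields are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MergeFieldExtractor {
    // Assumed placeholder syntax: ${name}. The real pattern is not shown in the answer.
    private static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("\\$\\{(\\w+)\\}");

    static List<String> extractMergeFields(String textToParse) {
        List<String> mergeFieldNames = new ArrayList<>();
        if (textToParse == null) {
            return mergeFieldNames;
        }
        Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
        // find() returns false once there is no further match, which ends the loop.
        while (matcher.find()) {
            mergeFieldNames.add(matcher.group(1));
        }
        return mergeFieldNames;
    }

    public static void main(String[] args) {
        String text = "Dear ${firstName} ${lastName}, order ${orderId} has shipped.";
        System.out.println(extractMergeFields(text));
    }
}
```

With the sample text above this should print [firstName, lastName, orderId]. Note that this loop, like the hitEnd() version, only returns non-overlapping matches.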

Hope this helps. Thanks!

This is doable in O(n) only if you specify the range of allowed number lengths.

Let's say from 2-4 digits (numbers 00-9999): (?=(\\d{2}))(?=(\\1\\d)?)(?=(\\2\\d)?)

This is a zero-length assertion via positive lookahead, with each lookahead captured into a group. The result is an array of all 2-4 digit strings that can be found within the regex input, together with duplicates and empty strings (for captures that did not match).

I am not a Java developer, but I believe a Perl script can be read as an example as well.

#!/usr/bin/perl                                       # perl script
use List::MoreUtils qw/ uniq /;                       # uniq subroutine library
$_ = '04/31 412-555-1235';                            # input
my @n = uniq (/(?=(\d{2}))(?=(\1\d)?)(?=(\2\d)?)/g);  # regex (single slash in Perl)
print "$_\n" for grep(/\S/, @n);                      # print non-empty lines

The trick is using backreferences. If you would like to capture 2-5 digit strings, you would need one more positive lookahead in the regex: (?=(\\d{2}))(?=(\\1\\d)?)(?=(\\2\\d)?)(?=(\\3\\d)?).

I believe this is the closest approach you can make. If this works for you, drop a comment and hopefully some Java developer will edit my answer with Java code for the above script.
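
A possible Java translation of the Perl script is sketched below. It is not the answer author's code: the class name OverlappingDigitRuns and the method findDigitRuns are made up for illustration, and a LinkedHashSet stands in for Perl's uniq. In Java, matcher.group(n) returns null for a group that did not participate, so those are skipped instead of filtering empty lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingDigitRuns {
    // Same regex as the Perl script, with backslashes doubled for a Java string literal.
    private static final Pattern TWO_TO_FOUR_DIGITS =
            Pattern.compile("(?=(\\d{2}))(?=(\\1\\d)?)(?=(\\2\\d)?)");

    static List<String> findDigitRuns(String input) {
        // LinkedHashSet drops duplicates while keeping first-seen order (like uniq).
        Set<String> unique = new LinkedHashSet<>();
        Matcher matcher = TWO_TO_FOUR_DIGITS.matcher(input);
        // Every match is zero-length, so find() advances one position at a time.
        while (matcher.find()) {
            for (int g = 1; g <= matcher.groupCount(); g++) {
                String captured = matcher.group(g);
                // Skip groups that did not participate (null) or captured nothing.
                if (captured != null && !captured.isEmpty()) {
                    unique.add(captured);
                }
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Same input as the Perl script.
        findDigitRuns("04/31 412-555-1235").forEach(System.out::println);
    }
}
```

For the sample input this should print 04, 31, 41, 412, 12, 55, 555, 123, 1235, 23, 235 and 35, one per line.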

The closest you can get is something like this.

"(?=((\\d*)\\d))(?=(\\d)\\d*)"

The results will be in capturing groups 1, 2, and 3.

As far as my imagination goes, capturing inside a zero-length assertion is the only viable way I can think of to recapture the same position of a string. Capturing text outside a zero-length assertion consumes the text once and for all (look-behind in Java only works with patterns of bounded length, so it can be considered unusable here).

This solution is not perfect: aside from repetition (of text at same position!) and empty string matches, it won't capture all possible substrings.
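
To make those limitations concrete, here is a small sketch that prints the three groups at every match position. The class name LookaheadCaptureDemo and the method describeMatches are hypothetical helpers, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadCaptureDemo {
    private static final Pattern PATTERN =
            Pattern.compile("(?=((\\d*)\\d))(?=(\\d)\\d*)");

    // Returns one row per match: "<start>: [group1] [group2] [group3]".
    static List<String> describeMatches(String input) {
        List<String> rows = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            rows.add(m.start() + ": [" + m.group(1) + "] [" + m.group(2) + "] [" + m.group(3) + "]");
        }
        return rows;
    }

    public static void main(String[] args) {
        describeMatches("1234").forEach(System.out::println);
    }
}
```

For the input 1234 the rows should be 0: [1234] [123] [1], then 1: [234] [23] [2], then 2: [34] [3] [3], then 3: [4] [] [4]. The digit 3 and the digit 4 each show up twice, group 2 is empty at the last position, and substrings such as 12 never appear.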

One way to capture all possible substrings is to construct the following regex, with the value of n starting from 1:

"(?=(\\d{" + n + "}))"

Then match the string against this regex for increasing values of n until there is no match.
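
The loop described above could look roughly like this. It is a sketch only; the class name IncrementalLengthMatcher and the method allDigitSubstrings are names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncrementalLengthMatcher {
    static List<String> allDigitSubstrings(String input) {
        List<String> result = new ArrayList<>();
        // Try length 1, 2, 3, ... and stop at the first length with no match at all.
        for (int n = 1; ; n++) {
            Matcher m = Pattern.compile("(?=(\\d{" + n + "}))").matcher(input);
            boolean matchedAny = false;
            // The lookahead is zero-length, so every start position is tested.
            while (m.find()) {
                result.add(m.group(1));
                matchedAny = true;
            }
            if (!matchedAny) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allDigitSubstrings("123 45"));
    }
}
```

For the input 123 45 this should print [1, 2, 3, 4, 5, 12, 23, 45, 123]. The input is rescanned once per length, which is where the inefficiency comes from.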

This method is, of course, inefficient compared to matching all numbers with "\\d+" and then extracting every substring of each match.
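
A minimal sketch of that alternative follows: find each maximal run of digits once, then enumerate its substrings with plain string operations. The class name DigitRunSubstrings and the method allDigitSubstrings are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitRunSubstrings {
    private static final Pattern DIGIT_RUN = Pattern.compile("\\d+");

    static List<String> allDigitSubstrings(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DIGIT_RUN.matcher(input);
        // A single regex pass finds every maximal run of digits.
        while (m.find()) {
            String run = m.group();
            // Enumerate every non-empty substring of the run.
            for (int start = 0; start < run.length(); start++) {
                for (int end = start + 1; end <= run.length(); end++) {
                    result.add(run.substring(start, end));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allDigitSubstrings("123 45"));
    }
}
```

For the input 123 45 this should print [1, 12, 123, 2, 23, 3, 4, 45, 5].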
