Matching token sequences

自闭症网瘾萝莉.ら 提交于 2019-12-13 16:02:21

问题


I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.

          11111
012345678901234
ab ab bc  a cba

In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.

Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?

I am hoping to implement this in Java.


回答1:


Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':

(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}

So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.

Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):

public static void find(String haystack, String tokens, int windowLen) {
    char[] tokenChars = tokens.toCharArray();
    int hayLen = haystack.length();

    int pos = 0;
    nextPos:
    while (pos + windowLen <= hayLen) {
        for (char c : tokenChars) {
            int i = haystack.indexOf(c, pos);
            if (i < 0) return;

            if (i - pos >= windowLen) {
                pos = i - windowLen + 1;
                continue nextPos;
            }
        }

        // match found at pos
        System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
        pos++;
    }
}



回答2:


This tested Java program has a commented regex which does the trick:

import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "ab ab bc  a cba";
        Pattern p = Pattern.compile(
            "# Match 5 char sequences containing: a and b and c\n" +
            "(?=[abc])     # Assert first char is a, b or c.\n" +
            "(?=.{0,4}a)   # Assert an 'a' within 5 chars.\n" +
            "(?=.{0,4}b)   # Assert an 'b' within 5 chars.\n" +
            "(?=.{0,4}c)   # Assert an 'c' within 5 chars.\n" +
            ".{5}          # If so, match the 5 chers.", 
            Pattern.COMMENTS);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.print("Match = \""+ m.group() +"\"\n");
        } 
   }
}

Note that there is another valid sequence S9:13" a cb" in your test data (before the S12:14"cba". Assuming you did not want to match this one, I added an additional constraint to filter it out, which requires that the 5 char window must begin with an a, b or c.

Here is the output from the script:

Match = "ab bc"
Match = "a cba"




回答3:


Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:

abc..|ab.c.|ab..c| .... etc.

This can be factorised somewhat:

ab(c..|.c.|..c)|a.(bc.|b.c .... etc.

I'm not sure if you can do better with regex.




回答4:


Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc  a cba";
Matcher m = p.matcher(s);
while (m.find())
{
  System.out.println(m.group());
}

output:

ab bc
 a cb

This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1, \2, \3) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.

The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.



来源:https://stackoverflow.com/questions/5838542/matching-token-sequences

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!