Regex and escaped and unescaped delimiter

我寻月下人不归 2020-12-18 00:38

question related to this

I have a string

a\;b\\;c;d

which in Java looks like

String s = "a\\;b\\\\;c;d"


        
5 Answers
  • 2020-12-18 01:15
    String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");
    

    This should work.

    Explanation :

    // (?<!(?<!\\)\\);
    // 
    // Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
    //    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
    //       Match the character “\” literally «\\»
    //    Match the character “\” literally «\\»
    // Match the character “;” literally «;»
    

    So you just match the semicolons not preceded by exactly one \.
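To see the pattern at work on the string from the question, here is a minimal sketch. The class name `SplitDemo` and the constant name are made up for illustration; only `String.split` and the regex itself come from the answer.

```java
import java.util.Arrays;

final class SplitDemo {
    // Java-literal form of (?<!(?<!\\)\\); as given in the answer.
    static final String UNESCAPED_SEMICOLON = "(?<!(?<!\\\\)\\\\);";

    static String[] split(String s) {
        return s.split(UNESCAPED_SEMICOLON);
    }

    public static void main(String[] args) {
        // The question's input: a\;b\\;c;d
        String s = "a\\;b\\\\;c;d";
        System.out.println(Arrays.toString(split(s)));
    }
}
```

Running `main` prints `[a\;b\\, c, d]`: the semicolon after the single backslash stays inside the first field, while the one after the double backslash acts as a separator.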

    EDIT :

    String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");
    

    This is meant to take care of any odd number of \ in front of the semicolon. It will of course fail if you have more than 4000000 consecutive \. Explanation of the edited answer:

    // (?<!(?<!\\(\\\\){0,2000000})\\);
    // 
    // Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
    //    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
    //       Match the character “\” literally «\\»
    //       Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
    //          Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
    //          Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
    //          Match the character “\” literally «\\»
    //          Match the character “\” literally «\\»
    //    Match the character “\” literally «\\»
    // Match the character “;” literally «;»
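One way to sanity-check the edited pattern is to run it on inputs with one, two and three backslashes in front of the semicolon. The sketch below uses made-up names (`EditedPatternCheck`, `split`); the regex is copied from the edit. As far as I can tell, the inner lookbehind is already satisfied by its zero-repetition case whenever the character two positions before the semicolon is a backslash, because nothing ties it to the start of the backslash run, so the three-backslash input still splits. The `(?:[^&]|^)` prefix used in a later answer is one way to add that anchor.

```java
import java.util.Arrays;

final class EditedPatternCheck {
    // Java-literal form of (?<!(?<!\\(\\\\){0,2000000})\\); copied from the edit.
    static final String EDITED = "(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);";

    static String[] split(String s) {
        return s.split(EDITED);
    }

    public static void main(String[] args) {
        // One backslash: the semicolon is escaped, so there is no split.
        System.out.println(Arrays.toString(split("x\\;y")));
        // Two backslashes: an escaped backslash followed by a real separator.
        System.out.println(Arrays.toString(split("x\\\\;y")));
        // Three backslashes: the semicolon is escaped again, yet this still splits.
        System.out.println(Arrays.toString(split("x\\\\\\;y")));
    }
}
```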
    
  • 2020-12-18 01:16

    This approach assumes that your string does not contain the character '\0'. If it does, use some other placeholder character.

    public static String[] split(String s) {
        String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
        for (int i = 0; i < result.length; i++) {
            result[i] = result[i].replaceAll("\0", "\\\\;");
        }
        return result;
    }
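If '\0' might occur in the input, the placeholder can be chosen at run time instead of being hard-coded. The following is only a sketch of that idea: `PlaceholderSplit` and `unusedChar` are hypothetical names, and the regex is the one from the method above.

```java
import java.util.regex.Matcher;

final class PlaceholderSplit {
    // Hypothetical helper: returns the first character (starting at '\0') that
    // does not occur in s and is neither the delimiter nor the escape character.
    static char unusedChar(String s) {
        for (char c = 0; c < Character.MAX_VALUE; c++) {
            if (c != ';' && c != '\\' && s.indexOf(c) < 0) {
                return c;
            }
        }
        throw new IllegalArgumentException("no unused character available");
    }

    static String[] split(String s) {
        String placeholder = String.valueOf(unusedChar(s));
        // Same first step as the answer: hide "\;" behind the placeholder, then split.
        String[] result = s
                .replaceAll("([^\\\\])\\\\;", "$1" + Matcher.quoteReplacement(placeholder))
                .split(";");
        for (int i = 0; i < result.length; i++) {
            // String.replace works on literal text, so no regex quoting is needed.
            result[i] = result[i].replace(placeholder, "\\;");
        }
        return result;
    }
}
```

Like the original method, this only protects a `\;` that has some non-backslash character in front of it.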
    
  • 2020-12-18 01:16

    I think this is the real answer. In my case I am splitting on | and the escape character is &.

        final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
        String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
        System.out.println(Arrays.toString(res));
    

    In this code I use a lookbehind to handle the & escape character. Note that the lookbehind must have a bounded maximum length.

    (?<!((?:[^&]|^)(&&){0,10000}&))\\|
    

    This means: any | except those preceded by ((?:[^&]|^)(&&){0,10000}&), which matches an odd number of &s. The (?:[^&]|^) part is important: it makes sure you count all of the &s before the |, back to the start of the string or to some other character.
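Transposed to the original question's backslash escape and semicolon delimiter, the same idea might look like the sketch below. The class and constant names are made up, and the bound of 10000 pairs is simply carried over from the pattern above.

```java
final class AnchoredLookbehindSplit {
    // A semicolon that is NOT preceded by: (a non-backslash or the start of
    // input), then up to 10000 backslash pairs, then one more backslash.
    // Regex form: (?<!(?:[^\\]|^)(?:\\\\){0,10000}\\);
    static final String UNESCAPED_SEMICOLON =
            "(?<!(?:[^\\\\]|^)(?:\\\\\\\\){0,10000}\\\\);";

    static String[] split(String s) {
        return s.split(UNESCAPED_SEMICOLON);
    }
}
```

With this pattern `a\;b\\;c;d` splits into `a\;b\\`, `c` and `d`, while `x\\\;y` (three backslashes) stays in one piece.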

  • 2020-12-18 01:21

    You can use the regex

    (?:\\.|[^;\\]++)*
    

    to match all text between unescaped semicolons:

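    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    List<String> matchList = new ArrayList<String>();
    Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        // Each match is a run of text between unescaped semicolons. The pattern can
        // also match the empty string, so zero-length matches show up as well.
        matchList.add(regexMatcher.group());
    }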
    

    Explanation:

    (?:        # Match either...
     \\.       # any escaped character
    |          # or...
     [^;\\]++  # any character(s) except semicolon or backslash; possessive match
    )*         # Repeat any number of times.
    

    The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
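Because the pattern can match the empty string, a plain `find()` loop also reports zero-length matches next to each separator. One possible way around that, sketched here with made-up names (`FieldMatcher`, `fields`), is to advance through the input by hand and restart the matcher after each semicolon. The `DOTALL` flag is my addition so that `\\.` also covers an escaped line break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FieldMatcher {
    // Same pattern as in the answer, compiled with DOTALL (an assumption, see above).
    private static final Pattern FIELD =
            Pattern.compile("(?:\\\\.|[^;\\\\]++)*", Pattern.DOTALL);

    static List<String> fields(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(s);
        int pos = 0;
        while (true) {
            // The pattern can match the empty string, so find(pos) always
            // succeeds with a match that starts exactly at pos.
            m.find(pos);
            int end = m.end();
            if (end < s.length() && s.charAt(end) != ';') {
                // Only a dangling backslash at the very end stops the match
                // before a semicolon; keep it as part of the last field.
                end = s.length();
            }
            out.add(s.substring(pos, end));
            if (end >= s.length()) {
                return out;
            }
            pos = end + 1; // step over the unescaped semicolon
        }
    }
}
```

Unlike `String.split`, this keeps empty fields, including a trailing one, so `a;;b` yields three fields.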

  • 2020-12-18 01:22

    I don't trust any kind of regular expression to detect those cases. I usually write a simple loop for such things. I'll sketch it in C, since it's ages since I last touched Java ;-)

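    #include <stdio.h>
    #include <string.h>

    /* Print the offset of every semicolon that is not escaped by a backslash.
       (The wrapper function name is illustrative.) */
    void report_unescaped(const char *myString) {
        int i, len, state;
        char c;

        /* state is 1 right after an escaping backslash, 0 otherwise */
        for (len = (int)strlen(myString), state = 0, i = 0; i < len; i++) {
            c = myString[i];
            if (state == 0) {
                if (c == '\\') {
                    state++;
                } else if (c == ';') {
                    printf("; at offset %d\n", i);
                }
            } else {
                state--;
            }
        }
    }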
    

    The advantages are:

    1. you can execute semantic actions on each step.
    2. it's quite easy to port it to another language.
    3. you don't need to include the complete regex library just for this simple task, which adds to portability.
    4. it should be a lot faster than the regular expression matcher.
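For the Java setting of the question, the same state machine could be written roughly as follows. This is a sketch, not the answerer's code: `LoopSplit` and its parameter names are invented here, and it collects the fields instead of printing offsets.

```java
import java.util.ArrayList;
import java.util.List;

final class LoopSplit {
    // Splits s on every delimiter that is not preceded by an active escape
    // character. Escape sequences are copied into the fields unchanged.
    static List<String> split(String s, char delimiter, char escape) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true right after an escape character
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                escaped = false;       // this character is taken literally
                current.append(c);
            } else if (c == escape) {
                escaped = true;        // the next character is escaped
                current.append(c);
            } else if (c == delimiter) {
                parts.add(current.toString());
                current.setLength(0);  // start the next field
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }
}
```

Calling `LoopSplit.split("a\\;b\\\\;c;d", ';', '\\')` returns the three fields `a\;b\\`, `c` and `d`, with no limit on the number of consecutive backslashes.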