Regex and escaped and unescaped delimiter

前端未结

关注

 5  1545

question related to this

I have a string

a\\;b\\\\;c;d

which in Java looks like

String s = \"a\\\\;b\\\\\\\\;c;d\"


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  别跟我提以往        
                
              
                            
                2020-12-18 01:15
              
            
            
                                                                       
String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");


This should work.

Explanation : 

// (?<!(?<!\\)\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
//       Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»


So you just match the semicolons not preceded by exactly one \.

EDIT :

String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");


This will take care of any odd number of . It will of course fail if you have more than 4000000 number of \. Explanation of edited answer :

// (?<!(?<!\\(\\\\){0,2000000})\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
//       Match the character “\” literally «\\»
//       Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
//          Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
//          Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
//          Match the character “\” literally «\\»
//          Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情深已故        
                
              
                            
                2020-12-18 01:16
              
            
            
                                                                       
This approach assumes that your string will not have char '\0' in your string.  If you do, you can use some other char.

public static String[] split(String s) {
    String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
    for (int i = 0; i < result.length; i++) {
        result[i] = result[i].replaceAll("\0", "\\\\;");
    }
    return result;
}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情话喂你        
                
              
                            
                2020-12-18 01:16
              
            
            
                                                                       
This is the real answer i think.
In my case i am trying to split using | and escape character is &.

    final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
    String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
    System.out.println(Arrays.toString(res));


In this code i am using Lookbehind  to escape & character.
note that the look behind must have maximum length.

(?<!((?:[^&]|^)(&&){0,10000}&))\\|


this means any | except those that are following ((?:[^&]|^)(&&){0,10000}&)) and this part means any odd number of &s.
the part (?:[^&]|^) is important to make sure that you are counting all of the &s behind the | to the beginning or some other characters.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  傲寒        
                
              
                            
                2020-12-18 01:21
              
            
            
                                                                       
You can use the regex

(?:\\.|[^;\\]++)*


to match all text between unescaped semicolons:

List<String> matchList = new ArrayList<String>();
try {
    Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        matchList.add(regexMatcher.group());
    } 


Explanation:

(?:        # Match either...
 \\.       # any escaped character
|          # or...
 [^;\\]++  # any character(s) except semicolon or backslash; possessive match
)*         # Repeat any number of times.


The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  -上瘾入骨i        
                
              
                            
                2020-12-18 01:22
              
            
            
                                                                       
I do not trust to detect those cases with any kind of regular expression. I usually do a simple loop for such things, I'll sketch it using C since it's ages ago I last touched Java ;-)

int i, len, state;
char c;

for (len=myString.size(), state=0, i=0; i < len; i++) {
    c=myString[i];
    if (state == 0) {
       if (c == '\\') {
            state++;
       } else if (c == ';') {
           printf("; at offset %d", i);
       }
    } else {
        state--;
    }
}


The advantages are:


you can execute semantic actions on each step.
it's quite easy to port it to another language.
you don't need to include the complete regex library just for this simple task, which adds to portability.
it should be a lot faster than the regular expression matcher.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复