How is Guava Splitter.onPattern(..).split() different from String.split(..)?

Anonymous (unverified), submitted 2019-12-03 02:50:02

Question:

I recently harnessed the power of a look-ahead regular expression to split a String:

"abc8".split("(?=\\d)|\\W") 

If printed to the console this expression returns:

[abc, 8] 
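
For anyone reproducing this, here is a minimal sketch of the String.split call using only the JDK. The class and method names (StringSplitDemo, split) are made up for illustration. Note that a String[] has to go through Arrays.toString to print in the [abc, 8] form.

```java
import java.util.Arrays;

class StringSplitDemo {
    // The pattern from the question: a zero-width lookahead before any digit,
    // or a single non-word character.
    static final String REGEX = "(?=\\d)|\\W";

    // Hypothetical helper, only here so the result can be inspected.
    static String[] split(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // Printing a String[] directly shows its type and hash code, not its contents.
        System.out.println(Arrays.toString(split("abc8"))); // [abc, 8]
        System.out.println(Arrays.toString(split("a b8"))); // [a, b, 8]
    }
}
```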

Very pleased with this result, I wanted to transfer this to Guava for further development, which looked like this:

Splitter.onPattern("(?=\\d)|\\W").split("abc8") 

To my surprise the output changed to:

[abc] 

Why?

Answer 1:

You found a bug!

Splitter s = Splitter.onPattern("(?=\\d)|\\W");
System.out.println(s.split("abc82")); // [abc, 8]
System.out.println(s.split("abc8"));  // [abc]

This is the method that Splitter uses to actually split Strings (Splitter.SplittingIterator::computeNext):

@Override
protected String computeNext() {
  /*
   * The returned string will be from the end of the last match to the
   * beginning of the next one. nextStart is the start position of the
   * returned substring, while offset is the place to start looking for a
   * separator.
   */
  int nextStart = offset;
  while (offset != -1) {
    int start = nextStart;
    int end;

    int separatorPosition = separatorStart(offset);

    if (separatorPosition == -1) {
      end = toSplit.length();
      offset = -1;
    } else {
      end = separatorPosition;
      offset = separatorEnd(separatorPosition);
    }

    if (offset == nextStart) {
      /*
       * This occurs when some pattern has an empty match, even if it
       * doesn't match the empty string -- for example, if it requires
       * lookahead or the like. The offset must be increased to look for
       * separators beyond this point, without changing the start position
       * of the next returned substring -- so nextStart stays the same.
       */
      offset++;
      if (offset >= toSplit.length()) {
        offset = -1;
      }
      continue;
    }

    while (start < end && trimmer.matches(toSplit.charAt(start))) {
      start++;
    }
    while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
      end--;
    }

    if (omitEmptyStrings && start == end) {
      // Don't include the (unused) separator in next split string.
      nextStart = offset;
      continue;
    }

    if (limit == 1) {
      // The limit has been reached, return the rest of the string as the
      // final item.  This is tested after empty string removal so that
      // empty strings do not count towards the limit.
      end = toSplit.length();
      offset = -1;
      // Since we may have changed the end, we need to trim it again.
      while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
        end--;
      }
    } else {
      limit--;
    }

    return toSplit.subSequence(start, end).toString();
  }
  return endOfData();
}

The area of interest is:

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset >= toSplit.length()) {
    offset = -1;
  }
  continue;
}

This logic works well unless the empty match falls immediately before the last character of the String. In that case offset++ makes offset equal to toSplit.length(), the >= check sets offset to -1, and the loop exits without ever returning that last character. The check should use > instead of >=:

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset > toSplit.length()) {
    offset = -1;
  }
  continue;
}
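
To see the effect of that one-character change without pulling in Guava, here is a rough standalone sketch of the same loop built on java.util.regex. It is not Guava's code: trimming, limit and omitEmptyStrings are left out, and the names SplitLoopSketch, split and strictBound are invented for this example. strictBound = false mimics the >= check, strictBound = true the proposed > check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitLoopSketch {
    // Simplified re-creation of the loop quoted above (assumption: no trimmer,
    // no limit, empty strings kept).
    static List<String> split(String toSplit, Pattern separator, boolean strictBound) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = separator.matcher(toSplit);
        int offset = 0;     // where to start looking for the next separator
        int nextStart = 0;  // where the piece currently being built begins
        while (offset != -1) {
            int end;
            if (matcher.find(offset)) {
                end = matcher.start();
                offset = matcher.end();
            } else {
                end = toSplit.length();
                offset = -1;
            }
            if (offset == nextStart) {
                // Empty match at the start of the current piece: look one
                // character further on, but keep nextStart where it is.
                offset++;
                boolean pastEnd = strictBound
                        ? offset > toSplit.length()    // proposed fix
                        : offset >= toSplit.length();  // behaviour quoted above
                if (pastEnd) {
                    offset = -1;
                }
                continue;
            }
            parts.add(toSplit.substring(nextStart, end));
            nextStart = offset;
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?=\\d)|\\W");
        System.out.println(split("abc8", p, false));  // [abc]
        System.out.println(split("abc8", p, true));   // [abc, 8]
        System.out.println(split("abc82", p, false)); // [abc, 8]
        System.out.println(split("abc82", p, true));  // [abc, 8, 2]
    }
}
```

With the >= check the loop gives up as soon as offset reaches the length of the input, so the piece starting at nextStart is never emitted. With > it gets one more search from the very end, finds no separator, and returns the remainder.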


Answer 2:

The Guava Splitter seems to have a bug when a pattern produces an empty (zero-width) match. If you try creating a Matcher and printing out what it matches:

Pattern pattern = Pattern.compile("(?=\\d)|\\W");
Matcher matcher = pattern.matcher("abc8");
while (matcher.find()) {
    System.out.println(matcher.start() + "," + matcher.end());
}

You get the output 3,3: a zero-width match at index 3, just before the 8. Splitter splits at that position but then loses the trailing 8, so the result is only abc.

As a workaround you can use e.g. Pattern#split(CharSequence), which gives the expected [abc, 8]:

Pattern.compile("(?=\\d)|\\W").split("abc8") 
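
As a quick check of that workaround, the sketch below lists the match positions and the Pattern#split result for a two-digit input as well. It uses only the JDK. PatternSplitDemo and matchSpans are illustrative names, and the start..end formatting is just a choice made here for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternSplitDemo {
    static final Pattern SEPARATOR = Pattern.compile("(?=\\d)|\\W");

    // Lists every match as "start..end"; equal numbers mean a zero-width match.
    static List<String> matchSpans(String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = SEPARATOR.matcher(input);
        while (matcher.find()) {
            spans.add(matcher.start() + ".." + matcher.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(matchSpans("abc82"));                       // [3..3, 4..4]
        System.out.println(Arrays.toString(SEPARATOR.split("abc8")));  // [abc, 8]
        System.out.println(Arrays.toString(SEPARATOR.split("abc82"))); // [abc, 8, 2]
    }
}
```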

