
Question:

I recently harnessed the power of a look-ahead regular expression to split a String:

"abc8".split("(?=\\d)|\\W")

If printed to the console this expression returns:

[abc, 8]

Very pleased with this result, I wanted to transfer this to Guava for further development, which looked like this:

Splitter.onPattern("(?=\\d)|\\W").split("abc8")

To my surprise the output changed to:

[abc]

Why?

Answer 1:

You found a bug!

Splitter s = Splitter.onPattern("(?=\\d)|\\W"); // assumed: s is the Splitter from the question
System.out.println(s.split("abc82")); // [abc, 8]
System.out.println(s.split("abc8"));  // [abc]

This is the method that Splitter uses to actually split Strings (Splitter.SplittingIterator::computeNext):

@Override
protected String computeNext() {
  /*
   * The returned string will be from the end of the last match to the
   * beginning of the next one. nextStart is the start position of the
   * returned substring, while offset is the place to start looking for a
   * separator.
   */
  int nextStart = offset;
  while (offset != -1) {
    int start = nextStart;
    int end;
    int separatorPosition = separatorStart(offset);
    if (separatorPosition == -1) {
      end = toSplit.length();
      offset = -1;
    } else {
      end = separatorPosition;
      offset = separatorEnd(separatorPosition);
    }
    if (offset == nextStart) {
      /*
       * This occurs when some pattern has an empty match, even if it
       * doesn't match the empty string -- for example, if it requires
       * lookahead or the like. The offset must be increased to look for
       * separators beyond this point, without changing the start position
       * of the next returned substring -- so nextStart stays the same.
       */
      offset++;
      if (offset >= toSplit.length()) {
        offset = -1;
      }
      continue;
    }
    while (start < end && trimmer.matches(toSplit.charAt(start))) {
      start++;
    }
    while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
      end--;
    }
    if (omitEmptyStrings && start == end) {
      // Don't include the (unused) separator in next split string.
      nextStart = offset;
      continue;
    }
    if (limit == 1) {
      // The limit has been reached, return the rest of the string as the
      // final item.  This is tested after empty string removal so that
      // empty strings do not count towards the limit.
      end = toSplit.length();
      offset = -1;
      // Since we may have changed the end, we need to trim it again.
      while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
        end--;
      }
    } else {
      limit--;
    }
    return toSplit.subSequence(start, end).toString();
  }
  return endOfData();
}

The area of interest is:

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset >= toSplit.length()) {
    offset = -1;
  }
  continue;
}

This logic works well unless the empty match falls immediately before the last character of the String. In that case offset++ makes offset equal to toSplit.length(), the >= check sets offset to -1, and the loop ends before the final character is ever returned. The check should use > instead of >=:

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset > toSplit.length()) {
    offset = -1;
  }
  continue;
}
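To see the effect of the one-character change without a Guava dependency, here is a minimal sketch that mimics the offset/nextStart bookkeeping of the loop quoted above using only java.util.regex. The class name EmptyMatchSplitSketch and the strictBound flag are made up for this illustration. Trimming, omitEmptyStrings and limit are left out, so this is a simplified stand-in and not Guava's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified, hypothetical stand-in for the splitting loop; not Guava's class. */
final class EmptyMatchSplitSketch {

    /**
     * Splits input on regex with the same offset/nextStart bookkeeping.
     * strictBound = false reproduces the ">=" check, true uses the ">" check.
     */
    static List<String> split(String input, String regex, boolean strictBound) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int offset = 0;     // where to start looking for the next separator
        int nextStart = 0;  // where the next returned piece begins
        while (offset != -1) {
            int start = nextStart;
            int end;
            if (matcher.find(offset)) {
                end = matcher.start();
                offset = matcher.end();
            } else {
                end = input.length();
                offset = -1;
            }
            if (offset == nextStart) {
                // Empty match at the start of the current piece: look further on.
                offset++;
                boolean pastEnd = strictBound
                        ? offset > input.length()
                        : offset >= input.length();
                if (pastEnd) {
                    offset = -1;
                }
                continue;
            }
            parts.add(input.substring(start, end));
            nextStart = offset;
        }
        return parts;
    }

    public static void main(String[] args) {
        String regex = "(?=\\d)|\\W";
        System.out.println(split("abc8", regex, false));  // [abc]
        System.out.println(split("abc8", regex, true));   // [abc, 8]
        System.out.println(split("abc82", regex, false)); // [abc, 8]
        System.out.println(split("abc82", regex, true));  // [abc, 8, 2]
    }
}
```

With the >= check, the piece after the last empty match is lost in both inputs. With the > check, the loop gets one more search at offset == input.length(), finds no separator, and returns the remaining text.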

Answer 2:

The Guava Splitter seems to have a bug when a pattern matches an empty string. If you try creating a Matcher and printing out what it matches:

Pattern pattern = Pattern.compile("(?=\\d)|\\W");
Matcher matcher = pattern.matcher("abc8");
while (matcher.find()) {
    System.out.println(matcher.start() + "," + matcher.end());
}

This prints 3,3: a single empty match at index 3, immediately before the 8. Splitter splits at that position but then discards the remaining 8, so the result contains only abc.

As a workaround you can use, for example, Pattern#split(CharSequence), which gives the expected output:

Pattern.compile("(?=\\d)|\\W").split("abc8")
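For comparison, here is a small sketch of the two JDK routes side by side. Both return a String[], so Arrays.toString is used to get the bracketed form shown in the question. The class name JdkSplitDemo and its helper methods are made up for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical demo class comparing String.split and Pattern.split. */
final class JdkSplitDemo {
    static final String REGEX = "(?=\\d)|\\W";

    /** Splits with String.split and renders the array as text. */
    static String viaString(String input) {
        return Arrays.toString(input.split(REGEX));
    }

    /** Splits with a compiled Pattern and renders the array as text. */
    static String viaPattern(String input) {
        return Arrays.toString(Pattern.compile(REGEX).split(input));
    }

    public static void main(String[] args) {
        System.out.println(viaString("abc8"));   // [abc, 8]
        System.out.println(viaPattern("abc8"));  // [abc, 8]
        System.out.println(viaPattern("abc82")); // [abc, 8, 2]
    }
}
```

Both calls keep the trailing digit, which is the behavior the question expected from Splitter.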

Source: How is Guava Splitter.onPattern(..).split() different from String.split(..)?
