I recently harnessed the power of a look-ahead regular expression to split a String:
"abc8".split("(?=\\d)|\\W")
Printed to the console (for example with Arrays.toString), this expression returns:
[abc, 8]
Very pleased with this result, I wanted to transfer this to Guava for further development, which looked like this:
Splitter.onPattern("(?=\\d)|\\W").split("abc8")
To my surprise the output changed to:
[abc]
Why?
You found a bug!
    Splitter s = Splitter.onPattern("(?=\\d)|\\W");
    System.out.println(s.split("abc82")); // [abc, 8]
    System.out.println(s.split("abc8"));  // [abc]
This is the method that Splitter uses to actually split Strings (Splitter.SplittingIterator::computeNext):
    @Override
    protected String computeNext() {
      /*
       * The returned string will be from the end of the last match to the
       * beginning of the next one. nextStart is the start position of the
       * returned substring, while offset is the place to start looking for a
       * separator.
       */
      int nextStart = offset;
      while (offset != -1) {
        int start = nextStart;
        int end;

        int separatorPosition = separatorStart(offset);
        if (separatorPosition == -1) {
          end = toSplit.length();
          offset = -1;
        } else {
          end = separatorPosition;
          offset = separatorEnd(separatorPosition);
        }
        if (offset == nextStart) {
          /*
           * This occurs when some pattern has an empty match, even if it
           * doesn't match the empty string -- for example, if it requires
           * lookahead or the like. The offset must be increased to look for
           * separators beyond this point, without changing the start position
           * of the next returned substring -- so nextStart stays the same.
           */
          offset++;
          if (offset >= toSplit.length()) {
            offset = -1;
          }
          continue;
        }

        while (start < end && trimmer.matches(toSplit.charAt(start))) {
          start++;
        }
        while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
          end--;
        }

        if (omitEmptyStrings && start == end) {
          // Don't include the (unused) separator in next split string.
          nextStart = offset;
          continue;
        }

        if (limit == 1) {
          // The limit has been reached, return the rest of the string as the
          // final item. This is tested after empty string removal so that
          // empty strings do not count towards the limit.
          end = toSplit.length();
          offset = -1;
          // Since we may have changed the end, we need to trim it again.
          while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
            end--;
          }
        } else {
          limit--;
        }

        return toSplit.subSequence(start, end).toString();
      }
      return endOfData();
    }
The area of interest is:
    if (offset == nextStart) {
      /*
       * This occurs when some pattern has an empty match, even if it
       * doesn't match the empty string -- for example, if it requires
       * lookahead or the like. The offset must be increased to look for
       * separators beyond this point, without changing the start position
       * of the next returned substring -- so nextStart stays the same.
       */
      offset++;
      if (offset >= toSplit.length()) {
        offset = -1;
      }
      continue;
    }
This logic works well unless the empty match occurs immediately before the last character of the String. In that case offset++ makes offset equal to toSplit.length(), the >= check sets it to -1, the loop ends, and the final character is never returned. The check should use > instead of >=:
    if (offset == nextStart) {
      /*
       * This occurs when some pattern has an empty match, even if it
       * doesn't match the empty string -- for example, if it requires
       * lookahead or the like. The offset must be increased to look for
       * separators beyond this point, without changing the start position
       * of the next returned substring -- so nextStart stays the same.
       */
      offset++;
      if (offset > toSplit.length()) {
        offset = -1;
      }
      continue;
    }
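To see the effect of that one-character change without Guava on the classpath, here is a minimal sketch that re-creates only the core of the loop with java.util.regex. It is not Guava's code: trimming, limit, and omitEmptyStrings are left out, and the class name EmptyMatchSplitSketch, the split helper, and its fixed flag are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchSplitSketch {

    // Simplified re-creation of the computeNext() loop (assumption: no
    // trimming, no limit, no omitEmptyStrings). The hypothetical `fixed`
    // flag selects the boundary check used after an empty match:
    //   fixed == false -> offset >= length (original behaviour)
    //   fixed == true  -> offset >  length (proposed fix)
    static List<String> split(Pattern separator, String toSplit, boolean fixed) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = separator.matcher(toSplit);
        int offset = 0;     // where to start looking for the next separator
        int nextStart = 0;  // where the next returned piece begins
        while (offset != -1) {
            int end;
            if (matcher.find(offset)) {
                end = matcher.start();
                offset = matcher.end();
            } else {
                end = toSplit.length();
                offset = -1;
            }
            if (offset == nextStart) {
                // Empty match at the start of the current piece:
                // keep nextStart, but search one position further right.
                offset++;
                boolean pastEnd = fixed
                        ? offset > toSplit.length()
                        : offset >= toSplit.length();
                if (pastEnd) {
                    offset = -1;
                }
                continue;
            }
            parts.add(toSplit.substring(nextStart, end));
            nextStart = offset;
        }
        return parts;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?=\\d)|\\W");
        System.out.println(split(p, "abc8", false));  // [abc]
        System.out.println(split(p, "abc8", true));   // [abc, 8]
        System.out.println(split(p, "abc82", false)); // [abc, 8]
        System.out.println(split(p, "abc82", true));  // [abc, 8, 2]
    }
}
```

With the >= check, the search gives up as soon as offset reaches the length of the input, so the piece that starts at nextStart is never emitted. With >, one more search runs at the end of the input, finds no separator, and the remaining piece is returned.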
The Guava Splitter seems to have a bug when a pattern produces a zero-width (empty) match. If you create a Matcher and print what it matches:
    Pattern pattern = Pattern.compile("(?=\\d)|\\W");
    Matcher matcher = pattern.matcher("abc8");
    while (matcher.find()) {
        System.out.println(matcher.start() + "," + matcher.end());
    }
You get the output 3,3, which is a zero-width match at index 3, immediately before the 8. Splitter splits there, returns abc, and then drops the remaining 8.
As a workaround, you can use Pattern#split(CharSequence), which gives the expected output:
Pattern.compile("(?=\\d)|\\W").split("abc8")
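As a runnable version of that workaround, the sketch below wraps the call in a small helper and prints the result with Arrays.toString. The class name JdkSplitDemo and the helper splitOnDigitBoundary are hypothetical names chosen for this example; only Pattern.compile and Pattern#split come from the JDK.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class JdkSplitDemo {

    // Same pattern as in the question: split before any digit
    // (zero-width lookahead) or on any non-word character.
    private static final Pattern SEPARATOR = Pattern.compile("(?=\\d)|\\W");

    // Hypothetical helper: delegates to the JDK's Pattern#split(CharSequence).
    static String[] splitOnDigitBoundary(String input) {
        return SEPARATOR.split(input);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnDigitBoundary("abc8")));  // [abc, 8]
        System.out.println(Arrays.toString(splitOnDigitBoundary("abc82"))); // [abc, 8, 2]
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regular expression on every call, which String#split would otherwise do for a pattern like this one.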