Why does split in Java 8 sometimes remove empty strings at the start of the result array?

爱一瞬间的悲伤 2020-11-22 03:17

Before Java 8, when we split on an empty string like

String[] tokens = "abc".split("");

split mechanism would split in pl

3 answers
  •  一整个雨季
    2020-11-22 04:06

    The behavior of String.split (which calls Pattern.split) changes between Java 7 and Java 8.

    Documentation

    Comparing the documentation of Pattern.split in Java 7 and Java 8, we see that the following clause was added:

    When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

    The same clause is also added to String.split in Java 8, compared to Java 7.
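
    To make the clause concrete, here is a minimal sketch contrasting a positive-width match and a zero-width match at index 0. The class and the describe helper are hypothetical names used only for illustration, and the outputs in the comments assume a Java 8 or later runtime.

```java
import java.util.Arrays;

public class LeadingSubstringDemo {
    // Hypothetical helper: splits the input and renders the resulting array as text.
    static String describe(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // Positive-width match at index 0 (the leading comma):
        // an empty leading substring is kept.
        System.out.println(describe(",a,b", ",")); // [, a, b]

        // Zero-width match at index 0 (the empty regex):
        // no empty leading substring on Java 8 and later.
        System.out.println(describe("abc", ""));   // [a, b, c]
    }
}
```

    On Java 7 the second call would be expected to print [, a, b, c] instead, which is the difference the question asks about.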

    Reference implementation

    Let us compare the code of Pattern.split in the reference implementation of Java 7 and Java 8. The code is retrieved from grepcode, for versions 7u40-b43 and 8-b132.

    Java 7

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);
    
        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }
    
        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};
    
        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());
    
        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }
    

    Java 8

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);
    
        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                if (index == 0 && index == m.start() && m.start() == m.end()) {
                    // no empty leading substring included for zero-width match
                    // at the beginning of the input char sequence.
                    continue;
                }
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }
    
        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};
    
        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());
    
        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }
    

    The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.

                if (index == 0 && index == m.start() && m.start() == m.end()) {
                    // no empty leading substring included for zero-width match
                    // at the beginning of the input char sequence.
                    continue;
                }
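
    To see which match the new check skips, it can help to list the matches the loop iterates over. The sketch below is illustrative only (MatchTrace and trace are hypothetical names); it records the start and end of every match that Matcher.find reports for the empty regex on "abc".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchTrace {
    // Hypothetical helper: collects "start-end" for every match of regex in input.
    static List<String> trace(String regex, CharSequence input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // The empty regex matches before each character and at the end of the input.
        System.out.println(trace("", "abc")); // [0-0, 1-1, 2-2, 3-3]
    }
}
```

    Only the first span, 0-0, satisfies index == 0 && index == m.start() && m.start() == m.end(), so it is the only match the Java 8 code skips. The later zero-width matches still split the input as before.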
    

    Maintaining compatibility

    Following behavior in Java 8 and above

    To make split behave consistently across versions, matching the behavior in Java 8:

    1. If your regex can match a zero-length string, add (?!\A) at the end of the regex and, if necessary, wrap the original regex in a non-capturing group (?:...).
    2. If your regex can't match a zero-length string, you don't need to do anything.
    3. If you don't know whether the regex can match a zero-length string, do both actions in step 1.

    (?!\A) asserts that the match does not end at the beginning of the string. A match that ends there can only be an empty match at the beginning of the string, so it is rejected.
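
    A minimal sketch of step 1 follows. SplitCompat, withoutLeadingEmptyMatch and the two-argument split are hypothetical helper names, not part of the JDK. The sketch also assumes the original regex stays valid once wrapped in (?:...).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitCompat {
    // Hypothetical helper: wraps the regex in a non-capturing group and appends
    // (?!\A), so that an empty match at the very start of the input is rejected.
    static String withoutLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    // Hypothetical helper: splits using the wrapped regex.
    static String[] split(String input, String regex) {
        return Pattern.compile(withoutLeadingEmptyMatch(regex)).split(input);
    }

    public static void main(String[] args) {
        // The zero-width lookahead would match at index 0; the wrapper rejects that match.
        System.out.println(Arrays.toString(split("HelloWorld", "(?=[A-Z])"))); // [Hello, World]

        // A positive-width match at index 0 is unaffected and still yields a leading "".
        System.out.println(Arrays.toString(split(",a,b", ","))); // [, a, b]
    }
}
```

    Because the wrapped regex never produces an empty match at index 0, the result no longer depends on whether the runtime skips such a match.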

    Following behavior in Java 7 and prior

    There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split with calls to your own custom implementation.
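
    As a rough sketch of what such a custom implementation could look like, the helper below follows the logic of the Java 7 listing above. LegacySplit and its split method are hypothetical names. This is an illustration under that assumption, not a drop-in replacement verified against every edge case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper modeled on the Java 7 Pattern.split listing: it keeps the
    // empty leading substring even for a zero-width match at index 0.
    static String[] split(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = pattern.matcher(input);

        // Add the segment before each match found
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                matchList.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }

        // No match moved the index forward: return the input unchanged
        if (index == 0) {
            return new String[] {input.toString()};
        }

        // Add the remaining segment
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.subSequence(index, input.length()).toString());
        }

        // With limit == 0, drop trailing empty strings
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(Pattern.compile(""), "abc", 0))); // [, a, b, c]
    }
}
```

    Call sites would then use LegacySplit.split(Pattern.compile(regex), input, 0) in place of input.split(regex).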
