Why does \R behave differently in regular expressions between Java 8 and Java 9?

Posted by 末鹿安然 on 2019-12-31 10:57:03

Question


The following code compiles in both Java 8 & 9, but behaves differently.

class Simple {
    static String sample = "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    public static void main(String[] args) {
        // \R\R asks for two consecutive linebreak matches
        String[] chunks = sample.split("\\R\\R");
        for (String chunk: chunks) {
            System.out.println("Chunk : "+chunk);
        }
    }
}

When I run it with Java 8 it returns:

Chunk : 
En un lugar
de la Mancha
de cuyo nombre
no quiero acordarme

But when I run it with Java 9 the output is different:

Chunk : 
En un lugar
Chunk : de la Mancha
de cuyo nombre
Chunk : no quiero acordarme

Why?


Answer 1:


The Java documentation is out of conformance with the Unicode Standard. The Javadoc misstates what \R is supposed to match. It reads:

\R Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

That Java documentation is buggy. In its section on R1.6 Line Breaks, Unicode Technical Standard #18 on Regular Expressions clearly states:

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.

 (?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])

In other words, it can only match a two code-point CR+LF (carriage return + linefeed) sequence or else a single code-point from that set provided that it is not just a carriage return alone that is then followed by a linefeed. That’s because it is not allowed to back up. CRLF must be atomic for \R to function properly.

So Java 9 no longer conforms to what R1.6 strongly recommends. Moreover, it is now doing something that it was supposed to NOT do, and did not do, in Java 8.
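
The difference between the two definitions can be checked without relying on how any particular JDK implements \R. What follows is only a minimal sketch: it assumes the UTS #18 expression can be transcribed into Java regex syntax as shown, and the class, constant, and method names are made up for illustration. It asks each expression, written twice in a row, whether it matches a lone CRLF.

```java
import java.util.regex.Pattern;

class LineBreakDefinitions {
    // Javadoc wording: CRLF, or any single linebreak character.
    // Being a plain alternation, it may fall back to matching CR alone.
    static final String JAVADOC_STYLE =
        "(?:\\r\\n|[\\n\\x0B\\f\\r\\x85\\u2028\\u2029])";

    // UTS #18 wording (transcribed, an assumption of this sketch):
    // the negative lookahead stops a lone CR from matching when LF follows.
    static final String UTS18_STYLE =
        "(?:\\r\\n|(?!\\r\\n)[\\n-\\r\\x85\\u2028\\u2029])";

    // True if the whole input is exactly two consecutive matches of the expression.
    static boolean isTwoLinebreaks(String linebreakRegex, String input) {
        return Pattern.compile(linebreakRegex + linebreakRegex)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(isTwoLinebreaks(JAVADOC_STYLE, "\r\n")); // true
        System.out.println(isTwoLinebreaks(UTS18_STYLE, "\r\n"));   // false
        System.out.println(isTwoLinebreaks(UTS18_STYLE, "\n\n"));   // true
    }
}
```

Under the Javadoc-style alternation, the CR and the LF of a single CRLF can each count as a linebreak, which is what makes "\\R\\R" split the sample in Java 9. Under the UTS #18-style expression, a CRLF is only ever one linebreak.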

Looks like it's time for me to give Sherman (read: Xueming Shen) a holler again. I've worked with him before on these nitty-gritty matters of formal conformance.




Answer 2:


It was a bug in Java 8 and it got fixed: JDK-8176029 : "Linebreak matcher is not equivalent to the pattern as stated in javadoc".

Also see: Java-8 regex negative lookbehind with `\R`
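
Both outputs from the question can be reproduced on a single JDK by writing the linebreak pattern out by hand instead of using \R. This is a minimal sketch, not the JDK's implementation: it assumes an atomic group (?>...) is a fair stand-in for the Java 8 behaviour in which a matched CRLF is never split, and the class, constant, and method names are hypothetical.

```java
import java.util.regex.Pattern;

class LinebreakSplitDemo {
    static final String SAMPLE =
        "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    // Alternation as written in the Javadoc; the engine may retry with CR alone.
    static final String BACKTRACKING =
        "(?:\\r\\n|[\\n\\x0B\\f\\r\\x85\\u2028\\u2029])";

    // Same alternatives inside an atomic group: once CRLF matches, it stays whole.
    static final String ATOMIC =
        "(?>\\r\\n|[\\n\\x0B\\f\\r\\x85\\u2028\\u2029])";

    // Splits the text on two consecutive matches of the given linebreak expression.
    static String[] splitOnDouble(String linebreakRegex, String text) {
        return Pattern.compile(linebreakRegex + linebreakRegex).split(text);
    }

    public static void main(String[] args) {
        System.out.println(splitOnDouble(BACKTRACKING, SAMPLE).length); // 3
        System.out.println(splitOnDouble(ATOMIC, SAMPLE).length);       // 1
    }
}
```

The backtracking variant yields the three chunks shown for Java 9, because each CRLF is consumed as CR followed by LF. The atomic variant finds no two consecutive linebreaks and leaves the string whole, as in the Java 8 run.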



Source: https://stackoverflow.com/questions/47871962/why-does-r-behave-differently-in-regular-expressions-between-java-8-and-java-9
