Why can some ASCII characters not be expressed in the form '\uXXXX' in Java source code?

被撕碎了的回忆 2020-12-13 12:19

I stumbled over this (again) today:

class Test {
    char ok = '\n';
    char okAsWell = '\u000B';
    char error = '\u000A';
}

It

5 Answers
  • 2020-12-13 12:58

    Because the compiler treats them the same as unescaped text.

    This is valid code:

     class \u00C9 {}
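
As a small sketch of what "treated the same as unescaped text" means in practice, escapes can also appear inside keywords and identifiers, not only in literals. The class and method names below are made up for illustration.

```java
class EscapedIdentifiers {
    // The escape in the return type stands for the letter i, so the type is int.
    static \u0069nt sum() {
        // The escape below stands for the letter x, so this declares a variable named x.
        int \u0078 = 3;
        return x + 4;
    }

    public static void main(String[] args) {
        System.out.println(sum()); // prints 7
    }
}
```

The compiler never sees the escapes as such; by the time it tokenizes the file, they have already become the plain letters.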
    
  • 2020-12-13 13:01

    Unicode escape sequences like \u000a are replaced by the actual characters they represent before the Java compiler does anything else with the source code. So your program effectively ends up as

    char ch = '
    ';
    

    So the \u000a in your source code is replaced internally by a linefeed character. Note that this happens before the compiler actually reads and interprets your source code.
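
Because the replacement happens this early, it applies inside comments as well. The following is a minimal sketch (the class and method names are hypothetical) in which a \u000a escape ends a // comment, so the rest of that line is compiled as code.

```java
class CommentSurprise {
    static int value() {
        int v = 1;
        // The escape on this line ends the comment early: \u000a v = 2;
        return v;
    }

    public static void main(String[] args) {
        System.out.println(value()); // prints 2, not 1
    }
}
```

After translation the comment line is split in two, and the second half is the ordinary statement that assigns 2 to v.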

    Referring to the Java Language Specification:

    It is a compile-time error for a line terminator (§3.4) to appear after the opening ' and before the closing '.

    And, as we all know by heart, \n (LF) is a line terminator. Quoting:

     LineTerminator:
        the ASCII LF character, also known as "newline"
        the ASCII CR character, also known as "return"
        the ASCII CR character followed by the ASCII LF character
    

    Other symbols that could cause problems are \, ' and " for example.
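
A hedged sketch of the usual workarounds: ordinary escape sequences, octal escapes, and numeric casts are all handled after the Unicode translation step, so they are safe ways to write these characters. The class and constant names are assumptions chosen for this example.

```java
class ProblemCharacters {
    // Ordinary escape sequences are interpreted by the lexer, after Unicode
    // escapes have already been translated, so they cannot break the literal.
    static final char LINE_FEED = '\n';
    static final char CARRIAGE_RETURN = '\r';
    static final char APOSTROPHE = '\'';
    static final char BACKSLASH = '\\';
    static final char DOUBLE_QUOTE = '"';

    // Two more spellings of the line feed that avoid a Unicode escape.
    static final char LINE_FEED_OCTAL = '\12';
    static final char LINE_FEED_CAST = (char) 0x0A;

    public static void main(String[] args) {
        System.out.println((int) LINE_FEED);       // 10
        System.out.println((int) CARRIAGE_RETURN); // 13
        System.out.println((int) APOSTROPHE);      // 39
        System.out.println((int) BACKSLASH);       // 92
        System.out.println((int) DOUBLE_QUOTE);    // 34
        System.out.println(LINE_FEED == LINE_FEED_OCTAL && LINE_FEED == LINE_FEED_CAST); // true
    }
}
```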

  • 2020-12-13 13:04

    Unicode escapes are replaced by the characters they denote, so your line is replaced by the compiler with:

    char error = '
    ';
    

    which is not a valid Java statement.

    This is dictated by the Language Specification:

    A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.

    This can lead to surprising stuff, for example, this is a valid Java program (it contains hidden unicode characters) - courtesy of Peter Lawrey:

    public static void main(String[] args) {
        for (char c‮h = 0; c‮h < Character.MAX_VALUE; c‮h++) {
            if (Character.isJavaIdentifierPart(c‮h) && !Character.isJavaIdentifierStart(c‮h)) {
                System.out.printf("%04x <%s>%n", (int) c‮h, "" + c‮h);
            }
        }
    }
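
On the point in the quoted specification text that supplementary characters need two consecutive Unicode escapes, here is a minimal sketch. It assumes U+1F600 as the example character; the class and constant names are made up.

```java
class SupplementaryDemo {
    // U+1F600 lies outside the Basic Multilingual Plane, so it is written
    // as two escapes that together form a UTF-16 surrogate pair.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    public static void main(String[] args) {
        // Two UTF-16 code units...
        System.out.println(GRINNING_FACE.length()); // 2
        // ...but only one code point.
        System.out.println(GRINNING_FACE.codePointCount(0, GRINNING_FACE.length())); // 1
        System.out.println(Integer.toHexString(GRINNING_FACE.codePointAt(0))); // 1f600
    }
}
```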
    
  • 2020-12-13 13:05

    It is described in JLS §3.3, Unicode Escapes: http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html. javac first finds \uXXXX sequences in the .java file and replaces them with the actual characters, then compiles the result. In the case of

    char error = '\u000A';
    

    \u000A will be replaced with the newline character (code 10), and the actual text will be

    char error = '
    ';
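
The replacement step can be imitated with a few lines of ordinary Java. This is a simplified sketch of the rule in JLS §3.3, not javac's actual implementation; the class UnicodeEscapeSketch and its methods are hypothetical names used only here.

```java
class UnicodeEscapeSketch {
    // Simplified model of the first translation step: a backslash preceded by
    // an even number of backslashes, followed by one or more 'u' characters
    // and four hex digits, is replaced by the UTF-16 code unit it names.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0; // contiguous raw backslashes just before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && precedingBackslashes % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // the specification allows more than one 'u'
                }
                if (j > i + 1) {
                    int value = fourHexDigits(raw, j);
                    if (value < 0) {
                        throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                    }
                    out.append((char) value);
                    i = j + 4;
                    precedingBackslashes = 0; // the produced character starts no further escape
                    continue;
                }
            }
            precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Returns the value of the four hex digits starting at 'start', or -1 if malformed.
    static int fourHexDigits(String s, int start) {
        if (start + 4 > s.length()) {
            return -1;
        }
        int value = 0;
        for (int k = start; k < start + 4; k++) {
            char h = s.charAt(k);
            int digit;
            if (h >= '0' && h <= '9') digit = h - '0';
            else if (h >= 'a' && h <= 'f') digit = h - 'a' + 10;
            else if (h >= 'A' && h <= 'F') digit = h - 'A' + 10;
            else return -1;
            value = value * 16 + digit;
        }
        return value;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape intact inside this string literal.
        String raw = "char error = '\\u000A';";
        System.out.println(translate(raw)); // the output spans two lines
    }
}
```

Running main prints the declaration split across two lines, which is the text the rest of the compiler actually gets to see.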
    
  • 2020-12-13 13:07

    I think the reason is that \uXXXX sequences are expanded when the code is being parsed, see JLS §3.2. Lexical Translations.
