Java unicode escape with more than 4 hexadecimal digits

Question

A Java properties file generated by the Ant <propertyfile> task contains the unicode escape \u000151 for the Hungarian letter ő.

I expected \u0151. Is this a bug in Ant? (Ant 1.8.0, Java 1.7.0)

(According to the JLS, only a 4-digit unicode escape is considered valid...)

Answer 1:

A good place to start reading about Unicode sequences is the Javadoc for class Character: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html. I cannot say for sure that this is a bug, but it looks like one. You can also experiment with a utility that converts text into Unicode escape sequences and vice versa. An open-source library that includes such a utility, among others, is linked in this article; look for the paragraph titled "String Unicode converter".

Answer 2:

Although I found no bug reports related to the issue, it is probably a bug, judging by the official Oracle documentation: Supplementary Characters in the Java Platform

This documentation states that one supplementary character can be represented by two unicode escapes:

For cases where the character encoding used cannot represent the characters directly, the Java programming language provides a Unicode escape syntax. This syntax has not been enhanced to express supplementary characters directly. Instead, they are represented by the two consecutive Unicode escapes for the two code units in the UTF-16 representation of the character. For example, the character U+20000 is written as "\uD840\uDC00".
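The surrogate-pair rule in the quote can be checked with a few lines of Java. This is only a sketch: the class and method names (SurrogateEscapeDemo, toUnicodeEscapes) are made up for illustration, and it relies on nothing beyond Character.toChars and String.format. It emits one 4-digit escape per UTF-16 code unit, so a BMP character such as U+0151 gets a single escape and a supplementary character gets two.

```java
final class SurrogateEscapeDemo {

    /**
     * Returns the 4-digit unicode escape(s) for one code point,
     * one escape per UTF-16 code unit (hypothetical helper, not Ant's code).
     */
    static String toUnicodeEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields one char for BMP code points
        // and a high/low surrogate pair for supplementary ones.
        for (char unit : Character.toChars(codePoint)) {
            // Cast to int: %X does not accept a char argument.
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeEscapes(0x151));   // the letter from the question
        System.out.println(toUnicodeEscapes(0x20000)); // the example from the Oracle document
    }
}
```

For 0x151 this yields the single escape \u0151, and for 0x20000 it yields \uD840\uDC00, matching the example in the documentation.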

The document also describes a six-digit escape syntax for text input (that is, a syntax not supported at the Java language level):

For text input, the Java 2 SDK provides a code point input method which accepts strings of the form "\Uxxxxxx", where the uppercase "U" indicates that the escape sequence contains six hexadecimal digits, thus allowing for supplementary characters. A lowercase "u" indicates the original form of the escape sequences, "\uxxxx".

This means that the output of the Ant <propertyfile> task is incorrect: it should generate \u0151 or \U000151 instead of \u000151 (note the uppercase versus lowercase U), at least according to the documentation above.

But in practice the \Uxxxxxx syntax seems to be unsupported:

[test.properties]

key1=\u0151
key2=\u000151
key3=\U000151

[PropertiesParserTest.java]

import static org.junit.Assert.assertEquals;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;
import org.junit.Test;

public class PropertiesParserTest {
    @Test
    public void testLoad() throws IOException {
        try (InputStream input = getClass().getResourceAsStream("test.properties")) {
            Properties p = new Properties();
            p.load(input);

            // Valid unicode escape
            assertEquals("ő", p.getProperty("key1"));

            // The 6-digit escape generated by Ant is read as U+0001 followed by "51"
            assertEquals("\u0001" + "51", p.getProperty("key2"));

            // \Uxxxxxx is not supported
            assertEquals("U000151", p.getProperty("key3"));
        }
    }

    @Test
    public void testGenerate() throws IOException {
        Properties p1 = new Properties();
        p1.setProperty("key1", "ő");
        p1.setProperty("key2", "\u000151");
        // Not supported in practice: p1.setProperty("key3", "\U000151");

        File file = File.createTempFile("PropertiesParserTest_", ".properties");
        System.out.println(file);

        try (OutputStream output = new FileOutputStream(file)) {
            p1.store(output, null);
        }

        try (InputStream input = new FileInputStream(file)) {
            Properties p2 = new Properties();
            p2.load(input);

            // Valid unicode escape
            assertEquals("ő", p2.getProperty("key1"));

            // The compiler reads only four hex digits, so key2 is U+0001 followed by "51"
            assertEquals("\u0001" + "51", p2.getProperty("key2"));
        }
    }
}
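
The round trip in testGenerate does not show what Properties.store actually writes. A small sketch that captures the raw output, assuming only the standard java.util.Properties API (the class and method names StoreOutputDemo and storeToString are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class StoreOutputDemo {

    /** Stores one key/value pair and returns the raw text produced by Properties.store. */
    static String storeToString(String key, String value) {
        Properties p = new Properties();
        p.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // The OutputStream overload writes ISO-8859-1 and escapes
            // characters that fall outside the printable ASCII range.
            p.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build the letter from its code point (0x151) so the example
        // does not depend on the encoding of this source file.
        System.out.println(storeToString("key1", String.valueOf((char) 0x151)));
    }
}
```

After a timestamp comment line, the output contains key1=\u0151, i.e. the JDK itself writes the 4-digit form that the question expected from Ant.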

Source: https://stackoverflow.com/questions/37679763/java-unicode-escape-with-more-than-4-hexadecimal-digits

Tags

java

ant