Java - read UTF-8 file with a single emoji symbol

Question

I have a file that contains a single Unicode symbol.
The file is encoded in UTF-8.
The symbol is U+1F60A, which UTF-8 represents as 4 bytes.
https://www.fileformat.info/info/unicode/char/1f60a/index.htm

F0 9F 98 8A

When I read the file I get two symbols/chars.

The program below prints

?
2
?
?
55357
56842
======================================
&#55357;&#56842;
16
&
======================================
?
2
?
======================================

Is this normal... or a bug? Or am I misusing something?
How do I get that single emoji symbol in my code?

EDIT: And also... how do I escape it for XML?

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test008 {

    public static void main(String[] args) throws Exception{
        BufferedReader in = new BufferedReader(
                   new InputStreamReader(
                              new FileInputStream("D:\\DATA\\test1.txt"), "UTF8"));
        
        String s = "";
        while ((s = in.readLine()) != null) {
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(s.charAt(0));
            System.out.println(s.charAt(1));
            
            System.out.println((int)(s.charAt(0)));
            System.out.println((int)(s.charAt(1)));
            
            String z = org.apache.commons.lang.StringEscapeUtils.escapeXml(s);
            String z3 = org.apache.commons.lang3.StringEscapeUtils.escapeXml(s);
            
            System.out.println("======================================");
            System.out.println(z);
            System.out.println(z.length());
            System.out.println(z.charAt(0));
            
            System.out.println("======================================");
            System.out.println(z3);
            System.out.println(z3.length());
            System.out.println(z3.charAt(0));
            
            System.out.println("======================================");

        }

        in.close();
    }

}

Answer 1:

Yes, this is normal. Java strings are UTF-16, and this Unicode symbol lies outside the Basic Multilingual Plane, so it takes 2 UTF-16 chars (each char is 2 bytes).

int codePoint = s.codePointAt(0); // Your code point.
System.out.printf("U+%04X, chars: %d%n", codePoint, Character.charCount(codePoint));

U+1F60A, chars: 2

Update after the comments

Java, using a Stream:

public static String escapeToAsciiHTML(String s) {
    StringBuilder sb = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 128) {
            sb.append((char) cp);
        } else{
            sb.append("&#").append(cp).append(";");
        }
    });
    return sb.toString();
}
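
The method above only replaces non-ASCII code points, so it leaves `&`, `<` and `>` untouched. For the XML part of the question, a minimal sketch of a fuller escaper might look like the following. The class name `XmlAsciiEscaper` and the method `escape` are hypothetical, not part of any library, and the sketch assumes the input contains no control characters that XML 1.0 forbids.

```java
class XmlAsciiEscaper {
    // Hypothetical helper: escapes the five XML special characters and writes
    // every non-ASCII code point as a decimal numeric character reference.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            switch (cp) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp < 128) {
                        sb.append((char) cp);   // plain ASCII is copied as-is
                    } else {
                        // cp is the full code point, so a surrogate pair
                        // becomes one reference instead of two broken ones
                        sb.append("&#").append(cp).append(';');
                    }
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\uD83D\uDE0A" is U+1F60A written as its surrogate pair
        System.out.println(escape("a < b & \uD83D\uDE0A")); // a &lt; b &amp; &#128522;
    }
}
```

Because the loop walks code points rather than chars, the emoji comes out as the single reference `&#128522;` rather than the pair `&#55357;&#56842;` shown in the question.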

Answer 2:

StringEscapeUtils is broken. Don't use it. Try NumericEntityEscaper.

Or, better yet, since Apache Commons libraries tend to have bad APIs** and to be broken*** anyway, use Guava*'s XmlEscapers.

Java is Unicode, yes, but 'char' is a lie. 'char' does not represent characters; it represents a single, unsigned 16-bit number. The actual method to get a character out of, say, a java.lang.String object isn't charAt, which is a misnomer; it's codePointAt, and friends.

This (char being a fakeout) normally doesn't matter; most actual characters fit in the 16-bit char type. But when they don't, this matters, and that emoji doesn't fit. In the Unicode model used by Java and the char type, you then get 2 char values (together representing a single Unicode character). This pair is called a 'surrogate pair'.
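
To make the surrogate pair concrete, here is a small sketch that combines the two values printed in the question (55357 and 56842) back into one code point, both by hand and with the standard `Character` methods. The class name `SurrogatePairDemo` and the helper `combine` are made up for illustration.

```java
class SurrogatePairDemo {
    // Combines a high and a low surrogate by hand, using the UTF-16 formula.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = (char) 55357; // 0xD83D, the first value printed in the question
        char low = (char) 56842;  // 0xDE0A, the second value

        System.out.println(Character.isSurrogatePair(high, low));    // true
        System.out.println(Integer.toHexString(combine(high, low))); // 1f60a

        // The JDK does the same arithmetic for you.
        System.out.println(combine(high, low) == Character.toCodePoint(high, low)); // true

        // And the reverse direction: one code point to its two chars.
        char[] pair = Character.toChars(0x1F60A);
        System.out.println((int) pair[0] + " " + (int) pair[1]);     // 55357 56842
    }
}
```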

Note that the right methods tend to work in int (a Unicode code point needs up to 21 bits, so it does not fit in a 16-bit char).
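
As for getting that single emoji as one unit in code, a minimal sketch using only `java.lang.String` methods could look like this. The class `FirstSymbol` and its method are hypothetical names, and the sketch assumes a non-empty string (`offsetByCodePoints` throws on an empty one).

```java
class FirstSymbol {
    // Returns the first Unicode code point of s as a String.
    // The result is one char long for BMP characters and two chars long
    // for supplementary characters such as emoji. Assumes s is non-empty.
    static String firstSymbol(String s) {
        int end = s.offsetByCodePoints(0, 1); // index just past the first code point
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE0Aabc"; // U+1F60A followed by "abc"
        String first = firstSymbol(s);

        System.out.println(first.length());                  // 2
        System.out.println(first.codePointAt(0) == 0x1F60A); // true
        System.out.println(s.length());                      // 5
        System.out.println(s.codePointCount(0, s.length())); // 4
    }
}
```

Keeping the symbol as a String (or as an int code point) avoids ever splitting the pair, which is what happens with charAt(0) and charAt(1) in the question.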

*) Guava has its own issues: by being aggressively not backwards compatible with itself, it tends to lead to dependency hell. It's a pick-your-poison kind of deal, unfortunately.

**) Utils-anything is usually a sign of bad API design; 'util' is almost meaningless as a term and usually implies you've broken the object-oriented model. The right model is of course to have an object representing the process of translating data in one form (say, a raw string) to another (say, a string that can be dumped straight into an XML file, properly escaped) - and such a thing would thus be called an 'escaper', and would live perhaps in a package named 'escapers' or 'text'. Later editions of the Apache libraries, as well as Guava, fortunately 'fixed' this.

***) As this very example shows, these APIs often don't do what you want them to. Note that Apache Commons is open source; if you want these APIs to be better, they accept pull requests :)

Source: https://stackoverflow.com/questions/63133697/java-read-utf-8-file-with-a-single-emoji-symbol

Tags

java

unicode

encoding

java-8