How can I encode characters like emojis as UTF8 without unpaired surrogate characters?

Question

I have strings with a variety of characters that need to be written to Google BigQuery, which requires strict UTF8 strings. When trying to write strings with a wide variety of emoji input, I get an error:

java.lang.IllegalArgumentException: Unpaired surrogate at index 3373
    at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLengthGeneral(Utf8.java:93)
    at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLength(Utf8.java:67)
    at org.apache.beam.sdk.coders.StringUtf8Coder.getEncodedElementByteSize(StringUtf8Coder.java:145)
...

I have a workaround for this problem, to simply strip all surrogate characters from Strings:

    private static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

However, this results in a string like

🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒

being reduced to just four emojis:

⚔⌨⛳⛏

Is there a proper way to convert these characters into UTF8 without loss, and without using unpaired surrogates?

(Apologies, my understanding of character sets in general is not great)
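
For context on why exactly those four survive: Java strings are UTF-16, and any code point above U+FFFF (which includes most emoji) is stored as a high/low surrogate pair, so a filter that drops every surrogate also drops every such emoji. The four survivors sit in the Basic Multilingual Plane and occupy a single char each. A minimal sketch of the difference, where the class name `SurrogateDemo` is purely illustrative:

```java
final class SurrogateDemo {

    /** Number of UTF-16 chars a code point occupies in a Java String (1 or 2). */
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String swords = "\u2694";          // U+2694, one of the four survivors (BMP)
        String pineapple = "\uD83C\uDF4D"; // U+1F34D, stored as a surrogate pair

        System.out.println(swords.length());    // 1 char, no surrogates involved
        System.out.println(pineapple.length()); // 2 chars: high + low surrogate
        // Still a single code point, which is what UTF-8 encoding operates on.
        System.out.println(pineapple.codePointCount(0, pineapple.length()));
    }
}
```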

Answer 1:

I found the problem. We are using org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 to convert HTML entities in strings to their non-encoded forms. This seems to mangle some non-Latin characters. For example, passing the string "Italien 🇮🇹" through this method converts it into "Italien 🇮?" (the last character gets mangled).

Passing "🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒" through this method converts it to "🍍?🥔?🍵?🍵?🏺?🎧?🎚?🎙?⚔⌨🎳?⛳🏓?🌏?🏝?"

import org.apache.commons.lang3.StringEscapeUtils;

public class CharacterTest {
    public static void main(String[] args) {
        String good = "🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒";
        String bad = StringEscapeUtils.unescapeHtml4(good);
        System.out.println(good + "->" + bad);
    }
}

🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒->🍍?🥔?🍵?🍵?🏺?🎧?🎚?🎙?⚔⌨🎳?⛳🏓?🌏?🏝?

Now to find an alternative HTML entity decoder...
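
While tracking down which step damages a string, it can help to check for unpaired surrogates directly, and to strip only the unpaired ones instead of every surrogate. The following is a rough sketch, not a tested fix for the pipeline above: the class and method names are hypothetical, and the "mangled" sample assumes the decoder left a high surrogate without its low partner, which may not match the exact damage it causes.

```java
final class SurrogateCheck {

    /** Returns the index of the first unpaired surrogate in s, or -1 if there is none. */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low surrogate
                } else {
                    return i; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    /** Drops only unpaired surrogates; valid pairs (and therefore intact emoji) are kept. */
    static String stripUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                sb.append(c).append(s.charAt(i + 1)); // keep the whole pair
                i++;
            } else if (!Character.isSurrogate(c)) {
                sb.append(c); // ordinary BMP character
            }
            // anything else is an unpaired surrogate and is skipped
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String intact = "Italien \uD83C\uDDEE\uD83C\uDDF9"; // "Italien" plus the two-symbol flag
        String mangled = "Italien \uD83C\uDDEE\uD83C";      // assumed damage: trailing low surrogate lost

        System.out.println(firstUnpairedSurrogate(intact));  // -1
        System.out.println(firstUnpairedSurrogate(mangled)); // 10
        System.out.println(stripUnpairedSurrogates(mangled).length()); // 10: text plus one intact pair
    }
}
```

Unlike the original removeSurrogates workaround, this keeps every intact emoji and loses only the characters that were already broken.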

Answer 2:

"Is there a proper way to convert these characters into UTF8"

Probably. If you just send the string, it will be converted to UTF-8; that's how Java's encoders work.

If it doesn't and you are sending binary, you can just convert directly:

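import java.nio.charset.StandardCharsets;

private static byte[] removeSurrogates(String query) {
    // The Charset overload avoids the checked UnsupportedEncodingException of getBytes("UTF-8").
    return query.getBytes(StandardCharsets.UTF_8);
}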
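
One caveat with String.getBytes: it does not fail on an unpaired surrogate, it silently substitutes a '?' byte, so the damage is hidden and not repaired. If a hard failure is preferable, a CharsetEncoder configured with CodingErrorAction.REPORT throws instead. A minimal sketch, with the hypothetical helper name `encodeStrict`:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {

    /** Encodes s as UTF-8 and throws on malformed input instead of substituting '?'. */
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out); // copy only the bytes that were actually written
        return out;
    }

    public static void main(String[] args) throws Exception {
        String valid = "\uD83C\uDF4D"; // pineapple emoji, a valid surrogate pair
        String broken = "a\uD83C";     // 'a' followed by a lone high surrogate

        System.out.println(encodeStrict(valid).length); // 4 bytes of UTF-8

        // Lenient path: the lone surrogate becomes a single '?' byte.
        System.out.println(broken.getBytes(StandardCharsets.UTF_8).length); // 2

        try {
            encodeStrict(broken);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```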

Answer 3:

Let me get out of Java for a second to show that BigQuery can deal with emojis:

CREATE TABLE `public_dump.emoji_test`
AS
SELECT "🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒" emojis

Then to test for presence:

SELECT COUNT(*)
FROM `fh-bigquery.public_dump.emoji_test`
WHERE emojis LIKE '%🎳%'

1

Doing this with Python is straightforward, and inserting new data isn't a problem either.

I'm sorry I don't know how to fix this with Java, but I hope it helps to see this proof that BigQuery's API can handle emojis with grace.

Source: https://stackoverflow.com/questions/55699140/how-can-i-encode-characters-like-emojis-as-utf8-without-unpaired-surrogate-chara

Tags

java

google-bigquery

google-cloud-dataflow

emoji