Question
I have strings containing a wide range of characters that need to be written to Google BigQuery, which requires strictly valid UTF-8 strings. When I try to write strings that contain a wide variety of emoji, I get an error:
java.lang.IllegalArgumentException: Unpaired surrogate at index 3373
at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLengthGeneral(Utf8.java:93)
at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLength(Utf8.java:67)
at org.apache.beam.sdk.coders.StringUtf8Coder.getEncodedElementByteSize(StringUtf8Coder.java:145)
...
My current workaround is simply to strip all surrogate characters from the strings:
    private static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
However, this results in a string like
🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒
being reduced to just four emoji:
⚔⌨⛳⛏
Is there a proper way to convert these characters into UTF8 without loss, and without using unpaired surrogates?
(Apologies, my understanding of character sets in general is not great)
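For context on why exactly four emoji survive: those four are code points in the Basic Multilingual Plane and fit in a single Java char, while the others are supplementary code points that Java stores as a high/low surrogate pair, so stripping every surrogate deletes them entirely. A gentler variant of the workaround, sketched below, keeps valid pairs and drops only surrogates that have no partner. The class and method names are hypothetical, and this only hides damaged input rather than repairing whatever produced it.

```java
final class SurrogateSanitizer {
    /** Keeps valid surrogate pairs intact and drops only unpaired surrogates. */
    static String removeUnpairedSurrogates(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // Valid pair: copy both halves so the emoji survives.
                sb.append(c).append(input.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate: skip it.
                i++;
            } else {
                // Ordinary BMP character.
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pineapple = "\uD83C\uDF4D";                // U+1F34D as a valid surrogate pair
        String damaged = pineapple + "\uD83E" + "\u2694"; // lone high surrogate, then U+2694
        String cleaned = removeUnpairedSurrogates(damaged);
        System.out.println(cleaned.equals(pineapple + "\u2694")); // prints true
    }
}
```

A string that is already well-formed passes through unchanged, so this can be applied defensively just before the write.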
Answer 1:
I found the problem. We are using org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 to convert HTML entities in strings to their unescaped forms. This seems to mangle some non-Latin characters. For example, passing the string "Italien 🇮🇹" through this method converts it into "Italien 🇮?" (the last character gets mangled).
Passing "🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒" through this method converts it to "🍍?🥔?🍵?🍵?🏺?🎧?🎚?🎙?⚔⌨🎳?⛳🏓?🌏?🏝?". A minimal reproduction, followed by its output:
    import org.apache.commons.lang3.StringEscapeUtils;

    public class CharacterTest {
        public static void main(String[] args) {
            String good = "🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒";
            String bad = StringEscapeUtils.unescapeHtml4(good);
            System.out.println(good + "->" + bad);
        }
    }
🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒->🍍?🥔?🍵?🍵?🏺?🎧?🎚?🎙?⚔⌨🎳?⛳🏓?🌏?🏝?
Now to find an alternative HTML entity decoder...
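To confirm which processing step introduces the damage, it may help to check a string for unpaired surrogates before and after each transformation, mirroring the "Unpaired surrogate at index ..." message in the stack trace. The helper below is a hypothetical sketch using only java.lang.Character; it is not part of Beam or Commons Lang.

```java
final class SurrogateCheck {
    /** Returns the index of the first unpaired surrogate, or -1 if the string is well-formed UTF-16. */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return i; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "Italien " followed by the two regional-indicator code points of the flag.
        String flag = "Italien \uD83C\uDDEE\uD83C\uDDF9";
        System.out.println(firstUnpairedSurrogate(flag)); // prints -1
        // Simulate damage by cutting the last surrogate pair in half.
        String damaged = flag.substring(0, flag.length() - 1);
        System.out.println(firstUnpairedSurrogate(damaged)); // prints 10
    }
}
```

Calling this on the value before and after a suspect call such as the HTML unescaping step shows whether that call is where the string stops being well-formed.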
Answer 2:
"Is there a proper way to convert these characters into UTF8"
Probably. If you just send the string, it will be converted to UTF-8; that is how Java's encoders work.
If it doesn't and you are sending binary, you can just convert directly:
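    // Requires: import java.nio.charset.StandardCharsets;
    private static byte[] removeSurrogates(String query) {
        // The Charset overload avoids the checked UnsupportedEncodingException of getBytes("UTF-8").
        return query.getBytes(StandardCharsets.UTF_8);
    }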
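One caveat worth illustrating: String.getBytes does not fail on an unpaired surrogate; it substitutes the charset's replacement byte, which is '?' for UTF-8, so the damage is hidden rather than fixed. The sketch below contrasts that with a strict encoder that reports malformed input instead. The class and method names are hypothetical; the java.nio.charset calls are standard JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    /** Encodes to UTF-8, throwing instead of silently substituting '?' for malformed input. */
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String lone = "a\uD83Cb"; // high surrogate with no low surrogate after it
        // Lenient path: the lone surrogate is replaced by '?'.
        byte[] lenient = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(lenient, StandardCharsets.UTF_8)); // prints a?b
        try {
            encodeStrict(lone);
        } catch (CharacterCodingException e) {
            // Strict path: the malformed input is reported as an exception.
            System.out.println("rejected: " + e);
        }
    }
}
```

A well-formed emoji such as U+1F34D encodes to four UTF-8 bytes on either path; only strings that are already damaged behave differently.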
Answer 3:
Let me get out of Java for a second to show that BigQuery can deal with emojis:
CREATE TABLE `public_dump.emoji_test`
AS
SELECT "🍍🥔🍵🍵🏺🎧🎚🎙⚔⌨🎳⛳🏓🌏🏝🏝🕘🕒🕢🕠🎵🔇🎸🗓🔏⛏🔒" emojis
Then to test for presence:
SELECT COUNT(*)
FROM `fh-bigquery.public_dump.emoji_test`
WHERE emojis LIKE '%🎳%'
1
Doing this with Python is straightforward:
Inserting new data isn't a problem either:
I'm sorry I don't know how to fix this in Java, but I hope it helps to see this proof that the BigQuery API handles emoji gracefully.
Source: https://stackoverflow.com/questions/55699140/how-can-i-encode-characters-like-emojis-as-utf8-without-unpaired-surrogate-chara