Emoji character sequence �� breaks old XML process

萝らか妹 提交于 2020-07-22 05:22:04

问题


I have an old Java application that processes XML from a third-party data feed.

The data feed allows user-input, and it is now suddenly containing emojis such as �� (👇). I'm actually surprised it took this long for this problem to appear (emojis have been around for a few years now).

The app blows up in javax.xml.parsers.DocumentBuilder.parse(InputStream):

org.xml.sax.SAXParseException; lineNumber: 105; columnNumber: 3039; Character reference "&#
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)

Is there a quick, localized fix that I can apply without having to redesign and rearchitect the whole application? Also, would prefer to avoid a regex search/replace hack since that can introduce other subtle problems.


回答1:


�� is a single character encoded as a surrogate pair (two surrogates). A character reference in XML cannot represent a (high or low) surrogate: these are not legal characters. The character reference should represent the Unicode codepoint of the Emoji as a whole, 👇.

The third party is sending you invalid XML, and you should reject it as you would reject any other faulty goods from a supplier.



来源:https://stackoverflow.com/questions/53038978/emoji-character-sequence-5535756391-breaks-old-xml-process

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!