Handling surrogate pairs while parsing XML using libxml2

坚强是说给别人听的谎言 submitted on 2020-01-14 06:22:12

Question


I am trying to parse XML using libxml2. However, the input sometimes contains character references to surrogate code points, which are outside the range allowed by the XML specification (see http://www.w3.org/TR/REC-xml/#NT-Char).
Because of this, libxml2 cannot parse the document and I get an error. Can somebody tell me how to handle surrogate pairs while parsing XML using libxml2?

An example of the XML I want to parse:

<?xml version="1.0" encoding="UTF-8"?>
<message><body>  &#xD83D;&#xD83D;</body></message>

Answer 1:


Note that 0xD83D is a high surrogate. A surrogate pair consists of a high surrogate followed by a low surrogate; two high surrogates next to each other are not a "surrogate pair", they are nonsense.

Also note that the correct way to represent a non-BMP character in XML is as a single character reference for the combined character, for example &#x120AB;. Splitting a non-BMP character into two surrogates is needed in some character encodings, but it is not needed (or allowed) in XML character references. Character references in XML represent Unicode code-points, not the numeric values specific to a particular character encoding.
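The arithmetic behind "the combined character" can be sketched as follows. This is a minimal illustration in Python, not part of libxml2; the function names `combine_surrogates` and `char_reference` are made up for this example, and the pair 0xD83D 0xDE00 is just a sample input.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def char_reference(code_point: int) -> str:
    """Format a code point as a hexadecimal XML character reference."""
    return f"&#x{code_point:X};"


# The pair D83D DE00 becomes the single reference &#x1F600;
print(char_reference(combine_surrogates(0xD83D, 0xDE00)))
```

Passing two high surrogates, as in the question's example, raises `ValueError` here, which matches the point above: that input does not encode any character.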

If you can't fix the program that created this bad XML, then the best approach would be to repair it before parsing, using a script (e.g. in Perl) that looks for the invalid pairs of character references and replaces them with the correct XML representation.
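A repair script along these lines might look like the sketch below. It is written in Python rather than Perl, and it rests on several assumptions that are not in the answer: the references are hexadecimal with exactly four digits, unpaired surrogates are replaced with U+FFFD (the replacement character), and the regex runs over the whole text, so it would also rewrite references that appear literally inside CDATA sections. The name `repair_surrogate_refs` is hypothetical, and `xml.etree.ElementTree` is used only to check the result; the repaired text could equally be handed to libxml2.

```python
import re
import xml.etree.ElementTree as ET

# Hex character references for high (D800-DBFF) and low (DC00-DFFF) surrogates.
HIGH = r"&#[xX]([dD][89abAB][0-9a-fA-F]{2});"
LOW = r"&#[xX]([dD][c-fC-F][0-9a-fA-F]{2});"
PAIR = re.compile(HIGH + LOW)
# Any remaining surrogate reference (D800-DFFF) that has no valid partner.
LONE = re.compile(r"&#[xX][dD][89a-fA-F][0-9a-fA-F]{2};")


def repair_surrogate_refs(xml_text: str) -> str:
    """Rewrite surrogate character references so the text is well-formed XML."""

    def merge(match: re.Match) -> str:
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return f"&#x{code_point:X};"

    merged = PAIR.sub(merge, xml_text)
    # Assumption: an unpaired surrogate carries no recoverable character,
    # so it is replaced with U+FFFD instead of being dropped silently.
    return LONE.sub("&#xFFFD;", merged)


broken = "<message><body>&#xD83D;&#xDE00; &#xD83D;&#xD83D;</body></message>"
repaired = repair_surrogate_refs(broken)
print(repaired)

# The repaired document now parses without error.
root = ET.fromstring(repaired)
```

Whether to substitute U+FFFD, drop the lone surrogates, or reject the document is a policy choice that depends on how much the downstream consumer cares about the lost data.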




Answer 2:


If the XML standard does not allow these characters, the parser will report an error. One way to include this text in XML is to place it inside a CDATA section. CDATA sections are used to escape blocks of text containing characters that would otherwise be recognized as markup.

<message><body>  <![CDATA[&#xD83D;&#xD83D;&#xD83D;]]></body></message>

The above XML will be parsed without error.
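One thing worth checking with this approach is what the parser actually returns for the element content. The sketch below uses Python's `xml.etree.ElementTree` (which is built on expat, not libxml2) purely as a convenient way to look at the parsed text; since the handling of CDATA is defined by the XML specification, libxml2 should be expected to behave the same way, but that is an assumption here rather than something tested.

```python
import xml.etree.ElementTree as ET

doc = "<message><body>  <![CDATA[&#xD83D;&#xD83D;&#xD83D;]]></body></message>"
body_text = ET.fromstring(doc).find("body").text

# Character references are not expanded inside CDATA, so the element text
# is the literal string of ampersands, digits and semicolons.
print(repr(body_text))
```

So the document is accepted, but the application receives the reference syntax as plain text and still has to decide what to do with it.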



Source: https://stackoverflow.com/questions/23239432/handling-surrogate-pairs-while-parsing-xml-using-libxml2
