Handling Surrogate pairs while parsing xml using libxml2

Question

I am trying to parse XML using libxml2. However, the input sometimes contains character references to surrogate code points, which are outside the range allowed by http://www.w3.org/TR/REC-xml/#NT-Char
Because of this, libxml2 cannot parse the document and reports an error. How can I handle surrogate pairs while parsing XML with libxml2?

An example of the XML I want to parse:

<?xml version="1.0" encoding="UTF-8"?>
<message><body>  &#xD83D;&#xD83D;</body></message>

Answer 1:

Note that U+D83D is a high surrogate. A surrogate pair consists of a high surrogate followed by a low surrogate; having two high surrogates next to each other is not a "surrogate pair", it is nonsense.

Also note that the correct way to represent a non-BMP character in XML is as a single character reference for the combined code point, for example &#x120AB; for the character 𒂫. Splitting a non-BMP character into two surrogates is needed in some character encodings (such as UTF-16), but it is neither needed nor allowed in XML character references. Character references in XML represent Unicode code points, not the numeric values specific to a particular character encoding.
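The arithmetic that turns a surrogate pair into the single code point XML expects can be sketched as follows. This is a minimal illustration in Python; the function names combine_surrogates and char_reference are made up for this example and are not part of libxml2.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair."""
    # A valid pair is a high surrogate (D800-DBFF) followed by a low one (DC00-DFFF).
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def char_reference(code_point):
    """Format a code point as a hexadecimal XML character reference."""
    return f"&#x{code_point:X};"


# The pair D83D DE00 encodes U+1F600, so the XML form is one reference.
print(char_reference(combine_surrogates(0xD83D, 0xDE00)))  # &#x1F600;
```

Calling combine_surrogates(0xD83D, 0xD83D), the sequence from the question, raises ValueError, because two high surrogates do not encode any character.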

If you can't fix the program that created this bad XML, then the best approach would be to repair it with a script (e.g. in Perl) that looks for pairs of invalid surrogate character references and replaces each pair with the correct XML representation.
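A repair script along these lines might look like the sketch below, written in Python rather than Perl. It is only a rough illustration under several assumptions: the name repair_surrogate_refs is hypothetical, only hexadecimal references are handled, and replacing unpaired surrogates with &#xFFFD; (the Unicode replacement character) is a policy chosen for this example, not something the answer prescribes.

```python
import re

# Hexadecimal character references in the surrogate range (leading zeros allowed).
HIGH = r"&#x0*([Dd][89ABab][0-9A-Fa-f]{2});"   # high surrogate, D800-DBFF
LOW = r"&#x0*([Dd][C-Fc-f][0-9A-Fa-f]{2});"    # low surrogate, DC00-DFFF
PAIR_RE = re.compile(HIGH + LOW)
LONE_RE = re.compile(r"&#x0*[Dd][89A-Fa-f][0-9A-Fa-f]{2};")


def _join(match):
    """Replace a high+low reference pair with one reference to the real code point."""
    hi = int(match.group(1), 16)
    lo = int(match.group(2), 16)
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    return f"&#x{cp:X};"


def repair_surrogate_refs(xml_text, lone_replacement="&#xFFFD;"):
    """Rewrite surrogate character references so an XML parser can accept them."""
    fixed = PAIR_RE.sub(_join, xml_text)
    # Assumption: any surrogate left over is unpaired and has no valid meaning.
    return LONE_RE.sub(lone_replacement, fixed)


print(repair_surrogate_refs("<body>&#xD83D;&#xDE00;</body>"))
# <body>&#x1F600;</body>
```

The repaired text can then be handed to the parser as usual. Because the substitution works on raw text, it would also rewrite matching sequences inside CDATA sections or comments, which may or may not be what you want.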

Answer 2:

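If the XML standard does not allow these characters, the parser will report an error. One way to get this text past the parser is to place it inside a CDATA section. CDATA sections are used to escape blocks of text containing characters that would otherwise be recognized as markup. Note, however, that character references are not expanded inside CDATA, so the application receives the literal text &#xD83D; rather than any character.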

<message><body>  <![CDATA[&#xD83D;&#xD83D;&#xD83D;]]></body></message>

The above XML will be parsed without error.
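What the application actually receives from the CDATA version can be checked with a short experiment. The sketch below uses Python's standard xml.etree.ElementTree module (built on expat, not libxml2) purely as an assumption for illustration; since the CDATA rules come from the XML specification itself, a libxml2-based parser should behave the same way.

```python
import xml.etree.ElementTree as ET

doc = "<message><body>  <![CDATA[&#xD83D;&#xD83D;&#xD83D;]]></body></message>"

# Parsing succeeds because the CDATA content is treated as plain text.
body_text = ET.fromstring(doc).find("body").text

# The references were not expanded: the text is the literal ampersand sequences.
print(repr(body_text))  # '  &#xD83D;&#xD83D;&#xD83D;'
```

So CDATA avoids the parse error, but the consumer still has to decode or repair the surrogate references itself.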

Source: https://stackoverflow.com/questions/23239432/handling-surrogate-pairs-while-parsing-xml-using-libxml2

Tags

xml

parsing

xml-parsing

libxml2