HTML inside XML: should I use CDATA or encode the HTML?

Question

I am using XML to share HTML content. AFAIK, I could embed the HTML either by:

Encoding it: I don't know if it is completely safe to use, and I would have to decode it again.
Using CDATA sections: I could still have problems if the content contains the closing marker "]]>" and certain hexadecimal characters, I believe. On the other hand, the XML parser would extract the content transparently for me.

Which option should I choose?

UPDATE: The XML will be created in Java and passed as a string to a .NET web service, where it will be parsed back. Therefore I need to be able to export the XML as a string and load it using "doc.LoadXml(xmlString);"

Answer 1:

The two options are almost exactly the same. Here are your two choices:

<html>This is &lt;b&gt;bold&lt;/b&gt;</html>

<html><![CDATA[This is <b>bold</b>]]></html>

In both cases, you have to check your string for special characters to be escaped. Lots of people pretend that CDATA strings don't need any escaping, but as you point out, you have to make sure that "]]>" doesn't slip in unescaped.

In both cases, the XML processor will return your string to you decoded.
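
As a quick check of that claim, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree (the choice of language and parser is an assumption; the question itself uses Java and .NET). It parses both forms shown above and compares the text the parser hands back:

```python
import xml.etree.ElementTree as ET

# The two ways of embedding the same HTML fragment.
escaped = "<html>This is &lt;b&gt;bold&lt;/b&gt;</html>"
cdata = "<html><![CDATA[This is <b>bold</b>]]></html>"

# The parser decodes entities and unwraps CDATA on its own.
escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

print(escaped_text == cdata_text)
print(cdata_text)
```

Either way the consumer receives the original HTML string; the difference is only in how the document looks on the wire.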

Answer 2:

CDATA is easier to read by eye, while encoded content can safely contain end-of-CDATA markers, but you don't have to care. Just use an XML library and stop worrying about it. Then all you have to say is "Put this text inside this element" and the library will either encode it or wrap it in CDATA markers.
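
A minimal sketch of "put this text inside this element", assuming Python's standard xml.etree.ElementTree as the library (the payload string is made up for illustration). This particular library escapes the text rather than emitting CDATA, and the round trip returns the original string:

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML payload, including an ampersand and a stray "]]>".
payload = 'Click <a href="x?a=1&b=2">here</a> ]]> done'

element = ET.Element("html")
element.text = payload  # no manual escaping

# The serializer escapes &, < and > in text content for us.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)

# Parsing the serialized document gives the payload back unchanged.
parsed_back = ET.fromstring(serialized).text
print(parsed_back == payload)
```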

Answer 3:

CDATA for simplicity.

Answer 4:

If you use CDATA, then you must decode it correctly (textContent, value and innerHTML are methods that will NOT return the proper data).

Let us say that you use an XML structure similar to this:

<response>
    <command method="setcontent">
        <fieldname>flagOK</fieldname>
        <content>479</content>
    </command>
    <command method="setcontent">
        <fieldname>htmlOutput</fieldname>
        <content>
            <![CDATA[
            <tr><td>2013/12/05 02:00 - 2013/12/07 01:59 </td></tr><tr><td width="90">Rastreado</td><td width="60">Placa</td><td width="100">Data hora</td><td width="60" align="right">Km/h</td><td width="40">Direção</td><td width="40">Azimute</td><td>Mapa</td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 13:55</td><td align='right'>113</td><td align='right'>NE</td><td align='right'>40</td><td><a href="http://maps.google.com/maps?q=-22.6766,-50.2218&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.6766,-50.2218</a></td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 13:56</td><td align='right'>112</td><td align='right'>NE</td><td align='right'>23</td><td><a href="http://maps.google.com/maps?q=-22.6638,-50.2106&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.6638,-50.2106</a></td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 18:00</td><td align='right'>111</td><td align='right'>SE</td><td align='right'>118</td><td><a href="http://maps.google.com/maps?q=-22.7242,-50.2352&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.7242,-50.2352</a></td></tr>
            ]]>
        </content>
    </command>
</response>

In JavaScript, you then decode it by loading the XML (with jQuery, for example) into a variable like xmlDoc below and getting the nodeValue of the second occurrence (item(1)) of the content tag:

xmlDoc.getElementsByTagName("content").item(1).childNodes[0].nodeValue

or (both notations are equivalent)

xmlDoc.getElementsByTagName("content")[1].childNodes[0].nodeValue
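
For comparison, here is a rough equivalent in Python with the standard xml.dom.minidom module (an assumption; the answer itself uses JavaScript). The XML is a shortened, hypothetical version of the structure above. With this parser, when the CDATA section sits on its own line, the first child of the element is a whitespace text node, so the sketch selects the CDATA node by type instead of by position:

```python
from xml.dom import Node, minidom

# Shortened version of the response above (table content trimmed).
XML = """<response>
    <command method="setcontent">
        <fieldname>flagOK</fieldname>
        <content>479</content>
    </command>
    <command method="setcontent">
        <fieldname>htmlOutput</fieldname>
        <content>
            <![CDATA[<tr><td>Silverado</td></tr>]]>
        </content>
    </command>
</response>"""

doc = minidom.parseString(XML)
content = doc.getElementsByTagName("content")[1]

# The indentation before the CDATA section is a text node of its own.
first_child = content.childNodes[0]

# Pick the CDATA section by node type rather than by index.
cdata_nodes = [n for n in content.childNodes
               if n.nodeType == Node.CDATA_SECTION_NODE]
html = cdata_nodes[0].data
print(html)
```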

Answer 5:

I don't know what XML builder you're using, but PHP (actually libxml) knows how to handle ]]> inside CDATA sections, and so should every other XML framework. So, I'd use a CDATA section.
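
If you ever have to build the CDATA section by hand, one common way to cope with "]]>" is to split it across two adjacent CDATA sections. The sketch below is in Python with a hypothetical helper named wrap_in_cdata; it illustrates the general technique and is not a description of what libxml does internally:

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    # "]]>" becomes "]]" + end of section + start of new section + ">".
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# Hypothetical payload: inline script that happens to contain "]]>".
payload = "<script>if (a[b[0]]>1) run();</script>"
document = "<html>" + wrap_in_cdata(payload) + "</html>"

# The parser joins the adjacent sections back into one string.
recovered = ET.fromstring(document).text
print(recovered == payload)
```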

Answer 6:

It makes sense to wrap HTML in CDATA. The HTML text will probably constitute one single value in the XML.

So not wrapping it in CDATA will cause all XML parsers to read it as part of the XML document. While it is easy to work around this problem when using the XML, why the extra headache?

If you want to actually parse the HTML into a DOM, then it's better to read the HTML text and set up a parser to read that text separately.

Hope that came out the way I intended it to.

Answer 7:

Personally, I hate CDATA segments, so I'd use encoding instead. Of course, if you add XML to XML to XML, then this would result in encoding over encoding over encoding and thus some very unreadable results. Why do I hate CDATA segments? I wish I knew. Personal preference, mostly. I just don't like getting used to adding "forbidden characters" inside a special segment where they would suddenly be allowed again. It just confuses me when I see XML markup within a CDATA segment and it's not part of the XML surrounding it. At least with encoding I will see that it's encoded.

Good XML libraries will handle both encoding and CDATA segments transparently. It's just my eyes that get hurt.

Answer 8:

Encoding it will work fine and is reliable. You can encode encoded sections etc. without any difficulty.

Decoding will be done automatically by whatever XML parser is used to handle your encoded HTML.
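
To illustrate encoding an already-encoded section, here is a small sketch in Python using xml.sax.saxutils (the language and the sample string are assumptions). Each level of nesting adds one layer of escaping, and each decode removes exactly one layer:

```python
from xml.sax.saxutils import escape, unescape

html = "This is <b>bold</b> & more"

# One layer: what goes inside the inner XML document.
once = escape(html)
# Two layers: the inner document embedded in an outer one.
twice = escape(once)

print(once)
print(twice)

# Decoding peels off one layer at a time.
print(unescape(unescape(twice)) == html)
```

In practice the parser does the unescape step for you; the explicit calls here only make the layering visible.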

Answer 9:

I think the answer depends on what you are planning to do with the HTML content, and also on what type of HTML content you plan to support.

Especially when it comes to included JavaScript, encoding often results in problems. CDATA definitely helps you there.

If you plan to use only small snippets (i.e. a paragraph) and have a way to preprocess or filter them (because you don't want JavaScript or fancy things anyway), you will probably be better off with encoding, or with actually putting the HTML directly as a subtree in the XML. You can then also post-process the HTML (e.g. filter style or onclick attributes). But this is definitely more work.

Answer 10:

You can use a combination of both. For example, if you want to pass <h1>....</h1> in an XML node, you have to use a CDATA section to pass it. The contents inside <h1>...</h1> must be encoded to HTML entities, e.g. &lt; for <. Encoding the text between the tags solves the problem of ]]> being interpreted as the end of the section, because it gets converted to ]]&gt; and HTML tags themselves do not contain ]]>.

You can do this only if the HTML is generated by yourself.

Answer 11:

If your HTML is well-formed, then just embed the HTML tags without escaping or wrapping in CDATA. If at all possible, it helps to keep your content in XML. It gives you more flexibility for transforming and manipulating the document.

You could set a namespace for the HTML, so that you could disambiguate your HTML tags from the other XML wrapping it.

Escaped text means that the entire HTML block will be one big text node. Wrapping in CDATA tells the XML parser not to parse that section. It may be "easier", but limits your abilities downrange and should only be employed when appropriate; not just because it is more convenient. Escaped markup is considered harmful.
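
A minimal sketch of the namespaced approach, assuming Python's xml.etree.ElementTree and a made-up wrapper document (the message and body element names and the h prefix are hypothetical). Because the HTML is part of the tree, individual HTML elements can be queried directly:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Well-formed HTML embedded as a real subtree, in its own namespace.
document = (
    '<message xmlns:h="http://www.w3.org/1999/xhtml">'
    "<body><h:p>This is <h:b>bold</h:b></h:p></body>"
    "</message>"
)

root = ET.fromstring(document)

# The HTML is addressable like any other XML: find the <b> element.
bold = root.find(f".//{{{XHTML_NS}}}b")
print(bold.text)

# With escaping or CDATA, <body> would instead hold one opaque string.
body_has_child_elements = len(root.find("body")) > 0
print(body_has_child_elements)
```

This only works when the HTML is well-formed XML, as the answer notes.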

Source: https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html

Tags

xml

cdata

html-encode