Is there a way to escape a CDATA end token in xml?

不问归期 提交于 2019-11-26 11:42:55
ddaa

Clearly, this question is purely academic. Fortunately, it has a very definite answer.

You cannot escape a CDATA end sequence. Production rule 20 of the XML specification is quite clear:

[20]    CData      ::=      (Char* - (Char* ']]>' Char*))

EDIT: This product rule literally means "A CData section may contain anything you want BUT the sequence ']]>'. No exception.".

EDIT2: The same section also reads:

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "<" and "&". CDATA sections cannot nest.

In other words, it's not possible to use entity reference, markup or any other form of interpreted syntax. The only parsed text inside a CDATA section is ]]>, and it terminates the section.

Hence, it is not possible to escape ]]> within a CDATA section.

EDIT3: The same section also reads:

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

Then there may be a CDATA section anywhere character data may occur, including multiple adjacent CDATA sections inplace of a single CDATA section. That allows it to be possible to split the ]]> token and put the two parts of it in adjacent CDATA sections.

ex:

<![CDATA[Certain tokens like ]]> can be difficult and <invalid>]]> 

should be written as

<![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid>]]> 

You have to break your data into pieces to conceal the ]]>.

Here's the whole thing:

<![CDATA[]]]]><![CDATA[>]]>

The first <![CDATA[]]]]> has the ]]. The second <![CDATA[>]]> has the >.

You do not escape the ]]> but you escape the > after ]] by inserting ]]><![CDATA[ before the >, think of this just like a \ in C/Java/PHP/Perl string but only needed before a > and after a ]].

BTW,

S.Lott's answer is the same as this, just worded differently.

S. Lott's answer is right: you don't encode the end tag, you break it across multiple CDATA sections.

How to run across this problem in the real world: using an XML editor to create an XML document that will be fed into a content-management system, try to write an article about CDATA sections. Your ordinary trick of embedding code samples in a CDATA section will fail you here. You can imagine how I learned this.

But under most circumstances, you won't encounter this, and here's why: if you want to store (say) the text of an XML document as the content of an XML element, you'll probably use a DOM method, e.g.:

XmlElement elm = doc.CreateElement("foo");
elm.InnerText = "<[CDATA[[Is this a problem?]]>";

And the DOM quite reasonably escapes the < and the >, which means that you haven't inadvertently embedded a CDATA section in your document.

Oh, and this is interesting:

XmlDocument doc = new XmlDocument();

XmlElement elm = doc.CreateElement("doc");
doc.AppendChild(elm);

string data = "<![[CDATA[This is an embedded CDATA section]]>";
XmlCDataSection cdata = doc.CreateCDataSection(data);
elm.AppendChild(cdata);

This is probably an ideosyncrasy of the .NET DOM, but that doesn't throw an exception. The exception gets thrown here:

Console.Write(doc.OuterXml);

I'd guess that what's happening under the hood is that the XmlDocument is using an XmlWriter produce its output, and the XmlWriter checks for well-formedness as it writes.

simply replace ]]> with ]]]]><![CDATA[>

Here's another case in which ]]> needs to be escaped. Suppose we need to save a perfectly valid HTML document inside a CDATA block of an XML document and the HTML source happens to have it's own CDATA block. For example:

<htmlSource><![CDATA[ 
    ... html ...
    <script type="text/javascript">
        /* <![CDATA[ */
        -- some working javascript --
        /* ]]> */
    </script>
    ... html ...
]]></htmlSource>

the commented CDATA suffix needs to be changed to:

        /* ]]]]><![CDATA[> *//

since an XML parser isn't going to know how to handle javascript comment blocks

In PHP: '<![CDATA['.implode(explode(']]>', $string), ']]]]><![CDATA[>').']]>'

A cleaner way in PHP:

   function safeCData($string)
   {
      return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $string) . ']]>';
   }

Don't forget to use a multibyte-safe str_replace if required (non latin1 $string):

   function mb_str_replace($search, $replace, $subject, &$count = 0)
   {
      if (!is_array($subject))
      {
         $searches = is_array($search) ? array_values($search) : array ($search);
         $replacements = is_array($replace) ? array_values($replace) : array ($replace);
         $replacements = array_pad($replacements, count($searches), '');
         foreach ($searches as $key => $search)
         {
            $parts = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
         }
      }
      else
      {
         foreach ($subject as $key => $value)
         {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
         }
      }
      return $subject;
   }

Another solution is to replace ]]> by ]]]><![CDATA[]>.

See this structure:

<![CDATA[
   <![CDATA[
      <div>Hello World</div>
   ]]]]><![CDATA[>
]]>

For the inner CDATA tag(s) you must close with ]]]]><![CDATA[> instead of ]]>. Simple as that.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!