Best way to “fix” malformed html for use in an xsl transform

Posted by 拟墨画扇 on 2019-12-13 04:37:51

Question


I have an input XML document that contains malformed HTML which has been XML-encoded, i.e. the XML document itself is technically valid.

I am applying an XSL transform to the XML which outputs well-formed XHTML5, but the output contains the malformed HTML.

Examples of the bad html:

  • html, head and body tags in html fragments.
  • font tags
  • mismatched quotes
  • unclosed tags
  • extra close tags with no matching open
  • close tags in the wrong order (e.g. <b><u>text</b></u>)

Now in my situation I actually don't care that the html is mal-formed - I only care that my closing tags match my opening tags, regardless of what goes in between.

So my question is - what is the best way to either

  1. Clean up the html sufficiently that it does not affect other tags (preferably from within the transform itself)
  2. or somehow mark a closetag so that html5 compatible browsers recognise it as matching a particular open tag regardless of whatever nasty markup may be in between.

For 2, I have no ideas at all. For 1, I have a couple of ideas, such as calling an external tool like Tidy or using a .NET SGML parser.

.NET xsl scripts (msxsl:script) are acceptable, if undesirable.

Example source:

<xml>
  &lt;b&gt;&lt;u&gt;bad html&lt;/b&gt;&lt;/u&gt;
</xml>

Example output:

<div id="MyDiv">
  <b><u>bad html</b></u>
</div> <!-- this /div absolutely must match the opening div regardless of what might be in the bad html -->

What other approaches are available?

C#, VS2012, xslt 1.0 only


Answer 1:


Is using a third-party library acceptable? The HTML Agility Pack (available on NuGet) might go part of the way to fixing your invalid HTML, and (according to the website) it also supports XSLT.
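As a rough sketch of that idea (not something tested against the asker's data), the HTML Agility Pack can load a malformed fragment and write it back out as XML. The class and method names `HtmlCleaner` and `ToWellFormedXml` are made up for illustration, and the exact shape of the repaired markup depends on the library's own fix-up rules.

```csharp
using System.IO;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of the original answer.
public static class HtmlCleaner
{
    // Parses a (possibly malformed) HTML fragment and serializes it as XML.
    public static string ToWellFormedXml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true; // ask the library to write XML rather than HTML
        doc.LoadHtml(html);

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

The returned string could then be loaded into an XmlDocument before or during the transform.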




Answer 2:


I went with an SGML parsing library that converts the HTML to valid XML.

I chose MindTouch's library: https://github.com/MindTouch/SGMLReader

Once it was compiled and added to the GAC, I could use this XSL:

<msxsl:script language="C#" implements-prefix="myns">
  <msxsl:assembly name="SgmlReaderDll, Version=1.8.11.0, Culture=neutral, PublicKeyToken=46b2db9ca481831b"/>
    <![CDATA[
    // Parses a (possibly malformed) HTML string with SgmlReader and returns it as an XML node set.
    public XPathNodeIterator SGMLStringToXml(string strSGML)
    {
      Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
      sgmlReader.DocType = "HTML";
      sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
      sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower; // element names become lower case
      sgmlReader.InputStream = new System.IO.StringReader(strSGML);

      // create document
      XmlDocument doc = new XmlDocument();
      doc.PreserveWhitespace = true;
      doc.XmlResolver = null; // do not resolve external resources such as DTDs
      doc.Load(sgmlReader);
      return doc.CreateNavigator().Select("/*");
    }

    // Returns the current working directory (handy when debugging path problems).
    public string CurDir()
    {
      return (new System.IO.DirectoryInfo(".")).FullName;
    }
  ]]>

</msxsl:script>
<!-- Copy every node; the trailing space keeps empty elements from being written as self-closing tags -->
<xsl:template match="node()" mode="PreventSelfClosingTags">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()" mode="PreventSelfClosingTags"/>
    <xsl:text> </xsl:text>
  </xsl:copy>
</xsl:template>
<!-- Attributes are copied unchanged -->
<xsl:template match="@*" mode="PreventSelfClosingTags">
  <xsl:copy/>
</xsl:template>

and use it like so:

<xsl:apply-templates select="myns:SGMLStringToXml(.)/body/*" mode="PreventSelfClosingTags"/>

N.B. You have to run the transform manually with an XslCompiledTransform instance. The asp:xml control doesn't like the DLL reference.
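Running the transform manually could look roughly like the sketch below. `XslCompiledTransform` leaves `msxsl:script` blocks disabled unless an `XsltSettings` instance with scripting enabled is passed to `Load`. The `TransformRunner` class and the file paths are placeholders for illustration, not part of the original setup.

```csharp
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not part of the original answer.
public static class TransformRunner
{
    // msxsl:script is off by default, so it has to be enabled explicitly.
    public static XsltSettings CreateSettings()
    {
        // arguments: enableDocumentFunction, enableScript
        return new XsltSettings(false, true);
    }

    // Loads the stylesheet with scripting enabled and transforms inputPath into outputPath.
    public static void Run(string stylesheetPath, string inputPath, string outputPath)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        transform.Load(stylesheetPath, CreateSettings(), new XmlUrlResolver());
        transform.Transform(inputPath, outputPath);
    }
}
```

A call such as TransformRunner.Run("transform.xsl", "input.xml", "output.html") would then apply the stylesheet; those file names are assumptions.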



Source: https://stackoverflow.com/questions/18872689/best-way-to-fix-malformed-html-for-use-in-an-xsl-transform
