Finding a node (or close to it) using XPath in non-well-formed HTML

囚心锁ツ 2020-12-06 14:22

I'm using XPath to locate a node (or something close to it) in a template that has non-well-formed HTML about 10 levels deep. (No, I didn't write this HTML...but I've bee

2 answers
  • 2020-12-06 15:01

    XPath does not work directly with HTML. How XPath interacts with your HTML is dictated by whatever software or library parses the HTML into a document tree. This may help direct your search appropriately.

  • 2020-12-06 15:13

    XPath expressions cannot be evaluated against a non-well-formed XML document, which is exactly the case described.
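To illustrate the point, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module (my choice for illustration, not something the question mentions). The helper name `is_well_formed` and the sample markup are hypothetical. An XML parser rejects the unclosed `<br>` before any XPath query can run.

```python
import xml.etree.ElementTree as ET

# Typical "tag soup": <p> and <br> are never closed.
SNIPPET = "<div><p>Unclosed paragraph<br></div>"


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # No tree is built, so there is nothing to evaluate XPath against.
        return False


print(is_well_formed(SNIPPET))
print(is_well_formed("<div><p>Closed<br/></p></div>"))
```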

    It is possible to do this in two chained steps: first convert the HTML to well-formed XML, then apply the XPath expression to the result.

    Therefore, the question could be more precisely stated as "How to convert HTML to XML so that XPath expressions can be evaluated against it".
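The two steps can be sketched with nothing but Python's standard library. This is a rough illustration under stated assumptions, not a replacement for the dedicated tools: `SoupTreeBuilder`, `html_to_tree`, the `root` wrapper element, and the sample markup are all hypothetical names. The repair logic only handles void elements, unclosed tags, and stray end tags; it does not infer implied end tags (for example, a second `<p>` closing the first). `ElementTree` also supports only a limited subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupTreeBuilder(HTMLParser):
    """Step 1: build a well-formed tree from lenient HTML parser events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]        # currently open elements

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> become empty elements.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # end tags with no matching open element are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(html):
    builder = SoupTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Unquoted attributes, an unclosed <br>, <p> and <b>: not well-formed XML.
BROKEN = ('<div id=main><p class=note>First<br>line</p>'
          '<p>Second <b>bold</div><span>tail</span>')

root = html_to_tree(BROKEN)

# Step 2: evaluate XPath-style expressions against the repaired tree.
print(root.find(".//div[@id='main']/p/b").text)
print(len(root.findall(".//p[@class='note']")))
```

Because the repaired tree serializes to well-formed XML (`ET.tostring(root)`), it could also be handed to a full XPath engine if the subset supported by `ElementTree` is not enough.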

    Here are three good tools:

    1. TagSoup is an open-source, Java- and SAX-based tool developed by John Cowan. It is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML. Taggle is a commercial C++ port of TagSoup.

    2. SgmlReader is a tool developed by Microsoft's Chris Lovett. SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided, which outputs the well-formed XML result. A zip file containing the standalone executable and the full source code is available for download: SgmlReader.zip

    3. The pure XSLT 2.0 Parser of HTML, written by David Carlisle. Reading its code would be a great learning exercise for every one of us.

    From the description:

    "d:htmlparse(string)
    d:htmlparse(string,namespace,html-mode)

    The one-argument form is equivalent to d:htmlparse(string,'http://www.w3.org/1999/xhtml',true()).

    Parses the string as HTML and/or XML using some inbuilt heuristics to control implied opening and closing of elements.

    It doesn't have full knowledge of HTML DTD but does have full list of empty elements and full list of entity definitions. HTML entities, and decimal and hex character references are all accepted. Note html-entities are recognised even if html-mode=false().

    Element names are lowercased (if html-mode is true()) and placed into the namespace specified by the namespace parameter (which may be "" to denote no-namespace) unless the input has explicit namespace declarations, in which case these will be honoured.

    Attribute names are lowercased if html-mode=true()"

    Read a more detailed description here.
