Performant parsing of HTML pages with Node.js and XPath

情书的邮戳 2020-12-07 21:10

I'm into some web scraping with Node.js. I'd like to use XPath as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way t

6 answers
  •  南笙 (original poster)
    2020-12-07 21:38

    Libxmljs is currently the fastest implementation (going by something like a benchmark), since it is only a set of bindings to the libxml2 C library, which supports XPath 1.0 queries:

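    var libxmljs = require("libxmljs");
    // Sample input; replace with your own well-formed XML string
    var xml = '<root><child><grandchild>text</grandchild></child></root>';
    var xmlDoc = libxmljs.parseXml(xml);
    // XPath query: get() returns the first matching node
    var gchild = xmlDoc.get('//grandchild');
    console.log(gchild.text());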
    

    However, you need to sanitize your HTML first and convert it to well-formed XML. For that you could either use the HTML Tidy command-line utility (tidy -q -asxml input.html) or, if you want to keep it Node-only, something like xmlserializer should do the trick.
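
    The Tidy step could be driven from Node itself. The following is only a minimal sketch, assuming a tidy executable is installed and on the PATH; the helper names tidyArgs and htmlFileToXhtml are made up for illustration and are not part of any library:

```javascript
// Sketch: convert an HTML file to XHTML by shelling out to HTML Tidy.
// Assumes a `tidy` executable is on PATH; helper names are hypothetical.
const { spawnSync } = require("child_process");

// Build the argument list used in the answer: tidy -q -asxml input.html
function tidyArgs(inputPath) {
  return ["-q", "-asxml", inputPath];
}

// Run tidy synchronously and return its XHTML output as a string.
function htmlFileToXhtml(inputPath) {
  const result = spawnSync("tidy", tidyArgs(inputPath), { encoding: "utf8" });
  if (result.error) {
    // For example ENOENT when tidy is not installed
    throw result.error;
  }
  // tidy exits with 0 on success, 1 if there were warnings, 2 on errors,
  // so a status of 1 still comes with usable output.
  if (result.status !== 0 && result.status !== 1) {
    throw new Error("tidy failed (exit " + result.status + "): " + result.stderr);
  }
  return result.stdout;
}
```

    The string returned by htmlFileToXhtml("input.html") could then be passed to libxmljs.parseXml() as in the snippet above. Keep in mind that Tidy's XHTML output declares the XHTML namespace on the root element, so an unprefixed query such as //div may match nothing unless the XPath query is made namespace-aware.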
