Performant parsing of HTML pages with Node.js and XPath

情书的邮戳 2020-12-07 21:10

I'm into some web scraping with Node.js. I'd like to use XPath as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way t

6 answers

南笙 (original poster)

2020-12-07 21:38
libxmljs is currently the fastest implementation (there is something like a benchmark), since it is only a set of bindings to the libxml2 C library, which supports XPath 1.0 queries:
```
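const libxmljs = require("libxmljs");
// Sample well-formed XML so the snippet runs on its own
const xml = "<root><child><grandchild>content</grandchild></child></root>";
const xmlDoc = libxmljs.parseXml(xml);
// XPath query: get() returns the first matching node
const gchild = xmlDoc.get("//grandchild");
console.log(gchild.text());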
```
However, you need to sanitize your HTML first and convert it to well-formed XML. For that you could either use the HTML Tidy command-line utility (tidy -q -asxml input.html) or, if you want to keep it Node-only, something like xmlserializer should do the trick.
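As a rough sketch of the HTML Tidy route, the command above could be driven from Node.js with the built-in child_process module. The helper names tidyArgs and htmlFileToXml are made up for illustration, and the code assumes a tidy binary is available on the PATH; treat it as one possible wiring rather than a tested recipe.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: builds the arguments for the tidy call quoted above.
function tidyArgs(inputPath) {
  return ["-q", "-asxml", inputPath];
}

// Hypothetical helper: runs tidy on an HTML file and returns its output as a string.
// Assumes the `tidy` executable is installed and on the PATH.
function htmlFileToXml(inputPath) {
  const result = spawnSync("tidy", tidyArgs(inputPath), { encoding: "utf8" });
  if (result.error) {
    throw result.error; // e.g. tidy could not be started
  }
  // tidy exits with 0 on success and 1 when it only reported warnings;
  // anything else is treated as a failure here.
  if (result.status !== 0 && result.status !== 1) {
    throw new Error(`tidy failed (exit ${result.status}): ${result.stderr}`);
  }
  return result.stdout;
}
```

The string returned by htmlFileToXml("input.html") could then be handed to libxmljs.parseXml in place of the xml variable in the snippet above.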