Performant parsing of HTML pages with Node.js and XPath

Tags: 前端 (front-end) · open · 6 answers · 2118 views
Asked by 情书的邮戳 on 2020-12-07 21:10

I'm into some web scraping with Node.js. I'd like to use XPath, as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this efficiently.

6 Answers
  •  离开以前
    2020-12-07 21:30

    You can do this in several steps.

    1. Parse the HTML with parse5. The downside is that the result is not a real DOM, but it is fast enough and W3C-compliant.
    2. Serialize it to XHTML with xmlserializer, which accepts parse5's DOM-like structures as input.
    3. Parse that XHTML again with xmldom. Now you finally have a proper DOM.
    4. The xpath library builds on xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so queries like //a won't work.

    Putting it all together, you get something like this:

    const fs = require('mz/fs');
    const xpath = require('xpath');
    const parse5 = require('parse5');
    const xmlser = require('xmlserializer');
    const { DOMParser } = require('xmldom');

    (async () => {
        // 1. Parse the HTML into parse5's DOM-like tree
        const html = await fs.readFile('./test.htm');
        const document = parse5.parse(html.toString());
        // 2. Serialize that tree to XHTML
        const xhtml = xmlser.serializeToString(document);
        // 3. Re-parse the XHTML with xmldom to get a real DOM
        const doc = new DOMParser().parseFromString(xhtml);
        // 4. Query with xpath, binding the XHTML namespace to the "x" prefix
        const select = xpath.useNamespaces({ x: 'http://www.w3.org/1999/xhtml' });
        const nodes = select('//x:a/@href', doc);
        console.log(nodes);
    })();
    
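    A note on consuming the result: select() returns attribute nodes, not plain strings, so you read their .value. Here is a minimal self-contained sketch of that, assuming the same libraries as above but using an inline HTML string instead of the test.htm file; the third true argument to select() (supported by the xpath package) returns only the first match instead of an array.

    ```javascript
    const xpath = require('xpath');
    const parse5 = require('parse5');
    const xmlser = require('xmlserializer');
    const { DOMParser } = require('xmldom');

    // Inline sample document standing in for test.htm
    const html = '<html><body><a href="https://example.com/">link</a></body></html>';
    const document = parse5.parse(html);
    const xhtml = xmlser.serializeToString(document);
    const doc = new DOMParser().parseFromString(xhtml);
    const select = xpath.useNamespaces({ x: 'http://www.w3.org/1999/xhtml' });

    // select() returns attribute *nodes*; map .value to get the strings
    const hrefs = select('//x:a/@href', doc).map(attr => attr.value);
    console.log(hrefs); // [ 'https://example.com/' ]

    // Passing true as the third argument returns just the first match
    const first = select('//x:a/@href', doc, true);
    console.log(first && first.value);
    ```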
