Performant parsing of HTML pages with Node.js and XPath

Tags: 前端 (front-end) · open · 6 answers · 2118 views
Asked by 情书的邮戳 on 2020-12-07 21:10

I'm into some web scraping with Node.js. I'd like to use XPath, as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this efficiently.

6 Answers
  •  离开以前
    2020-12-07 21:30

    You can do this in several steps.

    1. Parse the HTML with parse5. The downside is that the result is not a real DOM, but it is fast enough and W3C-compliant.
    2. Serialize it to XHTML with xmlserializer, which accepts parse5's DOM-like structures as input.
    3. Parse that XHTML again with xmldom. Now you finally have a proper DOM.
    4. The xpath library builds on xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so queries like //a won't work.

    Putting it all together, you get something like this:

    const fs = require('mz/fs');
    const xpath = require('xpath');
    const parse5 = require('parse5');
    const xmlser = require('xmlserializer');
    const { DOMParser } = require('xmldom');

    (async () => {
        // 1. Parse the HTML into parse5's DOM-like tree
        const html = await fs.readFile('./test.htm');
        const document = parse5.parse(html.toString());
        // 2. Serialize that tree to XHTML
        const xhtml = xmlser.serializeToString(document);
        // 3. Re-parse the XHTML with xmldom to get a real DOM
        const doc = new DOMParser().parseFromString(xhtml);
        // 4. Query with xpath, binding the XHTML namespace to the "x" prefix
        const select = xpath.useNamespaces({ x: 'http://www.w3.org/1999/xhtml' });
        const nodes = select('//x:a/@href', doc);
        console.log(nodes);
    })();
    
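    A note on consuming the result: select() returns attribute nodes, not plain strings, so you read their .value. Here is a minimal self-contained sketch of that, assuming the same libraries as above but using an inline HTML string instead of the test.htm file; the third true argument to select() (supported by the xpath package) returns only the first match instead of an array.

    ```javascript
    const xpath = require('xpath');
    const parse5 = require('parse5');
    const xmlser = require('xmlserializer');
    const { DOMParser } = require('xmldom');

    // Inline sample document standing in for test.htm
    const html = '<html><body><a href="https://example.com/">link</a></body></html>';
    const document = parse5.parse(html);
    const xhtml = xmlser.serializeToString(document);
    const doc = new DOMParser().parseFromString(xhtml);
    const select = xpath.useNamespaces({ x: 'http://www.w3.org/1999/xhtml' });

    // select() returns attribute *nodes*; map .value to get the strings
    const hrefs = select('//x:a/@href', doc).map(attr => attr.value);
    console.log(hrefs); // [ 'https://example.com/' ]

    // Passing true as the third argument returns just the first match
    const first = select('//x:a/@href', doc, true);
    console.log(first && first.value);
    ```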
