I'm into some web scraping with Node.js. I'd like to use XPath, as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way t
You can do so in several steps.
1. Parse your HTML with parse5. The bad part is that the result is not a DOM, though it is fast enough and W3C-compliant.
2. Serialize it to XHTML with xmlserializer, which accepts the DOM-like structures of parse5 as input.
3. Parse that XHTML again with xmldom. Now you finally have a DOM.
4. The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so unprefixed queries like //a won't work.

Finally you get something like this.
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    const html = await fs.readFile('./test.htm');
    // Parse the HTML into parse5's own tree (not a DOM yet).
    const document = parse5.parse(html.toString());
    // Serialize that tree to an XHTML string.
    const xhtml = xmlser.serializeToString(document);
    // Parse the XHTML with xmldom to get a real DOM.
    const doc = new dom().parseFromString(xhtml);
    // Bind the prefix "x" to the XHTML namespace and query with it.
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();