HTML-parser on Node.js [closed]

守給你的承諾、 提交于 2019-11-26 13:53:17
Farid Nouri Neshat

If you want to build DOM you can use jsdom.

There's also cheerio, it has the jQuery interface and it's a lot faster than older versions of jsdom, although these days they are similar in performance.

You might wanna have a look at htmlparser2, which is a streaming parser, and according to its benchmark, it seems to be faster than others, and no DOM by default. It can also produce a DOM, as it is also bundled with a handler that creates a DOM. This is the parser that is used by cheerio.

parse5 also looks like a good solution. It's fairly active (11 days since the last commit as of this update), WHATWG-compliant, and is used in jsdom, Angular, and Polymer.

And if you want to parse HTML for web scraping, you can use YQL. There is a node module for it. YQL I think would be the best solution if your HTML is from a static website, since you are relying on a service, not your own code and processing power. Though note that it won't work if the page is disallowed by the robot.txt of the website, YQL won't work with it.

If the website you're trying to scrape is dynamic then you should be using a headless browser like phantomjs. Also have a look at casperjs, if you're considering phantomjs. And you can control casperjs from node with SpookyJS.

Beside phantomjs there's zombiejs. Unlike phantomjs that cannot be embedded in nodejs, zombiejs is just a node module.

There's a nettuts+ toturial for the latter solutions.

Try https://github.com/tmpvar/jsdom - you give it some HTML and it gives you a DOM.

You can also take a look at x-ray: https://github.com/lapwinglabs/x-ray

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!