When web scraping with Node.js, can I run all the JavaScript on the page (i.e., simulate a real browser)?

Question

I'm trying to do some web scraping with Node.js. Using jsdom, it is easy to load the DOM and inject JavaScript into it. I want to go one step further: run all the JavaScript linked from the web page and then inspect the resulting DOM, including visual properties (height, width, etc.) of elements.

Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom.

Is this possible?

It strikes me that there are two distinct challenges:

1. Running all the JS on the web page
2. Getting Node to simulate the window/screen rendering in addition to just the DOM

Another way to ask the question: is it possible to use node.js as a completely headless browser that you can script?

If this isn't possible, does anyone have suggestions for what library I can use to do this? I'm relatively language agnostic.

Answer 1:

Take a look at PhantomJS. It is incredibly simple to use.

http://www.phantomjs.org/

PhantomJS is a command-line tool that packs and embeds WebKit. It acts like any other WebKit-based web browser, except that nothing is displayed on the screen (hence the term headless). In addition, PhantomJS can be controlled or scripted through its JavaScript API.
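
To make the "scripted using its JavaScript API" part concrete, here is a minimal sketch of a PhantomJS script that loads a page, lets its scripts run, and reads back element dimensions. The URL, the selector, the viewport size, and the helper name collectDimensions are placeholders chosen for illustration, not something from the question. The script is meant to be run with the phantomjs binary (for example, phantomjs measure.js), not with node.

```javascript
// Sketch only: TARGET_URL, SELECTOR and collectDimensions are illustrative names.
var TARGET_URL = 'http://example.com/';
var SELECTOR = 'h1, p';

// This function is executed inside the loaded page via page.evaluate(),
// so it can only use browser globals (such as `document`) and its arguments.
function collectDimensions(selector) {
  var nodes = document.querySelectorAll(selector);
  var results = [];
  for (var i = 0; i < nodes.length; i++) {
    var rect = nodes[i].getBoundingClientRect();
    results.push({
      tag: nodes[i].tagName,
      width: rect.width,
      height: rect.height
    });
  }
  // The return value must be JSON-serializable to cross back to the script.
  return results;
}

// `phantom` exists only inside PhantomJS; the guard keeps the file loadable elsewhere.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  // Layout depends on the viewport, so set one explicitly (assumed size).
  page.viewportSize = { width: 1024, height: 768 };

  page.open(TARGET_URL, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + TARGET_URL);
      phantom.exit(1);
      return;
    }
    // Scripts that execute during page load have run by now; content added
    // later by timers or XHR may need an extra delay before measuring.
    var dims = page.evaluate(collectDimensions, SELECTOR);
    console.log(JSON.stringify(dims, null, 2));
    phantom.exit(0);
  });
}
```

Because WebKit actually lays out the page, getBoundingClientRect() returns real numbers here, which is the piece that a DOM-only environment such as jsdom does not provide.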

Answer 2:

You can use:

- HtmlUnit (Java, Jython)
- PyQtWebKit or PyGTK + WebKit (Python)
- WWW::Mechanize::Firefox to scrape via Firefox (Perl)
- Win32-IEAutomation to scrape via Microsoft Internet Explorer (Perl)

All of these solutions can run JavaScript as well.

You can find plenty of sample code by searching http://stackoverflow.com.

Source: https://stackoverflow.com/questions/7842507/when-web-scraping-with-node-js-can-i-run-all-javascripts-on-the-page-i-e-si

Tags

node.js

screen-scraping