Question
I'm trying to do some web scraping with Node.js. Using jsdom, it is easy to load up the DOM and inject JavaScript into it. I want to go one step further: run all the JavaScript linked from the web page and then inspect the resulting DOM, including the visual properties (height, width, etc.) of its elements.
Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom.
Is this possible?
It strikes me that there are two distinct challenges:
- Running all the JS on the web page
- Getting Node to simulate the window/screen rendering in addition to just the DOM
Another way to ask the question: is it possible to use Node.js as a completely headless browser that you can script?
If this isn't possible, does anyone have suggestions for what library I can use to do this? I'm relatively language agnostic.
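The NaN/zero result the question describes can be reproduced with a short jsdom script. This is a minimal sketch (assuming jsdom is installed via `npm install jsdom`; the element id and styles are placeholders): jsdom parses HTML and can execute scripts, but it has no layout engine, so geometry properties are never computed.

```javascript
// Sketch only: requires the jsdom package (not part of Node's stdlib).
const { JSDOM } = require("jsdom");

const dom = new JSDOM(`<body><div id="box" style="width: 100px">hi</div></body>`);
const box = dom.window.document.getElementById("box");

// jsdom does not perform layout, so rendered dimensions are not available
// here even though the inline style declares a width. These typically come
// back as 0 (or NaN in older versions) rather than 100.
console.log(box.offsetWidth);
console.log(box.getBoundingClientRect().width);
```

This is the first of the two challenges listed above: jsdom handles the DOM, but simulating window/screen rendering needs a real browser engine.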
Answer 1:
Take a look at PhantomJS. Incredibly simple to use.
http://www.phantomjs.org/
PhantomJS is a command-line tool that packs and embeds WebKit. Literally it acts like any other WebKit-based web browser, except that nothing gets displayed to the screen (thus, the term headless). In addition to that, PhantomJS can be controlled or scripted using its JavaScript API.
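Because PhantomJS embeds a real WebKit engine, layout actually runs, so element dimensions are meaningful. A minimal sketch of measuring an element after the page's scripts have executed (the URL and selector are placeholders; run it with `phantomjs measure.js`, not with `node`):

```javascript
// PhantomJS script: uses PhantomJS's built-in 'webpage' module,
// which is only available inside the phantomjs runtime.
var page = require('webpage').create();

page.open('http://example.com/', function (status) {
    if (status !== 'success') {
        console.log('Failed to load page');
        phantom.exit(1);
    }
    // page.evaluate runs inside the page context, after its linked
    // scripts have run, so layout properties are real rendered values.
    var size = page.evaluate(function () {
        var el = document.querySelector('h1');
        var rect = el.getBoundingClientRect();
        return { width: rect.width, height: rect.height };
    });
    console.log(JSON.stringify(size));
    phantom.exit();
});
```

Note that `page.evaluate` is sandboxed: the callback is serialized into the page, so it cannot close over variables from the outer script.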
Answer 2:
You can use:
- HtmlUnit (Java, Jython)
- PyQtWebKit or PyGTK + WebKit (Python)
- WWW::Mechanize::Firefox to scrape from Firefox (Perl)
- Win32::IEAutomation to scrape from MS Internet Explorer (Perl)
All of these solutions can run JavaScript as well.
You will find plenty of sample code by searching http://stackoverflow.com
Source: https://stackoverflow.com/questions/7842507/when-web-scraping-with-node-js-can-i-run-all-javascripts-on-the-page-i-e-si