When web scraping with Node.js, can I run all JavaScripts on the page? (i.e., simulate a real browser?)

Submitted by 谁说胖子不能爱 on 2019-12-22 05:59:25

Question


I'm trying to do some web scraping with node.js. Using jsdom, it is easy to load up the DOM and inject JavaScript into it. I want to go one step further: run all the JavaScript linked from the web page and then inspect the resulting DOM, including visual properties (height, width, etc.) of elements.

Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom.

Is this possible?

It strikes me that there are two distinct challenges:

  1. Running all the JS on the web page
  2. Getting Node to simulate the window/screen rendering in addition to just the DOM

Another way to ask the question: is it possible to use node.js as a completely headless browser that you can script?

If this isn't possible, does anyone have suggestions for what library I can use to do this? I'm relatively language agnostic.


Answer 1:


Take a look at PhantomJS. It is incredibly simple to use.

http://www.phantomjs.org/

PhantomJS is a command-line tool that packages and embeds WebKit. It behaves like any other WebKit-based web browser, except that nothing is displayed on the screen (hence the term headless). In addition, PhantomJS can be controlled or scripted through its JavaScript API.
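
As a rough illustration of what the question asks for, here is a minimal sketch of a PhantomJS script that loads a page, lets its scripts run, and reads the rendered size of one element. The URL, the selector, the one-second wait, and the helper name `measureElement` are all placeholders chosen for this example, not anything taken from the original post. The `typeof phantom` guard is only there so the file can also be loaded outside PhantomJS without errors.

```javascript
// Hypothetical helper: runs inside the loaded page, so it must be
// self-contained (PhantomJS serializes it before running it there).
function measureElement(selector) {
  var el = document.querySelector(selector);
  if (!el) {
    return null;
  }
  var rect = el.getBoundingClientRect();
  return { width: rect.width, height: rect.height };
}

// Only drive the headless browser when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://example.com/'; // placeholder URL (assumption)
  var selector = 'h1';             // placeholder selector (assumption)

  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + url);
      phantom.exit(1);
      return;
    }
    // Give the page's own scripts a moment to run.
    // The 1000 ms delay is an arbitrary guess, not a recommended value.
    setTimeout(function () {
      var size = page.evaluate(measureElement, selector);
      console.log(JSON.stringify(size));
      phantom.exit(0);
    }, 1000);
  });
}
```

Saved under a name such as `measure.js` (the file name is arbitrary), this would be run with `phantomjs measure.js` rather than with `node`. Because the page is actually laid out by WebKit, `getBoundingClientRect()` should return real numbers instead of the NaN values described in the question.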




Answer 2:


You can use:

  • HtmlUnit (Java, Jython)
  • PyQtWebKit or pygtk + webkit (Python)
  • WWW::Mechanize::Firefox to scrape from Firefox (Perl)
  • Win32-IEAutomation to scrape from MS Internet Explorer (Perl)

All of these solutions can run JavaScript as well.

You can find plenty of sample code by searching http://stackoverflow.com



Source: https://stackoverflow.com/questions/7842507/when-web-scraping-with-node-js-can-i-run-all-javascripts-on-the-page-i-e-si
