Web crawler that can interpret JavaScript [closed]

Submitted by 天涯浪子 on 2019-12-25 07:39:11

Question


I want to write a web crawler that can interpret JavaScript. Basically, it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to what Firebug shows in its HTML window. The best example is Kayak.com, where you cannot see the resulting DOM in the browser with 'view source', but you can save the resulting HTML through Firebug.

How would I go about doing this? What tools exist that would help me?


Answer 1:


Ruby's Capybara is an integration-test library, but it can also be used to write stand-alone web crawlers. Because it drives backends like Selenium or headless WebKit, it interprets JavaScript out of the box:

require 'capybara/dsl'
require 'capybara-webkit'

include Capybara::DSL
# Use the headless WebKit driver so JavaScript is executed.
Capybara.current_driver = :webkit
Capybara.app_host = "http://www.google.com"

# Visit the page and print its HTML after scripts have run.
page.visit("/")
puts(page.html)



Answer 2:


I've been using HtmlUnit (Java). It was originally designed for unit-testing web pages. Its JavaScript support isn't perfect, but it hasn't failed me in my limited usage. According to its site, it can run the following JS frameworks to a reasonable degree (a minimal usage sketch follows the list):

  • jQuery 1.2.6
  • MochiKit 1.4.1
  • GWT 2.0.0
  • Sarissa 0.9.9.3
  • MooTools 1.2.1
  • Prototype 1.6.0
  • Ext JS 2.2
  • Dojo 1.0.2
  • YUI 2.3.0
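
For reference, a minimal sketch of how this is typically wired up, assuming a recent HtmlUnit 2.x on the classpath (method names such as getOptions() and close() vary slightly between versions, and the URL is just a placeholder):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitCrawler {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Turn on JavaScript and don't abort on script errors from third-party pages.
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        HtmlPage page = webClient.getPage("http://www.example.com/");
        // Give background scripts (AJAX, timers) up to 5 seconds to finish.
        webClient.waitForBackgroundJavaScript(5000);

        // asXml() serializes the DOM as it exists after script execution,
        // which is roughly what Firebug's HTML panel shows.
        System.out.println(page.asXml());

        webClient.close();
    }
}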



Answer 3:


You are more likely to have success in Java than in PHP. There is a pre-existing JavaScript interpreter for Java called Rhino. It's a reference implementation and is well documented.

Rhino is used in lots of existing Java apps to provide JavaScript scripting within the app. I've also heard of it being used to drive automated tests written in JavaScript.
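
A minimal embedding sketch, assuming the Rhino jar (org.mozilla.javascript) is on the classpath; note that Rhino by itself only evaluates JavaScript and has no DOM, so a crawler would still need to expose document/window objects to the scripts:

import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoDemo {
    public static void main(String[] args) {
        // Enter a Rhino context for the current thread.
        Context cx = Context.enter();
        try {
            // A scope holding the standard JavaScript objects (Object, Math, String, ...).
            Scriptable scope = cx.initStandardObjects();

            // Evaluate a script string; a host app would add its own objects
            // to the scope before doing this.
            Object result = cx.evaluateString(
                    scope, "var x = 6 * 7; 'answer: ' + x", "<inline>", 1, null);

            System.out.println(Context.toString(result)); // prints "answer: 42"
        } finally {
            Context.exit();
        }
    }
}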

I also know that Java includes code that can parse and render HTML, though someone who knows Java better than I do can probably say more about that. I'm not denying it would be very difficult to achieve something like this; you'd essentially be re-implementing a lot of what a browser does.




Answer 4:


You could use Mozilla's rendering engine Gecko:

https://developer.mozilla.org/en/Gecko




Answer 5:


Take a look here: http://snippets.scrapy.org/snippets/22/ . It's a Python screen-scraping and web-crawling framework (Scrapy) used with webdrivers that open a page, render everything you need, and let you "capture" anything you want on the page via XPath selectors.
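
That snippet is Python/Scrapy, but the same webdriver idea carries over to Java if that's your target language. A minimal Selenium WebDriver sketch (placeholder URL; assumes Firefox and the Selenium Java bindings are installed) that loads a page, lets the browser execute its JavaScript, and prints the rendered source:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SeleniumDump {
    public static void main(String[] args) {
        // Drives a real Firefox instance, so JavaScript runs exactly as in a browser.
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.example.com/");
            // Print the page source as the browser reports it after loading
            // (with real browsers this reflects script-generated DOM changes).
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}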



Source: https://stackoverflow.com/questions/16870247/is-there-an-open-source-spider-crawler-that-supports-build-in-javascript-executi
