Is it possible to plug a JavaScript engine into Ruby and Nokogiri?


Question


I'm writing an application to crawl some websites and scrape data from them. I'm using Ruby, Curl and Nokogiri to do this. In most cases it's straightforward: I only need to fetch a URL and parse the HTML data. This setup works perfectly fine.

However, in some scenarios the websites retrieve data based on user input on some radio buttons. This invokes JavaScript which fetches more data from the server, and the generated URL and posted data are determined by that JavaScript code.

Is it possible to use:

  1. A JavaScript library along with this setup which would be able to execute the JavaScript in the HTML page for me?

  2. Apart from using a different library, is there some integration or a way for the HTML and JS libraries to communicate? For instance, if a button is clicked, Nokogiri needs to call the JavaScript, and then the JavaScript needs to update Nokogiri.

In case my approach doesn't seem the best, what would your suggestion be for building a web crawler + scraper in Ruby?

EDIT: Looks like point 1 is possible using therubyracer, as it embeds the V8 engine in your code, but is there an alternative for point 2?
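
For reference, a minimal sketch of what therubyracer gives you on its own (the variable name and URL here are made up for illustration; note there is no DOM, so page scripts that touch document or window won't run without you stubbing those objects yourself):

require 'v8' # provided by the therubyracer gem

cxt = V8::Context.new
cxt['choice'] = 'option_2'                   # expose a Ruby value to the JS side
url = cxt.eval("'/data?choice=' + choice")   # run a JS snippet and read the result back
puts url                                     # => /data?choice=option_2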


Answer 1:


You are looking for Watir, which runs a real browser and allows you to perform every action you can think of on a web page. There's a similar project called Selenium.

You can even use Watir with a so-called 'headless' browser on a Linux machine.

Watir headless example

Suppose we have this HTML:

<p id="hello">Hello from HTML</p>

and this JavaScript:

document.getElementById('hello').innerHTML = 'Hello from JavaScript';

(Demo: http://jsbin.com/ivihur)

and you want to get the dynamically inserted text. First, you need a Linux box with xvfb and Firefox installed; on Ubuntu, for example:

$ apt-get install xvfb firefox

You will also need the watir-webdriver and headless gems so go ahead and install them as well:

$ gem install watir-webdriver headless

Then you can read the dynamic content from the page with something like this:

require 'rubygems'
require 'watir-webdriver'
require 'headless'

headless = Headless.new       # start a virtual X display (Xvfb)
headless.start
browser = Watir::Browser.new  # launches Firefox inside that virtual display

browser.goto 'http://jsbin.com/ivihur' # our example
el = browser.element :css => '#hello'  # the element the JavaScript rewrites
puts el.text

browser.close
headless.destroy              # shut down the virtual display

If everything went right, this will output:

Hello from JavaScript

I know this runs a browser in the background as well, but it's the easiest solution to your problem I could come up with. It takes quite a while to start the browser, but subsequent requests are quite fast. (Running goto and then fetching the dynamic text above multiple times took about 0.5 sec per request on my Rackspace Cloud Server.)

Source: http://watirwebdriver.com/headless/




Answer 2:


Capybara + PhantomJS

My favorite Ruby-controlled headless browser is PhantomJS, a headless WebKit-based browser. You drive it from Ruby through Poltergeist, which is a Capybara driver for PhantomJS.

In summary, the stack looks like this:

Capybara -> Poltergeist -> PhantomJS -> WebKit

Note that you can use PhantomJS directly with selenium-webdriver, but the Capybara API is nicer (IMHO).
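
If you do go that route, the bare-bones selenium-webdriver usage looks roughly like this (a sketch assuming the phantomjs binary is on your PATH; recent selenium-webdriver releases have dropped PhantomJS support, so this applies to the older versions):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :phantomjs      # PhantomJS must be on your PATH
driver.get 'http://jsbin.com/ivihur'             # demo page from the first answer
puts driver.find_element(:css, '#hello').text    # => Hello from JavaScript
driver.quit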

Being a minimal WebKit implementation, PhantomJS has a faster startup time than a full browser like Chrome or IE.

Sample code to scrape Google result links (with the driver setup included so it runs on its own):

require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

# Register PhantomJS (via Poltergeist) as the default Capybara driver.
Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app)
end
Capybara.default_driver = :poltergeist
Capybara.run_server = false
Capybara.app_host = 'https://www.google.com'

module Test
  class Google
    include Capybara::DSL

    def get_results
      visit('/')
      fill_in "q", :with => "Capybara"
      click_button "Google Search"
      all(:xpath, "//li[@class='g']/h3/a").each { |a| puts a[:href] }
    end
  end
end

scraper = Test::Google.new
scraper.get_results

In addition to the standard Capybara features, Poltergeist can do some very convenient things (a short sketch follows this list):

  • Inject and run your own JavaScript with page.evaluate_script and page.execute_script
  • page.within_frame and page.within_window
  • page.status_code and page.response_headers
  • page.save_screenshot <- This is really nice when things go wrong!
  • page.driver.render_base64(format, options)
  • page.driver.scroll_to(left, top)
  • page.driver.basic_authorize(user, password)
  • element.native.send_keys(*keys)
  • cookie handling
  • drag-and-drop

These features are listed on the Poltergeist GitHub page: https://github.com/teampoltergeist/poltergeist.
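
As a rough illustration of a couple of these, reusing the Poltergeist driver registered in the sample above (the script and file name are just placeholders):

session = Capybara.current_session

session.visit('/')
session.execute_script("window.scrollTo(0, document.body.scrollHeight)")  # run your own JS
puts session.evaluate_script("document.title")                            # JS result back in Ruby
puts session.status_code                                                  # Poltergeist extra
session.save_screenshot('results.png', full: true)                        # handy when things go wrong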

Celerity

If you really want to eke out as much performance as possible, and don't mind switching to JRuby to do so, I have found Celerity to be super fast.

Celerity is a wrapper around Java's HtmlUnit. It is speedy because HtmlUnit is not a full browser; it is more of a browser emulator that executes JavaScript. The downside is that it doesn't support everything a full browser does, so it won't handle very JS-heavy sites, but it is sufficient for most sites and getting better all the time.
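
Under JRuby, basic Celerity usage looks much like Watir. A minimal sketch, reusing the demo page from the first answer and handing the post-JavaScript HTML back to Nokogiri:

# JRuby only
require 'celerity'
require 'nokogiri'

browser = Celerity::Browser.new
browser.goto 'http://jsbin.com/ivihur'
doc = Nokogiri::HTML(browser.html)   # parse the HTML *after* the JavaScript has run
puts doc.at_css('#hello').text       # => Hello from JavaScript
browser.close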

Another advantage is the multithreaded nature of JRuby. With the Peach (parallel each) gem, you can fire off many browsers in parallel. I have done this with a test suite in the past and drastically reduced the time to finish. In fact, we made a load tester using Celerity + Peach that was much more sophisticated than your typical JMeter, Grinder, apachebench, etc. It could really exercise our site!
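
A rough sketch of that parallel pattern (the URLs are placeholders):

require 'celerity'
require 'peach'

urls = %w[http://example.com/a http://example.com/b http://example.com/c]

urls.peach do |url|                  # parallel each, courtesy of the peach gem
  browser = Celerity::Browser.new    # one browser per thread
  browser.goto url
  puts "#{url}: #{browser.title}"
  browser.close
end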



Source: https://stackoverflow.com/questions/11494994/is-it-possible-to-plug-a-javascript-engine-with-ruby-and-nokogiri
