Best web scraping Ruby on Rails library that handles dynamic HTML produced by javascript [closed]

可紊 提交于 2019-12-11 02:46:20

问题


I am using Ruby on Rails with the Mechanize library to scrape store websites. The problem is that many times I can't crawl certain elements. However, I can see this when I 'view source' on the site.

For example, Walmart's category (in this case below it is "Health") is unscapeable. I believe this is because it is dynamically produced HTML (e.g. from javascript). In order to scrape this, I need a browser to process the web request.

http://www.walmart.com/ip/Replacement-Sensor-Module-for-AlcoMate-Prestige-Breathalyzer/10167376

I am also using a linux machine on Amazon EC2. It would be tough to install browser for UI scraping. Is there any Rails gem/plugin that can help me?

Thanks, all!!


回答1:


Your question, rephrased, is, what's an easy way to parse an HTML document's DOM in the same way a web browser would, then execute the JavaScript in the document against the parsed DOM? Without running an actual web browser.

That's a little tricky.

However, all is not lost. Take a look at Capybara. Though created for acceptance testing you can also use it for general grokking of documents. To execute JavaScript you'll need to use a driver that supports it, and since you want it to be "headless" (no browser GUI) that probably means using capybara-webkit, Akephalos or capybara-envjs.

Another option might be Harmony, which I know nothing about except that it appears to do what you want but also appears not to be maintained anymore, so YMMV.



来源:https://stackoverflow.com/questions/8484305/best-web-scraping-ruby-on-rails-library-that-handles-dynamic-html-produced-by-ja

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!