How do I get content from a website using Ruby / Rails?

Submitted by 冷眼眸甩不掉的悲伤 on 2019-12-21 05:37:30

Question


I want to copy some specific content from a website using Ruby/Rails. The content I need is inside a marquee HTML tag, divided into divs. How can I access this content using Ruby? To be more precise, I want to use some kind of Ruby GUI (preferably Shoes). How do I do it?


Answer 1:


This isn't really a Rails question. It's something you'd do using Ruby, then possibly display using Rails, or Sinatra or Padrino - pick your poison.

There are several different HTTP clients you can use:

Open-URI comes with Ruby and is the easiest. Net::HTTP comes with Ruby and is the standard toolbox, but it's lower-level so you'd have to do more work. HTTPClient and Typhoeus+Hydra are capable of threading and have both high-level and low-level interfaces.
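To give a sense of the extra work the lower-level route involves, here is a minimal sketch using only Net::HTTP from the standard library. The method name fetch_html and the max_redirects parameter are illustrative assumptions, not something from the answer; Open-URI would follow redirects for you, while here they are handled by hand.

```ruby
require 'net/http'
require 'uri'

# Fetch a page body with Net::HTTP, following a limited number of redirects.
# fetch_html and max_redirects are hypothetical names chosen for this sketch.
def fetch_html(url, max_redirects = 3)
  raise ArgumentError, 'too many redirects' if max_redirects.negative?

  response = Net::HTTP.get_response(URI.parse(url))

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    fetch_html(URI.join(url, response['location']).to_s, max_redirects - 1)
  else
    raise "Request failed: #{response.code} #{response.message}"
  end
end
```

Calling fetch_html('http://www.example.com') should return the page source as a string, which can then be handed to a parser.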

I recommend using Nokogiri to parse the returned HTML. It's very full-featured and robust.

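require 'nokogiri'
require 'open-uri'

# Since Ruby 3.0, Kernel#open no longer opens URLs; use URI.open instead.
doc = Nokogiri::HTML(URI.open('http://www.example.com'))

# Print the parsed document as HTML.
puts doc.to_html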

If you need to navigate through login screens or fill in forms before you get to the page you need to parse, then I'd recommend looking at Mechanize. It relies on Nokogiri internally so you can ask it for a Nokogiri document and parse away once Mechanize retrieves the desired URL.

If you need to deal with dynamic HTML, that is, content generated by JavaScript in the browser, then look into the various Watir tools. They drive various web browsers, then let you access the content as seen by the browser.

Once you have the content or data you want, you can "repurpose" it into text inside a Rails page.




Answer 2:


If I understand correctly, you want a GUI for a website scraper. If so, you might have to build one yourself.

The easiest way to scrape a website is with the Nokogiri or Mechanize gems. Basically, you fetch the page (Mechanize does this itself; Nokogiri needs an HTTP client such as Open-URI), then use the library's XPath capabilities to select the text out of the DOM.

https://github.com/sparklemotion/nokogiri

https://github.com/sparklemotion/mechanize (for the documentation)



Source: https://stackoverflow.com/questions/5250547/how-do-i-get-content-from-a-website-using-ruby-rails
