What is the best way to parse a web page in Ruby?

佛祖请我去吃肉 2020-12-24 08:41

I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how

6 Answers
  • 2020-12-24 09:05

    Try Hpricot; it's, well... awesome.

    I've used it several times for screen scraping.

  • 2020-12-24 09:10

    Something I ran into when trying this before is that few web pages are well-formed XML documents. Hpricot may be able to deal with that (I haven't used it), but when I did a similar project in the past (using Python and its library's built-in parsing functions), it helped to have a pre-processor to clean up the HTML. I used the Python bindings for HTML Tidy for this, and it made life a lot easier. Ruby bindings are here, but I haven't tried them.

    Good luck!

  • 2020-12-24 09:19

    Hpricot is over!

    Use Nokogiri now.

  • 2020-12-24 09:25

    Unfortunately, Stack Overflow claims to be XML but actually isn't. Hpricot, however, can parse this tag soup into a tree of elements for you.

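    require 'hpricot'
    require 'open-uri'
    
    # URI.open is the open-uri entry point on current Rubies; a bare open(url) no longer fetches URLs.
    doc = Hpricot(URI.open("http://stackoverflow.com/users/19990/armin-ronacher"))
    # Select the reputation element, strip every non-digit, and convert to an Integer.
    reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i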
    

    And so forth.
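
    The `gsub(/[^\d]+/, "").to_i` step is what turns the scraped text into a number. A small sketch of that step on its own, with made-up input strings and a hypothetical `digits_to_i` helper:

```ruby
# Hypothetical helper: strip every non-digit character, then convert what is
# left to an Integer. An empty result converts to 0.
def digits_to_i(text)
  text.gsub(/[^\d]+/, '').to_i
end

puts digits_to_i("  12,345 ")
puts digits_to_i("reputation: 987")
puts digits_to_i("n/a")
```

    One caveat with this approach: it keeps all digits, so an abbreviated count such as "1.2k" would come out as 12 and would need separate handling.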

  • 2020-12-24 09:29

    I always really like what Ilya Grigorik writes, and he wrote up a nice post about using Hpricot.

    I also read this post a while back and it looks like it would be useful for you.

    I haven't done either myself, so YMMV, but these seem pretty useful.

  • 2020-12-24 09:30

    This seems to be an old topic, but here is a newer example that fetches the reputation:

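    #!/usr/bin/env ruby
    
    require 'rubygems'
    require 'hpricot'
    require 'open-uri'
    
    user = "619673/100kg"
    url_template = "http://stackoverflow.com/users/%s?tab=reputation"
    
    # Build the profile URL from the template and the user path.
    page = url_template % user
    puts page
    
    # URI.open replaces the bare open(url) call, which current Rubies no longer support.
    doc = Hpricot(URI.open(page))
    # Collect the text of each matching node; String#each no longer exists, so iterate the elements instead.
    pars = doc.search("div[@class='subheader user-full-tab-header']/h1/span[@class='count']").map { |el| el.inner_text }
    
    # to_s avoids a TypeError if nothing matched.
    puts "reputation " + pars[0].to_s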
    