I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how
Try Hpricot. It's, well... awesome.
I've used it several times for screen scraping.
Something I ran into when trying this before is that few web pages are well-formed XML documents. Hpricot may be able to deal with that (I haven't used it), but when I did a similar project in the past (using Python and its library's built-in parsing functions), it helped to have a preprocessor to clean up the HTML. I used the Python bindings for HTML Tidy for this, and it made life a lot easier. Ruby bindings are here, but I haven't tried them.
Good luck!
Hpricot is over!
Use Nokogiri now.
Unfortunately, Stack Overflow claims to be XML but actually isn't. Hpricot, however, can parse this tag soup into a tree of elements for you.
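require 'hpricot'
require 'open-uri'
# open-uri lets open() fetch a URL; on Ruby 3.0+ call URI.open(...) instead
doc = Hpricot(open("http://stackoverflow.com/users/19990/armin-ronacher"))
# Select the reputation element with a CSS path, strip every non-digit, convert to an integer
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i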
And so forth.
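The `gsub(/[^\d]+/, "").to_i` step is what turns the displayed text into a number. A small standalone sketch of just that step, using a hypothetical helper named `parse_count` (not part of Hpricot), might look like this:

```ruby
# Hypothetical helper: turn a displayed count such as "12,345" into an Integer.
# Returns nil when the text contains no digits at all.
def parse_count(text)
  digits = text.gsub(/[^\d]+/, "")   # drop commas, whitespace, and labels
  digits.empty? ? nil : digits.to_i
end

puts parse_count("12,345")                # prints 12345
puts parse_count("  1,024 reputation ")   # prints 1024
p parse_count("n/a")                      # prints nil
```

Note that this strips every non-digit, so an abbreviated value such as "1.5k" would come out as 15; it only works when the page shows the full number.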
I always really like what Ilya Grigorik writes, and he wrote up a nice post about using Hpricot.
I also read this post a while back, and it looks like it would be useful for you.
I haven't done either myself, so YMMV, but these seem pretty useful.
This seems to be an old topic, but here is a newer example that gets the reputation:
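#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'
user = "619673/100kg"
url_template = "http://stackoverflow.com/users/%s?tab=reputation"
page = url_template % user
puts page
# open-uri lets open() fetch a URL; on Ruby 3.0+ call URI.open(page) instead
doc = Hpricot(open(page))
# Collect the text of every matching "count" span.
# (Calling .text.each on the result fails on Ruby 1.9+, where String has no #each.)
pars = []
doc.search("div[@class='subheader user-full-tab-header']/h1/span[@class='count']").each do |el|
  pars << el.inner_text
end
# to_s avoids a TypeError when nothing matched
puts "reputation " + pars[0].to_s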