Extract data from HTML Table with mechanize

Question

Here is a sample row from the HTML table:

 <tr>
   <td><strong>Kangchenjunga </strong></td>
   <td>8,586m<br /></td>
   <td>28,169ft</td>
   <td><div align="center">Nepal/India </div></td>
   <td>1955; G. Band, J. Brown </td>
 </tr>

ARGV[0] will hold the name of a mountain (the first column), and the return value should be the last column: the people who climbed the mountain for the first time.

So I need to check whether a row's first column matches ARGV[0], and if it does, return the last column without the date.

require 'mechanize'
p=Mechanize.new.get('www.alpineascents.com/8000m-peaks.asp').body
if p.include?('<strong>'+ARGV[0])
   puts 'ok'
end

The code above prints "ok" if ARGV[0] appears in the body of the HTML document. How can I get the last column of the same row in which ARGV[0] is found?

Example:

<tr>
 <td><strong>GIVE THIS AS A PARAMETER </strong></td>
 <td>SKIP THIS<br /></td>
 <td>SKIP THIS</td>
 <td><div align="center">SKIP THIS</div></td>
 <td>I WANT IT TO RETURN THIS</td>
</tr>

I'm really new to Ruby.

Answer 1:

A more succinct version that leans on the black magic of XPath :)

require 'nokogiri'
require 'open-uri'

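# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# normalize-space() tolerates the trailing space inside <strong> in the sample row
last_td = doc.xpath("//tr[td[strong[normalize-space()='#{ARGV[0]}']]]/td[5]")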

puts last_td.text.gsub(/.*?;/, '').strip

Answer 2:

I believe this is what you want (you will need to gem install nokogiri):

require 'nokogiri'
require 'open-uri'

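# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# The peaks table is the seventh <table> on the page
rows = doc.search('//table')[6].search('tr')
# Skip the first two rows (presumably headers)
rows.shift
rows.shift

rows.each do |row|
  if row.text.include? ARGV[0]
    # Print the fifth cell without the leading "year;" part
    puts row.search('td')[4].text.gsub(/.*?;/, '').strip
  end
end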
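The gsub(/.*?;/, '').strip call used above is what removes the date. As a rough sketch of that step on its own, here is a hypothetical helper (strip_year is not part of either answer) that removes only the text up to the first semicolon, using plain Ruby with no gems:

```ruby
# Hypothetical helper: drop everything up to and including the first
# semicolon (the year), then trim the surrounding whitespace.
def strip_year(cell_text)
  cell_text.sub(/\A[^;]*;/, '').strip
end

puts strip_year('1955; G. Band, J. Brown ')
```

The difference from gsub only shows up when a cell contains more than one semicolon: gsub(/.*?;/, '') would remove every segment that ends in a semicolon and keep just the last one, while the anchored sub removes the first segment only. A cell with no semicolon is returned unchanged apart from the trimming.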

Answer 3:

The first mistake I see is this call:

p=Mechanize.new.get('www.alpineascents.com/8000m-peaks.asp').body

Unfortunately, calling .body on the page just returns the raw HTML of the response as one string.

That string is awkward to parse, so I would recommend calling get with the full URL (including http://) and keeping the page object instead:

p=Mechanize.new.get('http://www.alpineascents.com/8000m-peaks.asp')

This returns a Mechanize::Page object which you can play with (http://mechanize.rubyforge.org/Mechanize/Page.html).

With that object we can run a search, which is Nokogiri's search, by doing the following:

elems = p.search('tr')

This returns all the tr elements as a Nokogiri::XML::NodeSet of Nokogiri::XML::Element objects, which we can use pretty cleanly to get the information we want. You may want to experiment with this in IRB to figure out exactly what you need, but the idea should be clear from the following:

elems.first.search('td').last.text

This returns the text of the final td element in the first tr element found by the search above.

If you have any questions / want me to clarify feel free to ask away.

I have been hacking on things with mechanize for a long while now.

EDIT:

If you want to be able to look up the values by some argument, this is how I imagine you would solve the problem:

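values = {}
elems.each do |e|
  td = e.search('td')
  next if td.empty? # skip rows without td cells, such as header rows
  # strip removes stray whitespace so that the keys match ARGV[0] exactly
  values[td.first.text.strip] = td.last.text.strip
end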

When you have the values hash filled you can do the following:

if ARGV[0] == "Everest"

then

> values["Everest"] => "1953; Sir E. Hillary, T. Norgay"

Source: https://stackoverflow.com/questions/23500507/extract-data-from-html-table-with-mechanize

Tags

html

ruby-on-rails

ruby

parsing

mechanize