Extract data from HTML Table with mechanize

Posted by 时间秒杀一切 on 2019-12-30 11:41:29

Question


First of all, here is a sample row from the HTML table:

 <tr>
   <td><strong>Kangchenjunga </strong></td>
   <td>8,586m<br /></td>
   <td>28,169ft</td>
   <td><div align="center">Nepal/India </div></td>
   <td>1955; G. Band, J. Brown </td>
 </tr>

ARGV[0] will hold the name of a mountain (the first column), and the return value should be the last column: the people who climbed the mountain for the first time.

So I need to check whether a row's first column matches ARGV[0], and if it does, return the last column without the date.

require 'mechanize'
p=Mechanize.new.get('www.alpineascents.com/8000m-peaks.asp').body
if p.include?('<strong>'+ARGV[0])
   puts 'ok'
end

The code above prints "ok" if ARGV[0] appears in the body of the HTML document. How can I get the last column of the same row in which ARGV[0] is found?

EXAMPLE:

<tr>
 <td><strong>GIVE THIS AS A PARAMETER </strong></td>
 <td>SKIP THIS<br /></td>
 <td>SKIP THIS</td>
 <td><div align="center">SKIP THIS</div></td>
 <td>I WANT IT TO RETURN THIS</td>
</tr>

I'm really new to Ruby.


Answer 1:


A more succinct version, relying more on the black magic of XPath :)

require 'nokogiri'
require 'open-uri'

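# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# normalize-space() tolerates the trailing space inside <strong>, e.g. "Kangchenjunga "
last_td = doc.xpath("//tr[td[strong[normalize-space(text())='#{ARGV[0]}']]]/td[5]")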

puts last_td.text.gsub(/.*?;/, '').strip
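The date-stripping step at the end can be tried on its own, without fetching the page. Below is a minimal sketch using the last-cell text from the question's sample row; the helper name `strip_year` is hypothetical and only wraps the same `gsub` call.

```ruby
# Hypothetical helper wrapping the gsub call from the answer above.
def strip_year(cell_text)
  # The non-greedy /.*?;/ matches everything up to and including a ';',
  # so the leading "1955;" is removed; strip then trims the whitespace.
  cell_text.gsub(/.*?;/, '').strip
end

puts strip_year('1955; G. Band, J. Brown ')  # prints "G. Band, J. Brown"
```

If a cell contains no semicolon, the text is returned unchanged apart from the whitespace trimming.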



Answer 2:


I believe this is what you want (you will need to gem install nokogiri):

require 'nokogiri'
require 'open-uri'

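# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# [6] picks the seventh <table> element on the page
rows = doc.search('//table')[6].search('tr')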
rows.shift
rows.shift

rows.each do |row|
  if row.text.include? ARGV[0]
    puts row.search('td')[4].text.gsub(/.*?;/, '').strip
  end
end



Answer 3:


The first mistake that I see is that you are calling the following:

p=Mechanize.new.get('www.alpineascents.com/8000m-peaks.asp').body

Unfortunately, calling .body on the Mechanize page just returns the raw HTML of the response as a single String.

A raw string is quite annoying to parse through, so I would recommend doing the following instead:

p=Mechanize.new.get('http://www.alpineascents.com/8000m-peaks.asp')

This returns a Mechanize::Page object which you can play with (http://mechanize.rubyforge.org/Mechanize/Page.html).

With that object we can simply perform a search, which is Nokogiri's search, by doing the following:

elems = p.search('tr')

This returns all the tr elements as a Nokogiri::XML::NodeSet of Nokogiri::XML::Element objects, which we can use pretty cleanly to get the information that we want. Note that you may want to play around with all of this in IRB to figure out exactly what you need, but the idea should be clear from the following:

elems.first.search('td').last.text

This returns the text of the final td element of the first tr element we searched for before.

If you have any questions / want me to clarify feel free to ask away.

I have been hacking on things with mechanize for a long while now.

EDIT:

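If you want to be able to look up the values using some argument, this is how I imagine you would solve the problem:

values = {}
elems.each do |e|
  td = e.search('td')
  next if td.empty?  # skip rows that contain no td cells
  # strip removes stray whitespace such as the trailing space in "Kangchenjunga "
  values[td.first.text.strip] = td.last.text.strip
end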

When you have the values hash filled you can do the following:

if ARGV[0] == "Everest"

then

> values["Everest"] => "1953; Sir E. Hillary, T. Norgay"
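The lookup idea can be sketched end to end in plain Ruby. The `rows` array below is a hypothetical stand-in for the parsed table, holding only the first and last cell text of the two rows quoted in this thread, and `first_ascent` is an illustrative helper name, not part of Mechanize or Nokogiri.

```ruby
# Hypothetical stand-in for parsed rows: [first cell text, last cell text].
rows = [
  ['Kangchenjunga ', '1955; G. Band, J. Brown '],
  ['Everest', '1953; Sir E. Hillary, T. Norgay']
]

# Build the lookup hash, trimming whitespace so keys match user input.
values = rows.each_with_object({}) do |(name, climbers), hash|
  hash[name.strip] = climbers.strip
end

# Illustrative helper: returns the climbers without the leading year,
# or nil when the mountain is not in the hash.
def first_ascent(values, mountain)
  entry = values[mountain]
  entry && entry.sub(/\A.*?;/, '').strip
end

# Falls back to "Everest" when no command-line argument is given.
puts first_ascent(values, ARGV[0] || 'Everest')
```

Returning nil for an unknown mountain lets the caller decide how to report a miss instead of raising inside the lookup.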



Source: https://stackoverflow.com/questions/23500507/extract-data-from-html-table-with-mechanize
