Nokogiri Ruby HTML Parser

Submitted by 半腔热情 on 2019-12-12 04:25:36

Question


I'm running into problems scraping across multiple pages with Nokogiri. I need to narrow down the results based on qualifying hrefs first, and I have a script that collects all of the hrefs I'm interested in. However, I'm having trouble extracting the article titles so that I can link to them. Ideally, whenever I find a link I want, I could also grab the title, i.e. the text describing the article/href, as in

<a href.......>Text Linked to</a>

so that I end up with a hash like {:source => ".....", :url => ".....", :title => "....."}. Here is the script I have so far. It narrows the links down to the ones I want to put in the hash.

require 'nokogiri'
require 'open-uri'

page = "http://www.huffingtonpost.com/politics/"

# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(page))
links = doc.css('a')
hrefs = links.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if{|href| href.empty?}

hrefs.each do |h|
    # skip links to comment sections
    next if h.end_with?("#comments")
    # keep only long URLs ending in "olitics" (&& short-circuits, unlike &)
    puts h if h.end_with?("olitics") && h.length > 65
end

Could someone explain how I can first find the hrefs I want and then extract the link text for just those filtered URLs, as I have started to do here? Also, is it recommended to put Nokogiri scripts like this in controllers and write the results to the database from there in Rails? I appreciate it.

Thanks


Answer 1:


I'm not sure I understand your question completely, but I'm going to interpret it as "How do I extract links and access their attributes?"

Simply amend your selector:

links = doc.css('a[href]')

This will give you all a elements that have an href attribute. You can then iterate over them and read each link's attributes (for example link['href']) as well as its text (link.text).



Source: https://stackoverflow.com/questions/21792812/nokogiri-ruby-html-parser
