Iterating through multiple URLs to parse HTML with Nokogiri

Posted by 醉酒当歌 on 2019-12-23 03:15:08

Question


What I'm trying to do is scrape the names and prices of items from multiple vendors using Nokogiri. I'm passing the CSS selectors (used to find the names and prices) to Nokogiri as method arguments.

Any guidance on how to pass multiple URLs to the "scrape" method while also passing the other arguments (e.g. vendor, item_path)? Or am I going about this the completely wrong way?

Here is the code:

require 'rubygems' # Load Ruby Gems
require 'nokogiri' # Load Nokogiri
require 'open-uri' # Load Open-URI

@@collection = Array.new # Array to hold meta hash

def scrape(url, vendor, item_path, name_path, price_path)
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
        @@collection << meta = Hash.new # Creates a new hash then add to global array
        meta[:vendor] = vendor
        meta[:name] = item.css(name_path).text.strip
        meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join 
    end
end

scrape( "page_a.html", "Sample Vendor A", "#products", ".title", ".prices")
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Answer 1:


You can pass multiple URLs the same way you're already doing it in your second example:

scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Your scrape method will have to iterate through those urls, for instance:

def scrape(urls, vendor, item_path, name_path, price_path)
  urls.each do |url|
    # URI.open is the open-uri entry point; Kernel#open no longer opens URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      @@collection << meta = Hash.new # Creates a new hash, then adds it to the shared array
      meta[:vendor] = vendor
      meta[:name] = item.css(name_path).text.strip
      meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
    end
  end
end

This also means that the first example must also be passed as an array:

scrape( ["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
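If you would rather keep both call styles working, one option is to normalize the first argument with Kernel#Array, which wraps a lone string in an array and passes an existing array through unchanged. A minimal sketch, assuming a helper named normalize_urls (hypothetical, not part of the original code):

```ruby
# Hypothetical helper (not in the original code): accept either a single
# URL string or an array of URLs and always hand back an array.
def normalize_urls(urls)
  Array(urls) # Kernel#Array wraps a non-array value; arrays pass through unchanged
end

single = normalize_urls("page_a.html")
many   = normalize_urls(["page_a.html", "page_b.html"])

p single
p many
```

Inside scrape, the outer loop could then be written as Array(urls).each do |url| ... end, so callers may pass either form.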



Answer 2:


FYI, using @@collection is inappropriate: it is a class variable being used as global state outside of any class. Instead, write your method to return a value:

def scrape(urls, vendor, item_path, name_path, price_path)
  collection = []
  urls.each do |url|
    # URI.open is the open-uri entry point; Kernel#open no longer opens URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      collection << {
        :vendor => vendor,
        :name   => item.css(name_path).text.strip,
        :price  => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    end
  end

  collection
end
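With a method that returns its results, the caller decides where they go. The sketch below shows only that calling pattern; fake_scrape is a hypothetical stand-in that fabricates one hash per URL, so nothing is fetched or parsed and the names and prices are placeholders rather than real data:

```ruby
# Hypothetical stand-in for scrape (not the real method): it returns one
# placeholder hash per URL so the calling pattern can be shown in isolation.
def fake_scrape(urls, vendor)
  urls.map do |url|
    { :vendor => vendor, :name => "item from #{url}", :price => "0.00" }
  end
end

# The caller owns the collection and appends each call's return value.
collection = []
collection.concat(fake_scrape(["page_a.html"], "Sample Vendor A"))
collection.concat(fake_scrape(["page_a.html", "page_b.html"], "Sample Vendor B"))

collection.each { |meta| puts "#{meta[:vendor]}: #{meta[:name]}" }
```

The same pattern should apply to the real scrape method, since it also returns an array of hashes.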

Which can be reduced to:

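def scrape(urls, vendor, item_path, name_path, price_path)
  # flat_map (rather than map) keeps the result a single flat array of hashes,
  # matching what the version with the explicit collection returns
  urls.flat_map { |url|
    # URI.open is the open-uri entry point; Kernel#open no longer opens URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.map { |item| # Builds one hash per item on grid
      {
        :vendor => vendor,
        :name   => item.css(name_path).text.strip,
        :price  => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    }
  }
end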
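One detail worth checking in a block-based version like this: a nested map yields one inner array per URL, whereas flat_map on the outer loop produces the single flat list that the explicit collection version returns. A small sketch of the difference, using a hypothetical pages hash in place of real HTML documents:

```ruby
# Hypothetical stand-in data: each array plays the role of the items found
# on one page, so no network access or HTML parsing is involved.
pages = {
  "page_a.html" => ["Widget", "Gadget"],
  "page_b.html" => ["Gizmo"]
}

# Outer map: one inner array per page (an array of arrays).
nested = pages.keys.map { |url| pages[url].map { |name| { :name => name } } }

# Outer flat_map: the inner arrays are concatenated into one flat array.
flat = pages.keys.flat_map { |url| pages[url].map { |name| { :name => name } } }

puts nested.length # one entry per page
puts flat.length   # one entry per item
```

Calling flatten(1) on the nested result would be another way to get the same flat list.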


Source: https://stackoverflow.com/questions/15453115/iterating-through-multiple-urls-to-parse-html-with-nokogori
