Nokogiri: parsing multiple XML feeds at once and sorting by date

Question

I am using Rails and Nokogiri to parse some XML feeds.

I have parsed one XML feed, and now I want to parse multiple feeds and sort the items by date. They are WordPress feeds, so they all have the same structure.

In my controller I have:

def index
  doc = Nokogiri::XML(open('http://somewordpressfeed'))
  @content = doc.xpath('//item').map do |i|
    {'title' => i.xpath('title').text, 'url' => i.xpath('link').text, 'date' => i.xpath('pubDate').text.to_datetime}
  end
end

In my view I have:

<ul>
  <% @content.each do |l| %>
    <li><a href="<%= l['url'] %>"><%= l['title'] %></a> ( <%= time_ago_in_words(l['date']) %> )</li>
  <% end %>
</ul>

The code above works as it should. When I tried to parse multiple feeds, I got a 404 error:

  feeds = %w(wordpressfeed1, wordpressfeed2)
  docs = feeds.each { |d| Nokogiri::XML(open(d)) }

How do I parse multiple feeds and add them to a Hash like I do with one XML feed? I need to parse about fifty XML feeds at once on page load.

Answer 1:

I'd write it all differently.

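Try changing index to accept an array of URLs, then loop over them using flat_map, which concatenates each feed's items into the single flat array you return (a plain map would return one nested array per feed):

def index(*urls)
  # flat_map merges the per-feed arrays into one flat array of item hashes
  urls.flat_map do |u|
    # open(url) relies on open-uri; on Ruby 3.0+ call URI.open(u) instead
    doc = Nokogiri::XML(open(u))
    doc.xpath('//item').map do |i|
      {
        'title' => i.xpath('title').text,
        'url' => i.xpath('link').text,
        'date' => i.xpath('pubDate').text.to_datetime
      }
    end
  end
end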

@content = index('url1', 'url2')
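
The question also asks for the items to be sorted by date, which the method above doesn't do on its own. Here is a minimal sketch of that step, assuming each item is a hash with a comparable 'date' value. The sample items and the names feed_one, feed_two and nested are made up for illustration, and Time.rfc2822 stands in for to_datetime so the snippet runs without Rails or a network connection:

```ruby
require 'time'

# Hypothetical items, shaped like the hashes built from each feed.
feed_one = [
  { 'title' => 'Older post',  'url' => 'http://example.com/a1',
    'date' => Time.rfc2822('Mon, 21 Jan 2013 10:00:00 +0000') },
  { 'title' => 'Newest post', 'url' => 'http://example.com/a2',
    'date' => Time.rfc2822('Wed, 23 Jan 2013 09:30:00 +0000') }
]
feed_two = [
  { 'title' => 'Middle post', 'url' => 'http://example.com/b1',
    'date' => Time.rfc2822('Tue, 22 Jan 2013 18:15:00 +0000') }
]

# One array per feed; flatten merges them (and is harmless if already flat).
nested = [feed_one, feed_two]

# Sort ascending by date, then reverse so the newest item comes first.
content = nested.flatten.sort_by { |item| item['date'] }.reverse
titles  = content.map { |item| item['title'] }
```

In the controller, the same chain could be applied to whatever index returns before assigning it to @content.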

It'd be more Ruby-like to use symbols instead of strings for your hash keys:

{
  :title => i.xpath('title').text,
  :url   => i.xpath('link').text,
  :date  => i.xpath('pubDate').text.to_datetime
}

Also:

feeds = %w(wordpressfeed1, wordpressfeed2)
docs = feeds.each { |d| Nokogiri::XML(open(d)) }

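each is the wrong iterator. It runs the block but returns the original array, so docs ends up holding the URL strings and the parsed documents are thrown away. You want map instead, which returns all the parsed DOMs so they are assigned to docs.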
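
A small sketch of the difference, with upcase standing in for Nokogiri::XML(open(d)) so it runs without any network access (the feed names are placeholders):

```ruby
feeds = %w(feed1 feed2)

# each runs the block but returns the receiver, so the block's results are lost.
each_result = feeds.each { |f| f.upcase }

# map collects the block's return values into a new array.
map_result = feeds.map { |f| f.upcase }
```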

This won't fix the 404 error, which is caused by a bad URL and is a separate problem. You're not defining your array correctly. %w splits on whitespace only, so the comma becomes part of the first element:

%w(wordpressfeed1, wordpressfeed2)

should be:

%w(wordpressfeed1 wordpressfeed2)

or:

['wordpressfeed1', 'wordpressfeed2']
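
To see why the comma matters, compare what the two literals actually produce (the variable names are just for illustration):

```ruby
# %w splits on whitespace only, so the comma stays attached to the first word.
with_commas = %w(wordpressfeed1, wordpressfeed2)
# => ["wordpressfeed1,", "wordpressfeed2"]

without_commas = %w(wordpressfeed1 wordpressfeed2)
# => ["wordpressfeed1", "wordpressfeed2"]
```

With a real URL, that trailing comma ends up in the request path, which would explain a 404 from the first feed.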

EDIT:

I was revisiting this page and noticed:

I need to parse about fifty XML feeds at once on page load.

This is completely, absolutely the wrong way to handle grabbing data from other sites, especially fifty of them.

WordPress sites typically have a news (RSS or Atom) feed. There should be a parameter in the feed stating how often it's OK to refresh it (in RSS 2.0, for example, the optional ttl element gives this in minutes). HONOR that interval and don't hit their page more often than that, especially when you are tying your load to an HTML page load or refresh.

There are many reasons why, but it boils down to "just don't do it", lest you get banned. If nothing else, it'd be trivial to mount a DoS attack on your site using web-page refreshes, and your site would be beating up on theirs as a result. Neither is being a good web developer on your part. Protect yourself first, and they inherit that protection.

So, what do you do when you want to get fifty sites and have fast response and not beat up other sites? You cache the data in a database, and then read from that when your page is loaded or refreshed. And, in the background you have another task that fires off periodically to scan the other sites, while honoring their refresh rates.
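
As a rough sketch of that split between reading and refreshing, here is an in-memory stand-in for the database. FeedCache, min_interval, the clock argument and the fetcher block are all hypothetical names invented for this example; in a real app the entries would live in a database table, the fetcher would do the Nokogiri parsing, and refresh would be called from a scheduled background job:

```ruby
class FeedCache
  Entry = Struct.new(:items, :fetched_at)

  # min_interval: seconds to wait before refetching a feed.
  # fetcher: block that takes a URL and returns an array of item hashes.
  def initialize(min_interval:, clock: -> { Time.now }, &fetcher)
    @min_interval = min_interval
    @clock = clock
    @fetcher = fetcher
    @entries = {}
  end

  # Called on page load: reads cached items only, newest first.
  def items
    @entries.values.flat_map(&:items).sort_by { |i| i[:date] }.reverse
  end

  # Called by the background task: refetches only feeds whose interval elapsed.
  def refresh(urls)
    now = @clock.call
    urls.each do |url|
      entry = @entries[url]
      next if entry && (now - entry.fetched_at) < @min_interval
      @entries[url] = Entry.new(@fetcher.call(url), now)
    end
  end
end

# Usage with a fake fetcher and a controllable clock, so nothing hits the network.
calls = Hash.new(0)
now = Time.at(0)
cache = FeedCache.new(min_interval: 3600, clock: -> { now }) do |url|
  calls[url] += 1
  [{ title: "post from #{url}", date: now }]
end

cache.refresh(%w(feed1 feed2))
cache.refresh(%w(feed1 feed2)) # still inside the interval: nothing is refetched
now += 3601
cache.refresh(%w(feed1 feed2)) # interval elapsed: each feed is fetched again
```

A single fixed interval is used here for simplicity; honoring each feed's own refresh setting would mean storing an interval per entry.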

Source: https://stackoverflow.com/questions/14459907/nokogiri-parsing-multiple-xml-feeds-at-once-and-sort-by-date

Tags

ruby-on-rails

ruby

nokogiri