How do I parse this data structure returned by Nokogiri in Ruby?

旧街凉风 提交于 2019-12-13 08:37:01

问题


So I am cycling through an array element and this is the result returned:

[nil, [#<Nokogiri::XML::Element:0x835386d4 name="a" attributes=[#<Nokogiri::XML::Attr:0x835385f8 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x835381c0 "Web Designer Full time">]>

What I would like to do is access href value, and then the text value. How do I do that?

I tried this:

puts i[:href]

But that generates this error:

TypeError: Symbol as array index

By the way, I am accessing i as an element in the array via each like this:

contents.each do |i|
    puts i.inspect
    puts i[:href]
end

Edit 1:

This is how I am generating the contents array. There is no need to rename it, because it can get confusing :)

contents = {}
first_items.each do |link|
    content_url = link
    content_page = Nokogiri::HTML(open(content_url))
    contents[link[:href]] = content_page.css("p a")
end

puts contents.inspect

This is what gets output:

{nil=>[#<Nokogiri::XML::Element:0x85fee914 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee838 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x85fee400 "Web Designer Full time">]>, #<Nokogiri::XML::Element:0x85fee298 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee1bc name="href" value="http://bham.craigslist.org/web/2959813303.html">] children=[#<Nokogiri::XML::Text:0x85fedd84 "Once in a lifetime opportunity...">]>, #<Nokogiri::XML::Element:0x85fedc1c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fedb40 name="href" value="http://bham.craigslist.org/web/2925485723.html">] children=[#<Nokogiri::XML::Text:0x85fed708 "Website Designer and Blogging Internship!">]>, #<Nokogiri::XML::Element:0x85fed5a0 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fed4c4 name="href" value="http://bham.craigslist.org/web/2918424652.html">] children=[#<Nokogiri::XML::Text:0x85fed08c "Excellent Java Developer Opportunity!">]>, #<Nokogiri::XML::Element:0x85fecf24 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fece48 name="href" value="http://bham.craigslist.org/web/2888669703.html">] children=[#<Nokogiri::XML::Text:0x85feca10 "Freelance Graphic Design">]>, #<Nokogiri::XML::Element:0x85fec8a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec7cc name="href" value="http://bham.craigslist.org/web/2900256461.html">] children=[#<Nokogiri::XML::Text:0x85fec394 "GWT/GXT Developer">]>, #<Nokogiri::XML::Element:0x85fec22c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec150 name="href" value="http://bham.craigslist.org/web/2897641463.html">] children=[#<Nokogiri::XML::Text:0x85febd18 "Website hiring!">]>]}

Here is the full value of the output for i:

--------------------
This is the value of i: 
[nil, [#<Nokogiri::XML::Element:0x85fee914 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee838 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x85fee400 "Web Designer Full time">]>, #<Nokogiri::XML::Element:0x85fee298 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee1bc name="href" value="http://bham.craigslist.org/web/2959813303.html">] children=[#<Nokogiri::XML::Text:0x85fedd84 "Once in a lifetime opportunity...">]>, #<Nokogiri::XML::Element:0x85fedc1c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fedb40 name="href" value="http://bham.craigslist.org/web/2925485723.html">] children=[#<Nokogiri::XML::Text:0x85fed708 "Website Designer and Blogging Internship!">]>, #<Nokogiri::XML::Element:0x85fed5a0 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fed4c4 name="href" value="http://bham.craigslist.org/web/2918424652.html">] children=[#<Nokogiri::XML::Text:0x85fed08c "Excellent Java Developer Opportunity!">]>, #<Nokogiri::XML::Element:0x85fecf24 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fece48 name="href" value="http://bham.craigslist.org/web/2888669703.html">] children=[#<Nokogiri::XML::Text:0x85feca10 "Freelance Graphic Design">]>, #<Nokogiri::XML::Element:0x85fec8a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec7cc name="href" value="http://bham.craigslist.org/web/2900256461.html">] children=[#<Nokogiri::XML::Text:0x85fec394 "GWT/GXT Developer">]>, #<Nokogiri::XML::Element:0x85fec22c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec150 name="href" value="http://bham.craigslist.org/web/2897641463.html">] children=[#<Nokogiri::XML::Text:0x85febd18 "Website hiring!">]>]]
--------------------
This is the value of i.href: 

Edit 2:

By the way, this is what the actual HTML output looks like...I did this:

builder = Nokogiri::HTML::Builder.new do |doc|
    doc.html {
        doc.body {
            contents.each do |el|
                if !el.nil?
                    puts "-" * 20
                    puts "This is the value of el: "
                puts el.inspect

                    puts "-" * 20
                    puts "This is the value of el.href: "           
                 puts el[:href]
                end

                doc.p {
                    doc.a el, :href => el
                    } 
            end     
            }           
        }
end

puts "*" * 50
puts "This is the HTML generated"

puts builder.to_html

This is how it looks:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><a href="&lt;a%20href=%22http://bham.craigslist.org/web/2961573018.html%22&gt;Web%20Designer%20Full%20time&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2959813303.html%22&gt;Once%20in%20a%20lifetime%20opportunity...&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2925485723.html%22&gt;Website%20Designer%20and%20Blogging%20Internship!&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2918424652.html%22&gt;Excellent%20Java%20Developer%20Opportunity!&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2888669703.html%22&gt;Freelance%20Graphic%20Design&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2900256461.html%22&gt;GWT/GXT%20Developer&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2897641463.html%22&gt;Website%20hiring!&lt;/a&gt;">&lt;a href="http://bham.craigslist.org/web/2961573018.html"&gt;Web Designer Full time&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2959813303.html"&gt;Once in a lifetime opportunity...&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2925485723.html"&gt;Website Designer and Blogging Internship!&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2918424652.html"&gt;Excellent Java Developer Opportunity!&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2888669703.html"&gt;Freelance Graphic Design&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2900256461.html"&gt;GWT/GXT Developer&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2897641463.html"&gt;Website hiring!&lt;/a&gt;</a></p></body></html>

回答1:


I think it can be a lot simpler. Nokogiri already parses the document and provides convenient ways to access the content. Rather than looping, storing Nokogiri objects, then trying to extract them, why not try a more direct approach?

Try this code:

content_page.search(//a[@href]).map{ |el| [el[:href], el.text] }

This creates the 2d array containing the text and href for each link in the document, which is what you said in a follow-up comment that you're actually working toward.




回答2:


You can use compact to remove nils:

nodes.compact.each do |node|
  puts node[:href], node.text
end



回答3:


Maybe this, since you have a odd nil in your array.

contents.each do |i|
  if !i.nil?
    puts i.inspect
    puts i[:href]
  end
end

Edit1: Actually I think you just need to do contents = contents[1].

contents = contents[1]
contents.each do |i|
    puts i.inspect
    puts i[:href]
end


来源:https://stackoverflow.com/questions/10290115/how-do-i-parse-this-data-structure-returned-by-nokogiri-in-ruby

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!