How to collect the first of several elements of a node in Nokogiri

Submitted by ↘锁芯ラ on 2021-01-23 08:56:26

Question


I have data that looks like:

<release> 
 <artists>
  <artist>
   <name>Johnny Mnemonic</name>
  </artist>
  <artist>
    <name>Constantine</name>
  </artist>
 </artists>
</release>
<release>
 <artists>
  <artist>
   <name>Speed</name>
  </artist>
  <artist>
    <name>The Matrix</name>
  </artist>
 </artists>
 </release>
 ...and so on.

For each release I want only the data from the first <artist> tag. I tried the following code but it pulls all text from the artists:

require 'nokogiri'

page = Nokogiri::XML(open("37.xml"))

# Append every matched node to the output file
page.xpath("//artists[1]").each do |el|
  File.open("#{LOCAL_DIR}/37.txt", 'a') { |f| f.write(el) }
end

Answer 1:


Nokogiri supports two main types of searches, search and at. search returns a NodeSet, which you should think of as an array. at returns a Node. Either can take a CSS or XPath expression. I prefer CSS selectors since they're more readable, but sometimes you can't easily get where you want to be with one, so try the other.

For your question, it's important to specify the node you want to extract the text from, using text. If your result is too broad you'll get text from between tags in addition to the text inside the tag you want. To avoid that, drill down to the node closest to the text you're trying to read:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<release>
 <artists>
  <artist>
   <name>Johnny Mnemonic</name>
  </artist>
  <artist>
    <name>Constantine</name>
  </artist>
 </artists>
</release>
EOT

Because these look for the name node specifically, the text desired is easy to get without garbage:

doc.at('name').text                # => "Johnny Mnemonic"
doc.at('artist name').text         # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"

These are looser searches so more junk is returned:

doc.at('artist').text  # => "\n   Johnny Mnemonic\n  "
doc.at('artists').text # => "\n  \n   Johnny Mnemonic\n  \n  \n    Constantine\n  \n "

Using search returns multiple nodes:

doc.search('name').map(&:text)

[
    [0] "Johnny Mnemonic",
    [1] "Constantine"
]

doc.search('artist').map(&:text)

[
    [0] "\n   Johnny Mnemonic\n  ",
    [1] "\n    Constantine\n  "
]

The only real difference between search and at is that at is like search(...).first.

See "How to avoid joining all text from Nodes when scraping" also.

Nokogiri also has selector-specific versions of these for convenience: at_css and css accept only CSS, and at_xpath and xpath accept only XPath.


Here are alternate ways, using CSS and XPath accessors to get at the names, clipped from Pry. Judging by the output, doc here holds the full two-release data from the question:

[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]



Answer 2:


Your XPath expression selects the <artists> tags, not each <artist> tag as you seem to expect. Try this:

doc.search('artists artist').map(&:text)

Your expression "//artists" retrieves all <artists> tags; the [1] then keeps each <artists> that is the first such child of its parent. It does not select the first element inside the tag itself.



Source: https://stackoverflow.com/questions/15485940/how-to-collect-the-first-of-several-elements-of-a-node-in-nokogiri
