Parse table using Nokogiri

Submitted by 拜拜、爱过 on 2019-12-01 00:14:34

Use:

td//text()[normalize-space()]

This selects every text-node descendant that is not whitespace-only, for each td child of the current node (the tr already selected in your code).

Or, if you want to select all text-node descendants, regardless of whether they are whitespace-only or not:

td//text()

UPDATE:

The OP has indicated in a comment that he is getting an unwanted td whose content is just a non-breaking space (&nbsp;, U+00A0).

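To also exclude tds whose content consists only of (one or more) non-breaking-space characters, use the expression below. The second argument of translate() must be a literal non-breaking space (U+00A0), not an ordinary space: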

td//text()[translate(normalize-space(), ' ', '')]

Simple:

doc.search('//td').each do |cell|
  puts cell.content
end
Phrogz

Simple (but not DRY) way of using alternation:

require 'nokogiri'

doc = Nokogiri::HTML <<ENDHTML
<body><table><thead><tr><td>NOT THIS</td></tr></thead><tr>
  <td>foo</td>
  <td><font>bar</font></td>
</tr></table></body>
ENDHTML

p doc.xpath( '//table/tr/td/text()|//table/tr/td/font/text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=>  #<Nokogiri::XML::Text:0x804286fc "bar">]

See XPath with optional element in hierarchy for a more DRY answer.

In this case, however, you can simply do:

p doc.xpath( '//table/tr/td//text()' )
#=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
#=>  #<Nokogiri::XML::Text:0x804286fc "bar">]

Note that your table structure (and mine above), which has no explicit tbody element, is invalid for XHTML. Given your explicit table > tr above, however, I assume that you have a reason for this.
