How do I scrape HTML between two HTML comments using Nokogiri?

纵饮孤独 submitted on 2019-12-06 09:12:48

Extracting content between two nodes isn't something we'd normally do; usually we want the content inside a particular node. Comments are nodes too, just a special type of node.

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<body>
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
</body>
EOT

By looking for a comment containing the specified text, it's possible to find a starting node:

start_comment = doc.at("//comment()[contains(.,'begin content')]") # => #<Nokogiri::XML::Comment:0x3fe94994268c " begin content ">

Once that's found, a loop is needed that stores the current node and then moves on to the next sibling until it finds another comment:

content = Nokogiri::XML::NodeSet.new(doc)
contained_node = start_comment.next_sibling
loop do
  # Stop at the closing comment, or when the siblings run out
  break if contained_node.nil? || contained_node.comment?
  content << contained_node
  contained_node = contained_node.next_sibling
end

content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"