Recursively removing empty tags from an XML document with Nokogiri?

Question

I have a nested XML document that looks like this:

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>

I need to remove all empty XML nodes, like <empty/> and <css/>.

I ended up with something like:

doc = Nokogiri::XML::DocumentFragment.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

phone = doc.css("phone")
phone.children.each do | child |
    child.remove if child.inner_text == ''
end

The code above removes only the first empty tag, i.e. <empty/>. It does not descend into the nested <lines> block, so I think I need some recursive strategy here. I read the Nokogiri documentation carefully and checked a lot of examples, but I haven't found a solution yet.

How can I fix this?

I'm using Ruby 1.9.3 and Nokogiri 1.5.10.

Answer 1:

You should be able to find all elements that have no text-node children using the XPath "/phone//*[not(text())]".

require 'nokogiri'

doc = Nokogiri::XML::Document.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

doc.xpath("/phone//*[not(text())]").remove

puts doc.to_s.gsub(/\n\s*\n/, "\n")
#=> <?xml version="1.0"?>
#=> <phone>
#=>   <name>test</name>
#=>   <descr>description</descr>
#=>   <lines>
#=>     <line>12345</line>
#=>   </lines>
#=> </phone>

Answer 2:

A late answer with a different approach, in the hope of adding some insight. It also removes the annoying leftover blank lines, and it gives you the option to keep empty elements that have attributes set.

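require 'nokogiri'
# blank? is provided by ActiveSupport, not by core Ruby or Nokogiri
require 'active_support/core_ext/object/blank'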

doc = Nokogiri::XML::Document.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

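def traverse_and_clean(kid)
  # Clean the children first, so a parent left without content is removed too
  kid.children.each { |child| traverse_and_clean(child) }
  # Whitespace-only text nodes are removed as well, which avoids blank lines
  kid.remove if kid.content.blank?
end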

traverse_and_clean(doc)

Output

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>
  <lines>
    <line>12345</line>
  </lines>
</phone>

If you need to keep empty elements that have certain attributes set, all you have to do is change the traverse_and_clean method slightly:

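def traverse_and_clean(kid)
  # Clean the children first, so a parent left without content is removed too
  kid.children.each { |child| traverse_and_clean(child) }
  # Keep a node that has no content if it still carries attributes
  kid.remove if kid.content.blank? && kid.attributes.blank?
end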

Answer 3:

require 'nokogiri'

doc = Nokogiri::XML::Document.parse <<-EOXML
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOXML

nodes = doc.xpath("//phone//*[not(text())]")

nodes.each{|n| n.remove if n.elem? }

puts doc

Output

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>

  <lines>
    <line>12345</line>

  </lines>
</phone>

Answer 4:

Similar to @JustinKo's answer, only using CSS selectors:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>
  <empty/>
  <lines>
    <line>12345</line>
    <css/>
  </lines>
</phone>
EOT

doc.search(':empty').remove
puts doc.to_xml

Looking at what it did:

<?xml version="1.0"?>
<phone>
  <name>test</name>
  <descr>description</descr>

  <lines>
    <line>12345</line>

  </lines>
</phone>

Nokogiri implements a lot of jQuery's selectors, so it's always worth looking to see what those extensions can do.

Source: https://stackoverflow.com/questions/20123176/cleaning-xml-document-recursively-from-empty-tags-with-nokogiri

Tags

ruby

xml

recursion

nokogiri