How can I extract the node names for a fragmented XML document using Ruby?

泪湿孤枕 submitted on 2019-12-10 12:15:52

Question


I have an XML-like document which is pre-processed by a system outside my control. The format of the document is like this:

 <template>
Hello, there <RECALL>first_name</RECALL>.  Thanks for giving me your email.  
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>.  I have just sent you something.
</template>

However, all I receive is the content between the <template> tags, as a plain text string.

I would like to extract the embedded tags without having to specify them ahead of time when parsing. I can do this with the Crack gem, but only if there is a single tag and it sits at the end of the string.

With Crack, I can put a string like

string = "<SETPROFILE><NAME>email</NAME><VALUE>go@go.com</VALUE></SETPROFILE>"

and my output from Crack is:

{"SETPROFILE"=>{"NAME"=>"email", "VALUE"=>"go@go.com"}}

Then I can use a case statement for the possible values I care about.

Given that the string can contain multiple <tags>, and that they will not necessarily be at the end of the string, how can I parse out the node names and values easily, similar to what I do with Crack?

These tags also need to be removed. I would like to continue to use the excellent suggestion from @TinMan.

It works perfectly once I know the name of the tag. The number of tags will be finite. I send the tag to the appropriate method once I know it, but it needs to get parsed out easily first.


Answer 1:


Using Nokogiri, you can treat the string as a DocumentFragment, then find the embedded nodes:

require 'nokogiri'

doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
Hello, there <RECALL>first_name</RECALL>.  Thanks for giving me your email.  
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>.  I have just sent you something.
EOT

nodes = doc.search('*').each_with_object({}){ |n, h|
  h[n] = n.text
}

nodes # => {#<Nokogiri::XML::Element:0x3ff96083b744 name="RECALL" children=[#<Nokogiri::XML::Text:0x3ff96083a09c "first_name">]>=>"first_name", #<Nokogiri::XML::Element:0x3ff96083b5c8 name="SETPROFILE" children=[#<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>, #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>=>"", #<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">=>""}

Or, more legibly:

nodes = doc.search('*').each_with_object({}){ |n, h|
  h[n.name] = n.text
}

nodes # => {"RECALL"=>"first_name", "SETPROFILE"=>"email", "NAME"=>"email", "VALUE"=>"", "star"=>""}

Getting the content of a particular tag is easy then:

nodes['RECALL'] # => "first_name"

Iterating over all the tags is also easy:

nodes.keys.each do |k| 
  ... 
end

You can even replace a tag and its content with text:

doc.at('RECALL').replace('Fred')
doc.to_xml # => "Hello, there Fred.  Thanks for giving me your email.  \n<SETPROFILE>\n  <NAME>email</NAME>\n  <VALUE>\n    <star/>\n  </VALUE>\n</SETPROFILE>.  I have just sent you something.\n"

How to replace the nested tags is left to you as an exercise.



Source: https://stackoverflow.com/questions/27680007/how-can-i-extract-the-node-names-for-fragmented-xml-document-using-ruby
