Removing anything between XML tags and their content

僤鯓⒐⒋嵵緔 submitted on 2019-12-19 04:21:43

Question


I need to remove anything between XML tags, especially whitespace and newlines.

For example, removing whitespace and newlines from:
</node> \n<node id="whatever">

to get:
</node><node id="whatever">

This is not meant for parsing XML by hand, but rather for preparing XML data before it is parsed by a tool. To be more specific, I'm using Hpricot (Ruby) to parse XML, and unfortunately we're currently stuck on version 0.6.164. I don't know about more recent versions, but this one often returns odd nodes (objects) that contain only whitespace and line breaks. So the idea is to clean up the XML before converting it into an Hpricot document. Alternative solutions are appreciated.

An example from a test: NoMethodError: undefined method `children' for "\n ":Hpricot::Text
The interesting part here is not the NoMethodError, because that's just fine, but that the Hpricot::Text element only contains a newline and nothing more.


Answer 1:


Please don't use regular expressions to parse XML. It's horribly error-prone.

Use a proper XML library, which will make this trivial. There are XML libraries available for just about every programming platform you could ask for - there's really no excuse to use a regular expression for XML.
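As a rough illustration of the library route, here is a minimal sketch using REXML, the XML library distributed with Ruby, rather than Hpricot. The sample document and variable names are made up for illustration. It assumes that REXML's :ignore_whitespace_nodes context option, set to :all, suits the data at hand, since it drops every whitespace-only text node while the tree is built.

```ruby
require 'rexml/document'

# Hypothetical sample input; whitespace and newlines sit between the tags.
xml_source = <<~XML
  <root>
    <node id="a">first</node>
    <node id="whatever">second</node>
  </root>
XML

# With :ignore_whitespace_nodes set to :all, REXML skips text nodes that
# consist only of whitespace instead of adding them to the tree.
doc = REXML::Document.new(xml_source, { ignore_whitespace_nodes: :all })

# Serialize the root element; no whitespace-only text remains between tags.
compact = doc.root.to_s
puts compact
```

Text that carries real content ("first", "second") is kept; only the blank nodes between elements are skipped.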




Answer 2:


A solution is to select all "blank" text nodes and remove them.

require 'nokogiri'
doc = Nokogiri::XML(xml_source)
# Remove every text node that contains only whitespace.
doc.xpath('//text()[not(normalize-space())]').remove



Answer 3:


It is generally not a good idea to parse XML using regular expressions. One of the major benefits of XML is that there are dozens of well-tested parsers out there for any language/framework that you might ever want. There are some tricky rules within XML that prevent any regular expression from being able to properly parse XML.

That said, something like:

s/>.*?</></gs

(that is Perl syntax) might do what you want. It matches everything from a greater-than sign up to the next less-than sign and replaces it with "><", stripping whatever was in between. The "g" flag at the end performs the substitution as many times as needed, and the "s" flag makes "." match all characters INCLUDING newlines; without it, the pattern could not match across a line break, which is exactly the case in the question. Note that this removes all text between tags, not only whitespace; to strip only whitespace and newlines, as in the question's example, a pattern such as s/>\s+</></g would be needed.
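Since the question is about Ruby, the whitespace-only variant of this substitution could be sketched as follows. The helper name strip_inter_tag_whitespace is hypothetical, and the sketch assumes the input contains no CDATA sections or comments in which "> <" sequences must be preserved, because a regular expression cannot tell those apart from real tags.

```ruby
# Hypothetical helper name, used only for this illustration.
def strip_inter_tag_whitespace(xml)
  # Match '>' + one or more whitespace characters (spaces, tabs, newlines) + '<'
  # and collapse the run to '><'. \s already matches newlines in Ruby,
  # so no multiline flag is needed.
  xml.gsub(/>\s+</, '><')
end

sample = "</node> \n<node id=\"whatever\">"
puts strip_inter_tag_whitespace(sample)
# prints: </node><node id="whatever">

# Text content is left alone, because "> x <" is not pure whitespace.
puts strip_inter_tag_whitespace("<a> x </a>")
# prints: <a> x </a>
```

This is only suitable as a pre-processing step on simple input; a parser-based cleanup remains the more robust choice.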




Answer 4:


You shouldn't use regex to parse XML or HTML. It's just not reliable, and there are way too many edge cases. Use an XML/HTML parser for this kind of thing instead.




Answer 5:


Don't use regex. Try parsing the XML into a DOM and manipulating it from there (what language/framework are you using?).



Source: https://stackoverflow.com/questions/1155293/removing-anything-between-xml-tags-and-their-content
