How to parse badly formed XML in Java?

Submitted by 百般思念 on 2019-11-30 02:40:10

Question


I have XML that I need to parse but have no control over the creation of. Unfortunately it's not very strict XML and contains things like:

<mytag>This won't parse & contains an ampersand.</mytag>

The javax.xml.stream classes don't like this at all, and rightly error with:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[149,50]
Message: The entity name must immediately follow the '&' in the entity reference.

How can I work around this? I can't change the XML, so I guess I need an error-tolerant parser.

My preference would be for a fix that doesn't require too much disruption to the existing parser code.


Answer 1:


If it's not well-formed XML (like the above), then no conforming XML parser will handle it, as you've found. If you know the scope of the errors (such as the bare ampersand above), the simplest solution may be to run a correcting pass over the input (for example, escaping stray ampersands as &amp;) and then feed the result to your existing parser.

Otherwise you'll have to write a parser yourself with built-in support for such anomalies, and I can't believe that would be anything other than a tedious and error-prone task.
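The correcting pass described above could be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (AmpersandFixer, escapeBareAmpersands, parseText) are made up for this example, the whole document is assumed to fit in memory as a String, and the only defect handled is a bare '&' that does not start an entity or character reference.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class AmpersandFixer {

    // Matches an '&' that is NOT the start of a named entity (&name;),
    // a decimal character reference (&#38;) or a hex one (&#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare '&' with the predefined entity "&amp;".
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    // Feeds the corrected text to the standard StAX parser and
    // concatenates all character data, to show that it now parses.
    static String parseText(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String broken = "<mytag>This won't parse & contains an ampersand.</mytag>";
        String fixed = escapeBareAmpersands(broken);
        System.out.println(fixed);
        System.out.println(parseText(fixed));
    }
}
```

Because the fix happens before parsing, the existing javax.xml.stream code can stay as it is. Note the limits of this sketch: an undeclared named entity such as &nbsp; passes the regex but will still be rejected by the parser, and an ampersand inside a CDATA section would be escaped even though it should not be.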




Answer 2:


Use a library such as Tidy or TagSoup.

TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.




Answer 3:


I believe JSoup can handle badly formed XML.



Source: https://stackoverflow.com/questions/920344/how-to-parse-badly-formed-xml-in-java
