JAXP parse error on valid XML

做~自己de王妃 提交于 2019-12-24 10:28:53

问题


I am trying to run some XPath Queries on XML in Java and apparently the recommended way to do so is to construct a document first.

Here is the standard JAXP code sample that I was using:

import org.w3c.dom.Document;
import javax.xml.parsers.*;

final DocumentBuilder xmlParser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
final Document doc = xmlParser.parse(xmlFile);

I also tried the Saxon API, but got the same errors:

import net.sf.saxon.s9api.*;

final DocumentBuilder documentBuilder = new Processor(false).newDocumentBuilder();
final XdmNode xdm = documentBuilder.build(new File("out/data/blog.xml"));

Here is a minimal reconstructed example XML which the DocumentBuilder in JDK 1.8 can't parse:

<?xml version="1.1" encoding="UTF-8" ?>
<xml>
    <![CDATA[Some example text with [funny highlight]]]>
</xml>

According to the spec, the square bracket ] just before the end of CDATA marker ]]> is perfectly legal, but the parser just exits with a stack trace and the message org.xml.sax.SAXParseException; XML document structures must start and end within the same entity..

On my original data file which contains a lot of CDATA sections, the message is instead org.xml.sax.SAXParseException; The element type "item" must be terminated by the matching end-tag "</item>". In both cases ´com.sun.org.apache.xerces´ shows up in the stacktrace a lot.

Form both observations it seems as if the parser just didn't end the CDATA section at ]]>.

EDIT: As it turned out, the example will pass when the <?xml ... ?> declaration is omitted. I hadn't checked that before posting here and added it just now.


回答1:


Short answer: add Apache Xerces to the build path, it will automatically be loaded instead of the parser from the JDK and the XML will be parsed just fine! Copy-paste Gradle Dependency:

implementation "xerces:xercesImpl:2.11.0"

Some background: Apache Xerces is indeed the same parser which is also used in the JDK, but even though Xerces 2.11 dates from 2013 the JDK comes with a much older version. That really sucks!

As the Saxon team puts it:

Saxonica recommends use of the Xerces parser from Apache in preference to the version bundled in the JDK, which is known to have some serious bugs.

In case you wonder how simply putting Xerces on the classpath makes the problem disappear: even though the JDK and Saxon DocumentBuilders construct entirely different document types, they both use the same Standard Java Interfaces to call the parser and also the same mechanism to find and load the parser (or rather, the parser factory). In short, a java.util.ServiceLoader is called and looks into all the JARs in the classpath for properties files in META-INF/services and this is how the xercesJar announces that it does provide an XML parser. And good for us, the JDK's own implementation is superseded by anything found there.

After making this bad experience with JDK XML classes, I am even more motivated to refactor projects to use Saxon for XPath processing instead of the implementation of XPath in the JDK. The other reason is the technical advantage of XDM over DOM (same link as above).



来源:https://stackoverflow.com/questions/48843481/jaxp-parse-error-on-valid-xml

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!