Java XML parser adding unnecessary xmlns and xml:space attributes

Submitted by 给你一囗甜甜゛ on 2020-03-19 06:17:11

Question


I'm using Java 11 (AdoptOpenJDK 11.0.5 2019-10-15) on Windows 10. I'm parsing some legacy XHTML 1.1 files, which take the following general form:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/MarkUp/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>XHTML 1.1 Skeleton</title>
</head>
<body>
</body>
</html>

I'm using a simple non-validating parser:

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final Document document;
try (InputStream inputStream = new BufferedInputStream(getClass().getResourceAsStream("xhtml-1.1-test.xhtml"))) {
  document = documentBuilder.parse(inputStream);
}

For some reason it's adding extra attributes such as xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" and xml:space="preserve" all over the place:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" version="-//W3C//DTD XHTML 1.1//EN" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">
<head xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <title xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XHTML 1.1 Skeleton</title>
</head>
<body xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:space="preserve"></body>
</html>

I know that DTDs can provide default attribute values, but I don't understand why the xmlns:xsi attribute was added, when there appear to be no elements or attributes in that namespace. Furthermore xml:space="preserve" seems to be incorrect altogether; only elements like <pre> should have xml:space="preserve" set, I would think. (Note the version="-//W3C//DTD XHTML 1.1//EN" as well; that's something I don't need or want.)
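
One way to confirm which attributes were filled in from the DTD rather than written in the file is to check Attr.getSpecified() on the parsed DOM. The following is only a minimal sketch under assumptions: the class name and the SAMPLE document (with a made-up internal DTD subset) are hypothetical stand-ins for the XHTML 1.1 files and are not part of the original code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

class DefaultedAttributeReport {

  // Hypothetical sample: the internal DTD subset declares a default value for "kind".
  static final String SAMPLE =
      "<!DOCTYPE doc [<!ATTLIST doc kind CDATA \"fromDtd\">]>"
      + "<doc id=\"written\"/>";

  /** Returns the names of root-element attributes that the parser added from DTD defaults. */
  static List<String> defaultedRootAttributes(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document document = builder.parse(new InputSource(new StringReader(xml)));
      NamedNodeMap attributes = document.getDocumentElement().getAttributes();
      List<String> defaulted = new ArrayList<>();
      for (int i = 0; i < attributes.getLength(); i++) {
        Attr attr = (Attr) attributes.item(i);
        if (!attr.getSpecified()) { // false: the value was not written in the document itself
          defaulted.add(attr.getName());
        }
      }
      return defaulted;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(defaultedRootAttributes(SAMPLE));
  }
}
```

Running the same loop over every element of a parsed XHTML 1.1 document should show whether xmlns:xsi, xml:space, and version are reported as unspecified, which would point to DTD defaults as their origin.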

Am I doing something wrong? Is there a way I can configure the parser not to include this unnecessary cruft?

Interestingly this is not a problem with XHTML 1.0 strict.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

When parsed that yields what one would expect:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

But it is a problem with -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN. So this seems to be just an XHTML 1.1 problem.

Update: I have some potentially helpful news. If I create a new document without a DTD and import the entire document tree into it, all this cruft (which apparently comes from default attribute values declared in the DTD) goes away, because the destination document has no DTD at all. (See "How to force removal of attributes with implied default values from DTD in Java XML DOM".) But this is very inefficient; it would be nice to turn this off altogether when parsing.
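
The import approach described in the update could look roughly like this. It is a sketch under assumptions, not the code actually used: the class name, method names, and the SAMPLE document with its made-up internal DTD subset are all hypothetical. It relies on the DOM rule that importNode copies only specified attributes, so defaults are not carried into a document that has no DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ImportWithoutDtdSketch {

  // Hypothetical sample: "kind" is supplied only by the DTD default, "id" is written out.
  static final String SAMPLE =
      "<!DOCTYPE doc [<!ATTLIST doc kind CDATA \"fromDtd\">]>"
      + "<doc id=\"written\"><child/></doc>";

  static DocumentBuilderFactory newFactory() {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory;
  }

  /** Parses the XML text; defaulted attributes will be present in the result. */
  static Document parse(String xml) {
    try {
      return newFactory().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Copies the element tree into a fresh document that has no DTD.
   * Only the document element and its descendants are copied here; comments or
   * processing instructions outside the root element would need separate handling.
   */
  static Document copyWithoutDtd(Document source) {
    try {
      Document target = newFactory().newDocumentBuilder().newDocument();
      // importNode copies specified attributes only, so DTD defaults are left behind.
      target.appendChild(target.importNode(source.getDocumentElement(), true));
      return target;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document copy = copyWithoutDtd(parse(SAMPLE));
    System.out.println(copy.getDocumentElement().hasAttribute("kind"));
  }
}
```

The cost is a second full copy of the tree, which is the inefficiency mentioned above.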


Answer 1:


I've found a workaround, although it's not ideal. The idea is that when a document asks to be parsed with the XHTML 1.1 DTD (-//W3C//DTD XHTML 1.1//EN), the parser is actually given the XHTML 1.0 Strict DTD (-//W3C//DTD XHTML 1.0 Strict//EN) instead. For most practical purposes that DTD is nearly the same as the one requested, but it doesn't bring in all the default cruft.

Remembering that DefaultEntityResolver is my entity resolver with most of the XHTML DTDs predefined (see Complete list of XHTML, MathML, and SVG modules and other entities, with public identifiers?), the implementation looks something like this:

private static final EntityResolver XHTML_1_1_TO_XHTML_1_0_ENTITY_RESOLVER =
    new EntityResolver() {

  private final EntityResolver defaultEntityResolver = DefaultEntityResolver.getInstance();

  @Override
  public InputSource resolveEntity(final String publicID, final String systemID)
      throws SAXException, IOException {
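    if(XHTML_1_1_PUBLIC_ID.equals(publicID)) {
      // Load the XHTML 1.0 Strict DTD instead, but report the public ID that was requested.
      final InputSource inputSource = defaultEntityResolver.resolveEntity(XHTML_1_0_STRICT_PUBLIC_ID, systemID);
      if(inputSource != null) { // the EntityResolver contract allows a null result
        inputSource.setPublicId(publicID);
      }
      return inputSource;
    }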
    return defaultEntityResolver.resolveEntity(publicID, systemID);
  }

};

Then I would use that entity resolver when parsing:

documentBuilder.setEntityResolver(XHTML_1_1_TO_XHTML_1_0_ENTITY_RESOLVER);
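
Since DefaultEntityResolver and the public ID constants are not shown here, the substitution can be tried out in isolation with in-memory DTDs. The sketch below is only an illustration under assumptions: every identifier, public ID, URL, and DTD text in it is made up, and BASE_RESOLVER merely stands in for a resolver that knows both DTDs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class DtdSubstitutionSketch {

  // Hypothetical public IDs and DTD texts, invented for this illustration.
  static final String HEAVY_PUBLIC_ID = "-//EXAMPLE//DTD Heavy//EN";
  static final String LIGHT_PUBLIC_ID = "-//EXAMPLE//DTD Light//EN";
  static final String HEAVY_DTD =
      "<!ENTITY nbsp \"&#160;\"><!ATTLIST doc kind CDATA \"fromDtd\">";
  static final String LIGHT_DTD = "<!ENTITY nbsp \"&#160;\">";

  static final String SAMPLE =
      "<!DOCTYPE doc PUBLIC \"" + HEAVY_PUBLIC_ID + "\" \"http://example.invalid/heavy.dtd\">"
      + "<doc>a&nbsp;b</doc>";

  /** Stand-in for a resolver with both DTDs predefined; serves them from memory. */
  static final EntityResolver BASE_RESOLVER = (publicId, systemId) -> {
    final String dtd;
    if (HEAVY_PUBLIC_ID.equals(publicId)) {
      dtd = HEAVY_DTD;
    } else if (LIGHT_PUBLIC_ID.equals(publicId)) {
      dtd = LIGHT_DTD;
    } else {
      return null; // null lets the parser fall back to its default resolution
    }
    InputSource source = new InputSource(new StringReader(dtd));
    source.setPublicId(publicId);
    source.setSystemId(systemId);
    return source;
  };

  /** Hands out the light DTD whenever the heavy one is requested. */
  static final EntityResolver SWAPPING_RESOLVER = (publicId, systemId) -> {
    if (HEAVY_PUBLIC_ID.equals(publicId)) {
      InputSource source = BASE_RESOLVER.resolveEntity(LIGHT_PUBLIC_ID, systemId);
      if (source != null) {
        source.setPublicId(publicId); // keep reporting the ID the document asked for
      }
      return source;
    }
    return BASE_RESOLVER.resolveEntity(publicId, systemId);
  };

  static Document parse(String xml, EntityResolver resolver) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setEntityResolver(resolver);
      return builder.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document swapped = parse(SAMPLE, SWAPPING_RESOLVER);
    System.out.println(swapped.getDocumentElement().hasAttribute("kind"));
  }
}
```

In this toy setup the entity reference still resolves with either resolver, while the defaulted attribute appears only when the heavy DTD is actually loaded; that is the trade the workaround makes.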

It's somewhat of a kludge, and semantically I don't like it. But for my application I just need a clean, well-formed parsed document with correct entity replacement, so in practice it may produce effectively the same results for most documents.




Answer 2:


Have you tried the Xerces configuration feature http://apache.org/xml/features/nonvalidating/load-dtd-grammar?

However, I've just been looking at how I do this in Saxon, and I don't ask the XML parser not to report defaulted attributes; rather, I discard them when they are reported. I'm using Xerces as a SAX parser, not a DOM parser, though. (In SAX, defaulted attributes can be detected with Attributes2.isSpecified(), which returns false for an attribute whose value was defaulted from the DTD.)
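
To illustrate the discard-on-report idea, here is a minimal SAX sketch. It is not Saxon's actual code: the class name, method name, and SAMPLE document are hypothetical, and it assumes the parser hands the handler an Attributes2 instance (when it does not, the sketch keeps every attribute).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

class SpecifiedAttributesSketch {

  // Hypothetical sample: "kind" is supplied only by the DTD default, "id" is written out.
  static final String SAMPLE =
      "<!DOCTYPE doc [<!ATTLIST doc kind CDATA \"fromDtd\">]>"
      + "<doc id=\"written\"/>";

  /** Collects the qualified names of attributes that were actually written in the document. */
  static List<String> specifiedAttributeNames(String xml) {
    final List<String> names = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
          // Without Attributes2 there is no way to tell, so the attribute is kept.
          boolean defaulted = attributes instanceof Attributes2
              && !((Attributes2) attributes).isSpecified(i);
          if (!defaulted) {
            names.add(attributes.getQName(i));
          }
        }
      }
    };
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(specifiedAttributeNames(SAMPLE));
  }
}
```

A real pipeline would forward the surviving attributes to a downstream handler or tree builder instead of collecting their names.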



Source: https://stackoverflow.com/questions/60603441/java-xml-parser-adding-unnecessary-xmlns-and-xmlspace-attributes
