How can I ignore DTD validation but keep the Doctype when writing an XML file?

不羁的心 submitted on 2020-01-02 08:34:16

Question


I am working on a system that should be able to read any (or at least, any well-formed) XML file, manipulate a few nodes and write them back into that same file. I want my code to be as generic as possible and I don't want:

  • hardcoded references to Schema/Doctype information anywhere in my code. The doctype information is in the source document, I want to keep exactly that doctype information and not provide it again from within my code. If a document has no DocType, I won't add one. I do not care about the form or content of these files at all, except for my few nodes.
  • custom EntityResolvers or StreamFilters to omit or otherwise manipulate the source information (It is already a pity that namespace information seems somehow inaccessible from the document file where it is declared, but I can manage by using uglier XPaths)
  • DTD validation. I don't have the referenced DTDs, I don't want to include them and Node manipulation is perfectly possible without knowing about them.

The aim is to have the source file entirely unchanged except for the changed Nodes, which are retrieved via XPath. I would like to get away with the standard javax.xml stuff.

My progress so far:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

    factory.setAttribute("http://xml.org/sax/features/namespaces", true);
    factory.setAttribute("http://xml.org/sax/features/validation", false);
    factory.setAttribute("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
    factory.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

    factory.setNamespaceAware(true);
    factory.setIgnoringElementContentWhitespace(false);
    factory.setIgnoringComments(false);
    factory.setValidating(false);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(new InputSource(inStream));

This loads the XML source into an org.w3c.dom.Document successfully, ignoring DTD validation. I can do my replacements and then I use

    Source source = new DOMSource(document);
    Result result = new StreamResult(getOutputStream(getPath()));

    // Write the DOM document to the file
    Transformer xformer = TransformerFactory.newInstance().newTransformer();
    xformer.transform(source, result);

to write it back, which is nearly perfect. But the Doctype tag is gone, no matter what I do. While debugging, I saw that there is a DeferredDoctypeImpl [log4j:configuration: null] object in the Document object after parsing, but it is somehow wrong, empty or ignored. The file I tested on starts like this (but it is the same for other file types):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
    <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
    [...]

I think there are a lot of (easy?) ways involving hacks or pulling additional JARs into the project, but I would rather do it with the tools I already use.


Answer 1:


Sorry, I got it working just now by using an XMLSerializer instead of the Transformer...
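
For anyone who would rather stay with the Transformer, one possible sketch is to read the identifiers from the parsed document's DocumentType node and pass them on as output properties, so nothing is hardcoded. The class and method names below are made up for illustration, and the `load-external-dtd` feature is Xerces-specific (it is the one used in the question). These two properties carry only the public and system identifiers, so an internal DTD subset would not be reproduced this way.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical helper class; the name is an assumption for this sketch.
class DoctypeCopyingWriter {

    /** Parses XML text without fetching the external DTD. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            // Xerces-specific feature, taken from the question's setup
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            return factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    /** Serializes the document, copying doctype identifiers from the DOM if present. */
    static String write(Document doc) {
        try {
            Transformer xformer = TransformerFactory.newInstance().newTransformer();
            DocumentType doctype = doc.getDoctype();
            if (doctype != null) {
                // Only set the properties that the source document actually declared
                if (doctype.getPublicId() != null) {
                    xformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, doctype.getPublicId());
                }
                if (doctype.getSystemId() != null) {
                    xformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, doctype.getSystemId());
                }
            }
            StringWriter out = new StringWriter();
            xformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE log4j:configuration SYSTEM \"log4j.dtd\">\n"
                + "<log4j:configuration xmlns:log4j=\"http://jakarta.apache.org/log4j/\""
                + " debug=\"false\"/>";
        System.out.println(write(parse(xml)));
    }
}
```

A document without a DOCTYPE passes through unchanged, because no output property is set in that case.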




Answer 2:


Here's how you could do it using the LSSerializer found in JDK:

    private void writeDocument(Document doc, String filename)
            throws IOException {
        Writer writer = null;
        try {
            /*
             * Could extract "ls" to an instance attribute, so it can be reused.
             */
            DOMImplementationLS ls = (DOMImplementationLS) 
                    DOMImplementationRegistry.newInstance().
                            getDOMImplementation("LS");
            /*
             * If "doc" has been constructed by parsing an XML document, we
             * should keep its encoding when serializing it; if it has been
             * constructed in memory, or the source declared no encoding,
             * getXmlEncoding() returns null and the encoding has to be
             * decided by the client code. UTF-8 is the fallback here.
             */
            String encoding = doc.getXmlEncoding();
            if (encoding == null) encoding = "UTF-8";
            // Encode the bytes with the same charset that the XML declaration names
            writer = new OutputStreamWriter(new FileOutputStream(filename), encoding);
            LSOutput lsout = ls.createLSOutput();
            lsout.setCharacterStream(writer);
            lsout.setEncoding(encoding);
            LSSerializer serializer = ls.createLSSerializer();
            serializer.write(doc, lsout);
        } catch (Exception e) {
            throw new IOException(e);
        } finally {
            if (writer != null) writer.close();
        }
    }

Needed imports:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import org.w3c.dom.Document;
    import org.w3c.dom.bootstrap.DOMImplementationRegistry;
    import org.w3c.dom.ls.DOMImplementationLS;
    import org.w3c.dom.ls.LSOutput;
    import org.w3c.dom.ls.LSSerializer;

I know this is an old question which has already been answered, but I think the technical details might help someone.
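
To check the whole round trip in one place, here is a minimal sketch that parses the question's sample without loading the DTD, changes one attribute via XPath, and serializes with LSSerializer into a string. The class name, the `roundTrip` method and the `/*/@debug` XPath are assumptions chosen for illustration, and the `load-external-dtd` feature is Xerces-specific.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical demo class; the name is an assumption for this sketch.
class DoctypeRoundTrip {

    /** Parses the XML, sets the root's "debug" attribute to "true", and serializes it. */
    static String roundTrip(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            // Xerces-specific: do not try to fetch the referenced DTD
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Change a single node located via XPath; everything else is left alone
            Node debug = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate("/*/@debug", doc, XPathConstants.NODE);
            if (debug != null) {
                debug.setNodeValue("true");
            }

            DOMImplementationLS ls = (DOMImplementationLS)
                    DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSOutput lsout = ls.createLSOutput();
            StringWriter text = new StringWriter();
            lsout.setCharacterStream(text);
            lsout.setEncoding("UTF-8"); // encoding named in the XML declaration
            LSSerializer serializer = ls.createLSSerializer();
            serializer.write(doc, lsout);
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE log4j:configuration SYSTEM \"log4j.dtd\">\n"
                + "<log4j:configuration xmlns:log4j=\"http://jakarta.apache.org/log4j/\""
                + " debug=\"false\"/>";
        System.out.println(roundTrip(xml));
    }
}
```

Writing to a StringWriter keeps the example free of file handling; for a real file, the `writeDocument` method above would take its place.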




Answer 3:


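I tried the LSSerializer approach and could not get it to retain the Doctype. This is probably the solution that Stephan used. Note: this is in Scala, but it uses a Java library, so it converts to Java directly.

    import java.io.{File, FileOutputStream, OutputStreamWriter}
    import org.w3c.dom.Element
    // JDK-internal classes, not a supported public API
    import com.sun.org.apache.xml.internal.serialize.{OutputFormat, XMLSerializer}

    def transformXML(root: Element, file: String): Unit = {
      val doc = root.getOwnerDocument
      // The format is derived from the document being written
      val format = new OutputFormat(doc)
      format.setIndenting(true)
      // Encode the bytes with the same charset that the format declares
      val writer = new OutputStreamWriter(new FileOutputStream(new File(file)), format.getEncoding)
      try {
        val serializer = new XMLSerializer(writer, format)
        serializer.serialize(doc)
      } finally {
        writer.close() // flush and release the file handle
      }
    }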


Source: https://stackoverflow.com/questions/582352/how-can-i-ignore-dtd-validation-but-keep-the-doctype-when-writing-an-xml-file
