transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8") is NOT working

Question

I have the following method to write an XML DOM document to a stream:

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    transformer.transform(docSource, new StreamResult(out));
}

I am testing some other XML functionality, and this is just the method that I use to write to a file. My test program generates 33 test cases where files are written out. 28 of them have the following header:

<?xml version="1.0" encoding="UTF-8"?>...

But for some reason, one of the test cases now produces:

<?xml version="1.0" encoding="ISO-8859-1"?>...

And four more produce:

<?xml version="1.0" encoding="Windows-1252"?>...

As you can clearly see, I am setting the ENCODING output key to UTF-8. These tests used to work on an earlier version of Java. I have not run the tests in a while (more than a year), but running them today on "Java(TM) SE Runtime Environment (build 1.6.0_22-b04)" I get this odd behavior.

I have verified that the documents causing the problem were read from files that originally had those encodings. It seems that the new versions of the libraries are attempting to preserve the encoding of the source file that was read. But that is not what I want: I really do want the output to be in UTF-8.
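One way to confirm what the parser remembered is to ask the parsed Document directly. The following is only a minimal sketch: the class name XmlEncodingProbe and the sample bytes are made up for illustration, and it assumes the JDK's default DOM parser is the one in use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of the original test suite.
final class XmlEncodingProbe {

    /** Parses the bytes and returns the encoding named in the XML declaration. */
    static String declaredEncoding(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        // DOM Level 3: the encoding attribute of the XML declaration, as parsed.
        return doc.getXmlEncoding();
    }

    public static void main(String[] args) throws Exception {
        // Sample input standing in for one of the non-UTF-8 test files.
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(latin1));
    }
}
```

If the value reported for the failing documents matches the header that ends up in the output, that would support the suspicion that the serializer is picking up the source encoding from the Document rather than from the ENCODING output property.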

Does anyone know of any other factor that might cause the transformer to ignore the UTF-8 encoding setting? Is there anything else that has to be set on the document to say to forget the encoding of the file that was originally read?

UPDATE:

I checked the same project out on another machine, then built and ran the tests there. On that machine all the tests pass! All the files have "UTF-8" in their header. That machine has "Java(TM) SE Runtime Environment (build 1.6.0_29-b11)". Both machines are running Windows 7. On the new machine that works correctly, jdk1.5.0_11 is used to make the build, but on the old machine jdk1.6.0_26 is used. The libraries used for both builds are exactly the same. Can it be a JDK 1.6 incompatibility with 1.5 at build time?

UPDATE:

After 4.5 years, the Java library is still broken, but due to the suggestion by Vyrx below, I finally have a proper solution!

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>".getBytes("UTF-8"));
    transformer.transform(docSource, new StreamResult(out));
}

The solution is to disable the writing of the header, and to write the correct header just before serializing the XML to the output stream. Lame, but it produces the correct results. Tests broken over 4 years ago are now running again!

Answer 1:

I had the same problem on Android when serializing emoji characters. When using UTF-8 encoding in the transformer, the emoji were written out as numeric character entities for the UTF-16 surrogate pairs, which subsequently broke other parsers that read the data.

This is how I ended up solving it:

StringWriter sw = new StringWriter();
sw.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>");
Transformer t = TransformerFactory.newInstance().newTransformer();

// this will work because we are creating a Java string, not writing to an output
t.setOutputProperty(OutputKeys.ENCODING, "UTF-16"); 
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.transform(new DOMSource(elementNode), new StreamResult(sw));

return IOUtils.toInputStream(sw.toString(), Charset.forName("UTF-8"));

Answer 2:

To answer the question: the following code works for me. It reads the data in the input encoding and converts it to the output encoding.

// Parse the input string, decoding its bytes with the input encoding
ByteArrayInputStream inStreamXMLElement = new ByteArrayInputStream(strXMLElement.getBytes(input_encoding));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document docRepeat = db.parse(new InputSource(new InputStreamReader(inStreamXMLElement, input_encoding)));
Node elementNode = docRepeat.getElementsByTagName(strRepeat).item(0);

// Serialize the selected element without an XML declaration, in the output encoding
DOMSource domSourceRepeat = new DOMSource(elementNode);
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, output_encoding);

ByteArrayOutputStream bos = new ByteArrayOutputStream();
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, output_encoding));

transformer.transform(domSourceRepeat, sr);
byte[] outputBytes = bos.toByteArray();
strRepeatString = new String(outputBytes, output_encoding);

Answer 3:

I spent a significant amount of time debugging this issue because it was working well on my machine (Ubuntu 14 + Java 1.8.0_45) but wasn't working properly in production (Alpine Linux + Java 1.7).

Contrary to my expectation, the following approach from the answer above did not help:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, "UTF-8"));

But this one worked as expected:

val out = new StringWriter()
val result = new StreamResult(out)

Answer 4:

What about this?

public static String documentToString(Document doc) throws Exception {
    return documentToString(doc, "UTF-8");
}

public static String documentToString(Document doc, String encoding) throws Exception {
    // The original called an unshown helper, validateNullString(encoding);
    // a plain null/empty check is assumed to be equivalent.
    if (encoding == null || encoding.isEmpty()) encoding = "UTF-8";

    Transformer transformer;
    try {
        transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
    } catch (javax.xml.transform.TransformerConfigurationException error) {
        return null;
    }

    Source source = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    Result result = new StreamResult(writer);

    try {
        transformer.transform(source, result);
    } catch (javax.xml.transform.TransformerException error) {
        return null;
    }
    return writer.toString();
}

Answer 5:

I could work around the problem by wrapping the Document object passed to the DOMSource constructor. The method getXmlEncoding of my wrapper always returns null, all other methods are delegated to the wrapped Document object.
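The answer does not show its wrapper, so the following is only a sketch of one way such a wrapper could be built, using a java.lang.reflect.Proxy instead of a hand-written delegating class. The class name EncodingBlindDocument and the method wrap are made up for illustration, and whether a given transformer implementation accepts a proxied Document in a DOMSource is an assumption to verify on your own JDK.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import org.w3c.dom.Document;

// Hypothetical helper: hides the source encoding of a parsed Document.
final class EncodingBlindDocument {

    /** Returns a Document view whose getXmlEncoding() is always null. */
    static Document wrap(final Document target) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Pretend the document carries no encoding from its source file.
            if ("getXmlEncoding".equals(method.getName())
                    && method.getParameterCount() == 0) {
                return null;
            }
            try {
                // Everything else is delegated to the wrapped Document.
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                // Rethrow the original exception (for example a DOMException).
                throw e.getCause();
            }
        };
        return (Document) Proxy.newProxyInstance(
                EncodingBlindDocument.class.getClassLoader(),
                new Class<?>[] { Document.class },
                handler);
    }
}
```

Under those assumptions, the write method from the question would pass new DOMSource(EncodingBlindDocument.wrap(fDoc)) instead of new DOMSource(fDoc). Child nodes returned by the proxy are the original, unwrapped nodes, which is fine here because only the Document itself exposes getXmlEncoding.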

Answer 6:

I'm taking a wild shot here, but you mention that you are reading files for the data of the tests. Can you make sure that you read the files using the proper encoding, so that when you write into your OutputStream you already have the data in the proper encoding?

So having something like new InputStreamReader(new FileInputStream(fileDir), "UTF8").

Don't forget that the single-argument constructors of FileReader always use the platform default encoding: "The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate."
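To illustrate why the charset passed to InputStreamReader matters, here is a small sketch that decodes the same bytes two ways. The class name ExplicitCharsetRead is made up for this example, and a ByteArrayInputStream stands in for the FileInputStream so the snippet needs no file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ExplicitCharsetRead {

    /** Decodes the bytes with the given charset instead of the platform default. */
    static String decode(byte[] bytes, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return text.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // The UTF-8 encoding of U+00E9 (e with acute accent) is the two bytes C3 A9.
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA9 };
        // Decoded as UTF-8 this is one character; as ISO-8859-1 it is two.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).length());
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1).length());
    }
}
```

Note that this only concerns text read through a Reader. When a DocumentBuilder parses an InputStream directly, the parser detects the encoding from the XML declaration itself.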

Answer 7:

Try setting the encoding on your StreamResult specifically:

StreamResult result = new StreamResult(new OutputStreamWriter(out, "UTF-8"));

This way, it should only be able to write out in UTF-8.

Source: https://stackoverflow.com/questions/15592025/transformer-setoutputpropertyoutputkeys-encoding-utf-8-is-not-working

Tags

java

xml

xml-serialization

transformer