How do I avoid double UTF-8 encoding in XML::LibXML

雨燕双飞 提交于 2020-07-21 07:39:07

问题


My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure. When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.

Here's a piece of code trying to visualize the problem:

use strict; use diagnostics;    use feature 'unicode_strings';
use utf8;   use v5.14;      use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)");    use open qw( :encoding(UTF-8) :std );
use XML::LibXML

# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";

# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" );    $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );

# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );

Spoiler 1: Good case:

<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>

Spoiler 2: Bad case:

<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>

The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.

What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?

(I'll gladly accept links to the corresponding part(s) of TFM that I should have R for as long as they are actually helpful ;)


回答1:


ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:

IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!

(serialize is just an alias for toString)

When you print a byte string to a filehandle marked with an :encoding layer, it gets encoded as if it were ISO-8859-1. Since you have a string containing UTF-8 bytes, it gets double encoded.

As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit a XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.




回答2:


Since XML documents are parsed without needing any external information, they are binary files rather than text files.

You're telling Perl to encode anything sent to STDOUT[1], but then you proceed to output an XML document to it. You can't apply a character encoding to a binary file as it corrupts it.

Replace

binmode(STDOUT, ":encoding(UTF-8)");

with

binmode(STDOUT);

Note: This assumes the rest of the text you are outputting is just temporary debugging information. The output doesn't otherwise make sense.


  1. In fact, you do this twice! Once using use open qw( :encoding(UTF-8) :std );, and then a second time using binmode(STDOUT, ":encoding(UTF-8)");.


来源:https://stackoverflow.com/questions/21096900/how-do-i-avoid-double-utf-8-encoding-in-xmllibxml

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!