XmlDocument with Kanji text content is not encoded correctly to ISO-8859-1 using XmlTextWriter

Submitted by 别来无恙 on 2021-01-28 05:35:53

Question


I have an XmlDocument that includes Kanji in its text content, and I need to write it to a stream using ISO-8859-1 encoding. When I do, none of the Kanji characters are encoded properly, and are instead replaced with "??".

Here is sample code that demonstrates how the XML is written from the XmlDocument:

MemoryStream mStream = new MemoryStream();
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
XmlTextWriter writer = new XmlTextWriter(mStream, enc);
doc.WriteTo(writer);
writer.Flush();
mStream.Flush();
mStream.Position = 0;
StreamReader sReader = new StreamReader(mStream, enc);
String formattedXML = sReader.ReadToEnd();

What can be done to correctly encode Kanji in this specific situation?


Answer 1:


As mentioned in the comments, the ? characters show up because Kanji is not supported by the ISO-8859-1 encoding, so the encoder substitutes ? as a fallback character. Encoding fallbacks are discussed in the Remarks section of the documentation for Encoding:

Note that the encoding classes allow errors (unsupported characters) to:

  • Silently change to a "?" character.
  • Use a "best fit" character.
  • Change to an application-specific behavior through use of the EncoderFallback and DecoderFallback classes with the U+FFFD Unicode replacement character.

This is the behavior you are seeing.
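This fallback is easy to reproduce outside of XML entirely. A minimal sketch (the sample string here is hypothetical, chosen only to include one Kanji character):

```csharp
using System;
using System.Text;

class FallbackDemo
{
    static void Main()
    {
        // ISO-8859-1 (Latin-1) cannot represent Kanji, so the default
        // encoder fallback substitutes '?' for each unsupported character.
        Encoding enc = Encoding.GetEncoding("ISO-8859-1");
        byte[] bytes = enc.GetBytes("畑 hatake");
        Console.WriteLine(enc.GetString(bytes)); // prints "? hatake"
    }
}
```

The data is silently corrupted at the byte level, which is why reading the stream back with the same encoding cannot recover the original characters.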

However, even though Kanji characters are not supported by ISO-8859-1, you can get a much better result by switching to the newer XmlWriter returned by XmlWriter.Create(Stream, XmlWriterSettings) and setting your encoding on XmlWriterSettings.Encoding like so:

MemoryStream mStream = new MemoryStream();

var enc = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
    Encoding = enc,
    CloseOutput = false,
    // Remove this setting if you want an XML declaration;
    // it's omitted here to match XmlTextWriter, which doesn't write one automatically.
    OmitXmlDeclaration = true,
};
using (var writer = XmlWriter.Create(mStream, settings))
{
    doc.WriteTo(writer);
}

mStream.Position = 0;
var sReader = new StreamReader(mStream, enc);
var formattedXML = sReader.ReadToEnd();

By setting the Encoding property of XmlWriterSettings, the writer knows which characters the target encoding cannot represent, and it automatically replaces each one with an XML character entity reference rather than a hardcoded fallback character.

E.g. say you have XML like the following:

<Root>
  <string>畑 はたけ hatake "field of crops"</string>
</Root>

Then your code will output the following, mapping all Kanji to the single fallback character:

<Root><string>? ??? hatake "field of crops"</string></Root>

Whereas the new version will output:

<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake "field of crops"</string></Root>

Notice that the Kanji characters have been replaced with character entity references such as &#x7551;. Any compliant XML parser will recognize and reconstruct those characters, so no information is lost even though the target encoding does not support Kanji.
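To confirm that the round trip is lossless, the entity-escaped output can be loaded back into an XmlDocument and the original text recovered. A small sketch using the escaped markup from above:

```csharp
using System;
using System.Xml;

class RoundTripDemo
{
    static void Main()
    {
        // The entity-escaped markup produced by XmlWriter...
        string escaped =
            "<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake \"field of crops\"</string></Root>";

        // ...parses back to the original Unicode text.
        var doc = new XmlDocument();
        doc.LoadXml(escaped);
        Console.WriteLine(doc.DocumentElement.InnerText);
        // prints: 畑 はたけ hatake "field of crops"
    }
}
```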

Finally, as an aside, note that the documentation for XmlTextWriter states:

Starting with the .NET Framework 2.0, we recommend that you use the System.Xml.XmlWriter class instead.

So replacing it with an XmlWriter is a good idea in general, not just for this encoding issue.

A sample .NET fiddle demonstrates usage of both writers and asserts that the XML generated by XmlWriter is semantically equivalent to the original XML despite the escaping of characters.



Source: https://stackoverflow.com/questions/48402686/xmldocument-with-kanji-text-content-is-not-encoded-correctly-to-iso-8859-1-using
