ISO6937 to UTF8 gives wrong results in C#

Question

I'm reading binary data from a file in C#, including strings that need to be decoded correctly.

I have no problems with, for example, the windows-1251 code page, but I get incorrect results for ISO6937. It looks like C# ignores the two-byte characters (accent + letter).

This is how I decode the string from the bytes:

Encoding.Convert(Encoding.GetEncoding("20269"), Encoding.UTF8, data)

Example:

Kraków

byte[] = 4B 72 61 6B C2 6F 77

result - Krak´ow

I did some research, but I found only some code from MediaPortal on their GitHub, which reads the two-byte characters manually. That is not the nicest way.

Am I doing something wrong, or is this a Visual Studio bug? (Why offer the ability to encode to ISO6937 if it does not work correctly?)

Answer 1:

The Wikipedia page for the encoding hints at the underlying problem. Quote: "ISO/IEC 6937 does not encode any combining characters whatsoever". So formally the .NET encoding does what the standard says, but in practice the result is not useful.

This can be done better than the linked GitHub code. The much cleaner approach is to write your own Encoding class. Almost all of the work can be delegated to the .NET encoding; you only have to intercept the diacritics. ISO 6937 stores the diacritic byte before the letter, whereas Unicode expects a combining mark after the letter, so each diacritic has to be replaced by its combining mark and swapped with the letter that follows. Like this:

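using System.Text;

class ISO6937Encoding : Encoding {
    // Code page 20269 is the built-in ISO 6937 encoding. On .NET Core and later it may
    // first need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    private Encoding enc = Encoding.GetEncoding(20269);

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
        // Let the built-in decoder do the work; this code assumes it emits one char per byte.
        int cnt = enc.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
        int last = charIndex + cnt - 1;   // index of the last decoded char
        for (int ix = 0; ix < byteCount; ix++, charIndex++) {
            int bx = byteIndex + ix;
            // Bytes 0xC1..0xCF are the non-spacing diacritics
            if (bytes[bx] >= 0xc1 && bytes[bx] <= 0xcf) {
                // A diacritic with no letter after it cannot be combined
                if (charIndex >= last) chars[charIndex] = '?';
                else {
                    // Combining marks for 0xC1..0xCF; '?' marks the unassigned bytes 0xC9 and 0xCC
                    const string subst = "\u0300\u0301\u0302\u0303\u0304\u0306\u0307\u0308?\u030a\u0327?\u030b\u0328\u030c";
                    // Move the letter in front and put the combining mark after it
                    chars[charIndex] = chars[charIndex + 1];
                    chars[charIndex + 1] = subst[bytes[bx] - 0xc1];
                    // Skip the letter that was just consumed
                    ++ix;
                    ++charIndex;
                }
            }
        }
        return cnt;
    }
    // Rest is boilerplate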
    public override int GetByteCount(char[] chars, int index, int count) {
        return enc.GetByteCount(chars, index, count);
    }
    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
        return enc.GetBytes(chars, charIndex, charCount, bytes, byteIndex);
    }
    public override int GetCharCount(byte[] bytes, int index, int count) {
        return enc.GetCharCount(bytes, index, count);
    }
    public override int GetMaxByteCount(int charCount) {
        return enc.GetMaxByteCount(charCount);
    }
    public override int GetMaxCharCount(int byteCount) {
        return enc.GetMaxCharCount(byteCount);
    }
}

Not extensively tested.
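One thing to keep in mind: the swap leaves the letter and its accent as two separate chars, so the example decodes to "o" followed by U+0301 rather than the single character "ó". If precomposed characters are wanted, normalizing the decoded string to form C should fold them together. A minimal sketch of that step, where `NormalizeDemo` and `Compose` are hypothetical names used only for illustration and the input string is written out by hand instead of coming from the decoder:

```csharp
using System;
using System.Text;

static class NormalizeDemo
{
    // Hypothetical helper: composes letter + combining mark pairs into
    // single precomposed characters wherever Unicode defines one.
    public static string Compose(string decoded)
    {
        return decoded.Normalize(NormalizationForm.FormC);
    }

    static void Main()
    {
        // Assumed decoder output for the example bytes: 'o' followed by U+0301
        string decoded = "Krako\u0301w";
        string composed = Compose(decoded);

        Console.WriteLine(decoded.Length);             // 7
        Console.WriteLine(composed.Length);            // 6
        Console.WriteLine(composed == "Krak\u00F3w");  // True
    }
}
```

Sequences that have no precomposed form in Unicode stay as letter plus combining mark, which is still valid text and converts to UTF-8 without loss.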

Source: https://stackoverflow.com/questions/43596227/iso6937-to-utf8-give-wrong-results-in-c-sharp

Tags

string

unicode

encoding

utf-8