Encoding issue with string stored in database

Question

I have an encoding problem. The text in my MongoDB is wrongly encoded. The source file for the texts in my db is encoded in ISO-8859-1. When I view the text in the db, some characters are broken (they have become '�').

To retrieve the text from the db, I tried the following code.

var t = Collection.FindOne(Query.EQ("id", "2014121500892"));
string message = t["b203"].AsString;
Console.WriteLine(ChangeEncoding(message));

First attempt:

static string ChangeEncoding(string message)
{

    System.Text.Encoding srcEnc = System.Text.Encoding.GetEncoding("ISO-8859-1");
    System.Text.Encoding destEnc = System.Text.Encoding.GetEncoding("UTF-8");
    byte[] bData = srcEnc.GetBytes(message);
    byte[] bResult = System.Text.Encoding.Convert(srcEnc, destEnc, bData);
    return destEnc.GetString(bResult);
}

Second attempt:

static string ChangeEncoding(string message)
{
    File.WriteAllText("text.txt", message, Encoding.GetEncoding("ISO-8859-1"));
    return File.ReadAllText("text.txt");
}

Sample text in db:

Box aus Pappe f�r A8-Lernk�rtchen

Desired result:

I want to be able to print it in the console as:

Box aus Pappe für A8-Lernkärtchen

Answer 1:

Short version

Your data is lost, and there is no general way to recover the original strings.

Longer version

What presumably happened is that, when the data was stored, the strings were encoded as ISO-8859-1 but the resulting bytes were interpreted as UTF-8. Here's an example:

string orig = "Lernkärtchen";
byte[] iso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(orig);
// { 76, 101, 114, 110, 107, 228, 114, 116, 99, 104, 101, 110 }
//  'L', 'e', 'r', 'n', 'k', 'ä', 'r', 't', 'c', 'h', 'e', 'n'

Then this data was passed (somehow...) to the database, which only works with Unicode strings:

string storedValue = Encoding.UTF8.GetString(iso88591Bytes);
byte[] dbData = Encoding.UTF8.GetBytes(storedValue);
// { 76, 101, 114, 110, 107, 239, 191, 189, 114, 116, 99, 104, 101, 110 }
//  'L', 'e', 'r', 'n', 'k',      '�',     'r', 't', 'c', 'h', 'e', 'n'

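The problem is that the byte 228 (11100100 in binary) is not valid UTF-8 at this position. A byte of the form 1110xxxx announces a three-byte sequence, so two continuation bytes of the form 10xxxxxx (values 128 to 191) must follow it. Here the next byte is 114 ('r'), which is not a continuation byte. For details, see the "Description" section of the UTF-8 article on Wikipedia.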
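The rule above can be checked with a few bit masks. This is only a minimal sketch: the class and method names (Utf8LeadByte, ExpectedLength, IsContinuation) are made up for illustration, and the check looks at the bit pattern only, so it does not reject every invalid lead byte (overlong forms, for example).

```csharp
using System;

static class Utf8LeadByte
{
    // Returns the sequence length announced by a UTF-8 lead byte,
    // or 0 if the byte cannot start a sequence (bit pattern check only).
    public static int ExpectedLength(byte b)
    {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: single byte (ASCII)
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx: one continuation byte follows
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx: two continuation bytes follow
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx: three continuation bytes follow
        return 0;                         // 10xxxxxx or 11111xxx: not a lead byte
    }

    // Continuation bytes have the form 10xxxxxx (128 to 191).
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    static void Main()
    {
        Console.WriteLine(ExpectedLength(228)); // 3: 'ä' in ISO-8859-1 looks like a 3-byte lead
        Console.WriteLine(IsContinuation(114)); // False: 'r' cannot continue the sequence
    }
}
```

So 228 promises two more bytes, and the 'r' that follows breaks that promise.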

As a result, the byte that used to represent 'ä' cannot be decoded into a valid Unicode character and is replaced. The replacement is stored as the bytes 239, 191 and 189 (11101111, 10111111 and 10111101 in binary), which encode the code point 1111111111111101 (0xFFFD). That is the character '�' you see in your output.
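You can reproduce the replacement yourself. The sketch below assumes the data really took the path described here (ISO-8859-1 bytes decoded as UTF-8); the helper name DecodeLatin1BytesAsUtf8 is made up for illustration. It also shows that two different original words end up as the same stored string.

```csharp
using System;
using System.Text;

static class ReplacementDemo
{
    // Simulates the suspected mistake: ISO-8859-1 bytes are decoded as if they were UTF-8.
    public static string DecodeLatin1BytesAsUtf8(string original)
    {
        byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(original);
        return Encoding.UTF8.GetString(latin1Bytes);
    }

    static void Main()
    {
        string a = DecodeLatin1BytesAsUtf8("für"); // 'ü' is the single byte 252
        string b = DecodeLatin1BytesAsUtf8("fär"); // 'ä' is the single byte 228

        Console.WriteLine(((int)a[1]).ToString("X4")); // FFFD
        Console.WriteLine(a == b);                      // True: both are now 'f', U+FFFD, 'r'
    }
}
```

Once both strings compare equal, nothing in the stored value tells you which one you started with.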

This character exists for exactly that purpose. Wikipedia's page on Unicode special characters says:

U+FFFD � replacement character used to replace an unknown or unrepresentable character

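Try to revert that change? Good luck. Every byte that could not be decoded was replaced by the same U+FFFD, so the stored text no longer says whether a '�' was once an 'ä', an 'ü' or something else.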
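If you still have the original ISO-8859-1 source file, the realistic way out is probably to import it again and name the encoding explicitly when reading it. The following is only a sketch under that assumption: the class name, the helper ReadLatin1File and the temp file are made up for illustration, and the step that writes the strings back to MongoDB is left out.

```csharp
using System;
using System.IO;
using System.Text;

static class SourceReimport
{
    // Reads a file that is known to be ISO-8859-1 and returns a proper .NET string.
    public static string ReadLatin1File(string path)
    {
        return File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"));
    }

    static void Main()
    {
        // Hypothetical stand-in for the real source file: "für" as ISO-8859-1 bytes.
        string path = Path.Combine(Path.GetTempPath(), "latin1-sample.txt");
        File.WriteAllBytes(path, new byte[] { 102, 252, 114 });

        string text = ReadLatin1File(path);

        // Ask the console for UTF-8 so the umlaut has a chance of being displayed.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(text);
    }
}
```

A string read this way contains the real 'ü', so it can be stored as an ordinary .NET string without going through another byte conversion.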

Btw, Unicode and UTF-8 are awesome ♥, never use anything else ☠!

Source: https://stackoverflow.com/questions/28216928/encoding-issue-with-string-stored-in-database

Tags

unicode

encoding

mongodb-.net-driver