Converting special characters such as ü and à back to their original, Latin alphabet counterparts in C#

感情败类 2020-12-30 02:02

I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time and contains a mix of HTML char codes such as

5 answers
  •  春和景丽
    2020-12-30 02:05

    The data is only partly unrecoverable, because Windows-1252 has 5 unassigned slots (bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D). Some modifications of Windows-1252 fill these with control characters, but those don't survive being posted on Stack Overflow. If a modified Windows-1252 was used, you can fully recover the text as long as you don't lose the hidden control characters when copying and pasting.
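
    To see which characters are affected, a rough sketch like the following can help. It only checks whether a character's UTF-8 form contains one of the five unassigned byte values; `IsAtRisk` and the sample string are made up for illustration and are not from the original data.

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    // The five byte values with no assignment in standard Windows-1252.
    byte[] unassigned = { 0x81, 0x8D, 0x8F, 0x90, 0x9D };

    // Hypothetical helper: true if the UTF-8 form of c contains an unassigned byte.
    bool IsAtRisk(char c) =>
        Encoding.UTF8.GetBytes(c.ToString()).Any(b => unassigned.Contains(b));

    // Sample letters chosen for illustration only.
    string atRisk = new string("ÀÁÍÏÐÝäöü".Where(IsAtRisk).ToArray());
    Console.WriteLine(atRisk);
    // ÁÍÏÐÝ
    ```

    The German umlauts in the example string are not in this set, which is why that string can be recovered completely.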

    There is also the non-breaking space character, which copy-paste usually drops or turns into a regular space, but that's not an issue when you deal with the bytes directly.

    The chain of misencodings this string has gone through is:

    UTF-8 -> Windows-1252 -> UTF-8 -> Windows-1252
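
    To make the chain concrete, here is a small sketch of the forward direction, i.e. how the damage is produced. `Misdecode` is a hypothetical helper name, and the `RegisterProvider` line assumes .NET Core 3.0+ / .NET 5+ (on .NET Framework, code page 1252 is available without it).

    ```csharp
    using System;
    using System.Text;

    // Assumption: modern .NET, where code page 1252 needs the code pages provider.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    Encoding utf8 = Encoding.UTF8;
    Encoding win1252 = Encoding.GetEncoding(1252);

    // Hypothetical helper: encode as UTF-8, then misread the bytes as Windows-1252.
    string Misdecode(string s) => win1252.GetString(utf8.GetBytes(s));

    string original = "für";
    string once = Misdecode(original);   // "fÃ¼r"
    string twice = Misdecode(once);      // "fÃƒÂ¼r"

    Console.WriteLine(once);
    Console.WriteLine(twice);
    ```

    Each pass turns one non-ASCII character into two or more, so a double pass is easy to spot by sequences like `ÃƒÂ`.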
    

    To recover, here is an example:

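    // Double-misencoded input, i.e. the result of the chain above
    string a = "DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen";
    // On .NET Core / .NET 5+, first call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
    Encoding utf8 = Encoding.UTF8;
    Encoding win1252 = Encoding.GetEncoding(1252);

    // Undo each layer: re-encode as Windows-1252, then decode the bytes as UTF-8 (done twice)
    string result = utf8.GetString(win1252.GetBytes(utf8.GetString(win1252.GetBytes(a))));

    Console.WriteLine(result);
    // Desinfektionslösungstücher für Flächen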
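
    If the export mixes clean and damaged strings, blindly applying both rounds would damage the clean ones. One possible guard, sketched here under assumptions: use strict encodings that throw instead of substituting characters, and stop as soon as a round does not decode cleanly. `RepairMojibake` and `maxRounds` are hypothetical names, and the `RegisterProvider` line again assumes .NET Core 3.0+ / .NET 5+.

    ```csharp
    using System;
    using System.Text;

    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    // Strict variants throw instead of silently substituting '?' or U+FFFD.
    Encoding strictUtf8 = new UTF8Encoding(false, true);
    Encoding strict1252 = Encoding.GetEncoding(
        1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

    // Hypothetical helper: undo up to maxRounds layers of "UTF-8 read as Windows-1252".
    string RepairMojibake(string text, int maxRounds = 2)
    {
        for (int i = 0; i < maxRounds; i++)
        {
            try
            {
                string candidate = strictUtf8.GetString(strict1252.GetBytes(text));
                if (candidate == text) break;   // pure ASCII, nothing to undo
                text = candidate;
            }
            catch (ArgumentException)
            {
                // Encoder/DecoderFallbackException: this layer is not mojibake, keep text as is.
                break;
            }
        }
        return text;
    }

    Console.WriteLine(RepairMojibake("DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher"));
    // Desinfektionslösungstücher
    ```

    This is a heuristic: a string that legitimately contains a sequence such as `Ã¼` would also be "repaired", so it is worth spot-checking the output.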
    
