“Fix” String encoding in Java

半阙折子戏 2020-12-08 01:11

I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).

4 answers
  •  星月不相逢
    2020-12-08 01:55

    I tried this and it mostly worked.

    Code to repair the encoding problem (it doesn't work perfectly, as shown further down):

     final Charset fromCharset = Charset.forName("windows-1252");
     final Charset toCharset = Charset.forName("UTF-8");
     String fixed = new String(input.getBytes(fromCharset), toCharset);
     System.out.println(input);
     System.out.println(fixed);
    

    The results are:

     input: â€¦Und ich beweg mich (aber heut nur langsam)
     fixed: …Und ich beweg mich (aber heut nur langsam)
    

    Here's another example:

     input: Waun da wuan ned wa (feat. Wolfgang KÃ¼hn)
     fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
    

    Here's what is happening and why the trick above seems to work:

    1. The original file was a UTF-8 encoded, comma-delimited text file.
    2. That file was imported with Excel, BUT the user mistakenly entered Windows-1252 for the encoding (which was probably the default encoding on his or her computer).
    3. The user thought the import was successful because all of the characters in the ASCII range looked okay.
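
    The three steps above can be reproduced in a few lines. This is only a minimal sketch, not code from the original import: the class name `MojibakeDemo` and the method `misdecode` are made up for illustration, and the non-ASCII characters are written as Unicode escapes so that the encoding of the source file itself cannot interfere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class MojibakeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the faulty import: text stored as UTF-8 bytes,
    // then decoded as if those bytes were windows-1252.
    static String misdecode(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, CP1252);
    }

    public static void main(String[] args) {
        String original = "K\u00fchn";          // U+00FC is u with diaeresis
        String garbled = misdecode(original);

        // The single non-ASCII character became two characters,
        // while the ASCII characters are unchanged.
        System.out.println(original.length());
        System.out.println(garbled.length());
        System.out.println(garbled);
    }
}
```

    ASCII characters have the same byte values in both encodings, which is why the import looked fine at first glance.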

    Now, when we try to "reverse" the process, here is what happens:

     // we start with this garbage: two characters we don't want!
     String input = "Ã¼";
    
     final Charset cp1252 = Charset.forName("windows-1252");
     final Charset utf8 = Charset.forName("UTF-8");
    
     // let's convert it to bytes in windows-1252:
     // this gives you 2 bytes: c3 bc
     // "Ã" ==> c3
     // "¼" ==> bc
     byte[] windows1252Bytes = input.getBytes(cp1252);
    
     // but in utf-8, c3 bc is "ü"
     String fixed = new String(windows1252Bytes, utf8);
    
     System.out.println(input);
     System.out.println(fixed);
    

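    The repair code above mostly works, but it fails for the following characters. Their UTF-8 encoding contains a byte (81, 8d, 8f, 90 or 9d) that Windows-1252 leaves undefined, so that byte is decoded to the replacement character and comes back as 3f ("?") when re-encoded:

    (This assumes the text only uses characters that Windows-1252 encodes in a single byte.)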

    char    utf-8 bytes     |   decoded as cp1252   -->   re-encoded as cp1252 bytes
    ”       e2 80 9d        |   â€�                       e2 80 3f
    Á       c3 81           |   Ã�                        c3 3f
    Í       c3 8d           |   Ã�                        c3 3f
    Ï       c3 8f           |   Ã�                        c3 3f
    Ð       c3 90           |   Ã�                        c3 3f
    Ý       c3 9d           |   Ã�                        c3 3f
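
    The lossy round trip in the table can be checked directly. The following is a minimal sketch under the same assumptions as the table (Java's built-in `windows-1252` charset with its default replacement behaviour); `LossyRoundTripDemo`, `roundTrip` and `hex` are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class LossyRoundTripDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Bytes that survive "encode as UTF-8, decode as cp1252, encode as cp1252 again".
    static byte[] roundTrip(String original) {
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
        return garbled.getBytes(CP1252);
    }

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00C1 (A with acute): UTF-8 bytes c3 81, and 81 is undefined in cp1252.
        System.out.println(hex(roundTrip("\u00c1")));
        // U+00E1 (a with acute): UTF-8 bytes c3 a1, both defined in cp1252.
        System.out.println(hex(roundTrip("\u00e1")));
    }
}
```

    Once the second byte has been replaced by 3f, the original character can no longer be recovered from the string alone.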
    

    It does work for some of the characters, e.g. these:

    char    utf-8 bytes     |   as cp1252   cp1252 bytes    fixed
    Þ       c3 9e           |   Ãž          c3 9e           Þ
    ß       c3 9f           |   ÃŸ          c3 9f           ß
    à       c3 a0           |   Ã           c3 a0           à
    á       c3 a1           |   Ã¡          c3 a1           á
    â       c3 a2           |   Ã¢          c3 a2           â
    ã       c3 a3           |   Ã£          c3 a3           ã
    ä       c3 a4           |   Ã¤          c3 a4           ä
    å       c3 a5           |   Ã¥          c3 a5           å
    æ       c3 a6           |   Ã¦          c3 a6           æ
    ç       c3 a7           |   Ã§          c3 a7           ç
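
    Since the simple repair silently substitutes "?" for anything it cannot map, it may be useful to detect those cases instead of producing a wrong result. One possible approach, sketched here as an assumption rather than as part of the original fix, is to run the same two conversions with a strict `CharsetEncoder` and `CharsetDecoder` that report errors. `StrictRepair` and `tryRepair` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper; the names are illustrative only.
final class StrictRepair {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns the repaired text, or empty if the input cannot be mapped back
    // to cp1252 bytes or those bytes are not valid UTF-8.
    static Optional<String> tryRepair(String garbled) {
        try {
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(garbled));
            CharBuffer fixed = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes);
            return Optional.of(fixed.toString());
        } catch (CharacterCodingException e) {
            // Either a character has no cp1252 byte (e.g. U+FFFD),
            // or the resulting bytes are not well-formed UTF-8.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Repairable: the two characters for bytes c3 bc.
        System.out.println(tryRepair("\u00c3\u00bc").isPresent());
        // Not repairable: contains the replacement character U+FFFD.
        System.out.println(tryRepair("\u00c3\ufffd").isPresent());
    }
}
```

    A side effect of the strict decoder is that text which was never garbled, but contains non-ASCII characters, is usually rejected as well, because its cp1252 bytes rarely form valid UTF-8.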
    

    NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.
