Unbaking mojibake

深忆病人 2020-12-18 09:26

When you have incorrectly decoded characters, how can you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png


        
1 answer
  • 2020-12-18 10:13

    You could use chardet (install with pip). In Python 2:

    import chardet
    
    your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
    detected_encoding = chardet.detect(your_str)["encoding"]
    
    try:
        correct_str = your_str.decode(detected_encoding)
    except UnicodeDecodeError:
        print("Could not estimate encoding")
    

    Result: 時間試験観点（アニメパス）_10秒 (no idea if this could be correct or not)

    For Python 3 (source file encoded as utf8):

    import chardet
    
    falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
    correct_str = None
    
    # Undo the wrong decoding step to recover the original bytes.
    try:
        encoded_str = falsely_decoded_str.encode("cp850")
    except UnicodeEncodeError:
        print("could not encode falsely decoded string")
        encoded_str = None
    
    if encoded_str:
        # chardet may return None here if it cannot guess an encoding.
        detected_encoding = chardet.detect(encoded_str)["encoding"]
    
        try:
            correct_str = encoded_str.decode(detected_encoding)
        except (UnicodeDecodeError, TypeError, LookupError):
            print("could not decode encoded_str as %s" % detected_encoding)
    
    # Only write the file if decoding actually produced a result.
    if correct_str is not None:
        with open("output.txt", "w", encoding="utf-8-sig") as out:
            out.write(correct_str)
    

    In summary:

    >>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
    >>> s.encode('cp850').decode('shift-jis')
    '時間試験観点（アニメパス）_10秒.png'
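
When the wrong codec (cp850 above) is not known in advance, one option is to brute-force pairs of codecs and keep every pair that round-trips without an error. The following is only a rough sketch of that idea, not part of the original answer: the function name `unbake_candidates` and the two codec lists are hypothetical and should be adjusted to the encodings you actually suspect. It uses only the standard library.

```python
from itertools import product

# Hypothetical candidate lists (assumptions); extend them for your own data.
WRONG_CODECS = ["cp850", "cp437", "cp1252", "latin-1"]   # codecs the text may have been wrongly decoded with
RIGHT_CODECS = ["shift-jis", "euc-jp", "utf-8", "gbk"]   # codecs the original bytes may really be in


def unbake_candidates(garbled, wrong_codecs=WRONG_CODECS, right_codecs=RIGHT_CODECS):
    """Return (wrong, right, text) for every codec pair that converts without error."""
    results = []
    for wrong, right in product(wrong_codecs, right_codecs):
        try:
            # Re-encode with the suspected wrong codec to get the raw bytes back,
            # then decode those bytes with the suspected correct codec.
            text = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled string
        results.append((wrong, right, text))
    return results


if __name__ == "__main__":
    s = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
    for wrong, right, text in unbake_candidates(s):
        print(wrong, right, text)
```

A pair that survives is only a candidate: several codecs may accept the same bytes and produce different text, so the final choice still needs a human reader or a language check.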
    