efficiently replace bad characters

梦毁少年i 2020-12-07 21:28

I often work with utf-8 text containing characters like:

\xc2\x99

\xc2\x95

\xc2\x85

etc

6 answers
  •  情深已故
    2020-12-07 21:37

    I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.

    \xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?

    Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
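
    To make that hypothesis concrete, here is a minimal sketch of how such bytes could arise. It assumes the wrong decoder was Latin-1 (ISO-8859-1), which maps byte 0x95 straight to U+0095. That codec is an assumption for illustration, and the actual culprit in a given pipeline may differ.

```python
# Sketch only: assumes the text was originally Windows-1252 and that
# the mis-decoding step behaved like Latin-1 (an assumption).
original = b'\x95'                    # BULLET in Windows-1252
wrong = original.decode('latin-1')    # mis-decoded: yields U+0095, a C1 control
stored = wrong.encode('utf-8')        # what ends up in the "UTF-8" data

print(stored)  # b'\xc2\x95'
```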

    If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)

    >>> b'\x95'.decode('windows-1252')
    '\u2022'
    >>> import unicodedata
    >>> unicodedata.name(_)
    'BULLET'
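
    Starting from the UTF-8 bytes in the question rather than from a single Windows-1252 byte, the full round trip might look like the sketch below. Encoding through Latin-1 is used here only as a way to turn code points U+0000 to U+00FF back into single bytes. That choice is an assumption, and it raises UnicodeEncodeError if the text contains any character above U+00FF.

```python
import unicodedata

data = b'\xc2\x95'                    # the bytes as they appear in the UTF-8 text
as_text = data.decode('utf-8')        # '\x95', i.e. U+0095
raw = as_text.encode('latin-1')       # back to the single byte 0x95
fixed = raw.decode('windows-1252')    # decode with the presumed original encoding

print(unicodedata.name(fixed))        # BULLET
```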
    

    If you want to do this correction for all the C1 control characters (U+0080–U+009F, i.e. bytes 0x80–0x9F) that correspond to valid Windows-1252 characters, you can use this approach:

    def restore_windows_1252_characters(s):
        """Replace C1 control characters in the Unicode string s by the
        characters at the corresponding code points in Windows-1252,
        where possible.
    
        """
        import re
        def to_windows_1252(match):
            try:
                return bytes([ord(match.group(0))]).decode('windows-1252')
            except UnicodeDecodeError:
                # No character at the corresponding code point: remove it.
                return ''
        # The C1 range runs to U+009F; Windows-1252 defines characters up to 0x9F.
        return re.sub(r'[\u0080-\u009f]', to_windows_1252, s)
    

    For example:

    >>> restore_windows_1252_characters('\x95\x99\x85')
    '•™…'
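
    Since the question asks about efficiency, one alternative is `str.translate` with a precomputed table. This is a different technique from the regular-expression approach, sketched here under the assumption that the text is already decoded to a Python 3 `str`. The names `build_c1_table`, `C1_TO_WINDOWS_1252`, and `restore_with_translate` are hypothetical and chosen only for this example.

```python
def build_c1_table():
    """Map each C1 code point (U+0080 to U+009F) to its Windows-1252
    counterpart, or to None where Windows-1252 defines no character."""
    table = {}
    for cp in range(0x80, 0xA0):
        try:
            table[cp] = bytes([cp]).decode('windows-1252')
        except UnicodeDecodeError:
            table[cp] = None  # str.translate deletes characters mapped to None
    return table

# Built once and reused for every string.
C1_TO_WINDOWS_1252 = build_c1_table()

def restore_with_translate(s):
    """Replace C1 controls in s using the precomputed table."""
    return s.translate(C1_TO_WINDOWS_1252)

# Usage: decode the UTF-8 bytes first, then repair the text.
text = b'caf\xc3\xa9 \xc2\x95 \xc2\x99 \xc2\x85'.decode('utf-8')
fixed = restore_with_translate(text)
print(fixed)  # café • ™ …
```

    Whether this is actually faster than `re.sub` depends on the data, so it is worth timing both on a representative sample.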
    
