Efficiently replace bad characters

梦毁少年i 2020-12-07 21:28

I often work with utf-8 text containing characters like:

\xc2\x99

\xc2\x95

\xc2\x85

etc

6 answers
  •  慢半拍i (original poster)
    2020-12-07 21:36

    There are always regular expressions; just list all of the offending characters inside square brackets, like so:

    import re
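    print(re.sub(br'[\xc2\x99]', b" ", b"Hello\xc2There\x99"))
    

    In Python 3 this prints b'Hello There ', with the unwanted bytes replaced by spaces. Note that a character class matches each listed byte on its own, not the two-byte sequence \xc2\x99 as a unit.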
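
The difference between matching single bytes and matching whole sequences can be sketched as follows. This is a minimal illustration assuming Python 3 and UTF-8 encoded `bytes` input; the sample data and the names `per_byte` and `per_sequence` are made up for the example.

```python
import re

# Hypothetical sample: two "bad" two-byte UTF-8 sequences embedded in ASCII text.
data = b"Hello\xc2\x99There\xc2\x85"

# Character class: every listed byte is matched on its own,
# so each two-byte sequence produces two replacements.
per_byte = re.sub(br'[\xc2\x99\x85]', b" ", data)

# Alternation: each two-byte sequence is matched as a single unit.
per_sequence = re.sub(br'\xc2\x99|\xc2\x85', b" ", data)

print(per_byte)      # b'Hello  There  '
print(per_sequence)  # b'Hello There '
```

If one replacement per bad character is wanted, the alternation form is probably the safer choice.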

    Alternatively, if each character needs a different replacement:

    import re

    # remove annoying characters (keys and values are UTF-8 byte strings)
    chars = {
        b'\xc2\x82' : b',',        # High code comma
        b'\xc2\x84' : b',,',       # High code double comma
        b'\xc2\x85' : b'...',      # Triple dot
        b'\xc2\x88' : b'^',        # High caret
        b'\xc2\x91' : b'\x27',     # Forward single quote
        b'\xc2\x92' : b'\x27',     # Reverse single quote
        b'\xc2\x93' : b'\x22',     # Forward double quote
        b'\xc2\x94' : b'\x22',     # Reverse double quote
        b'\xc2\x95' : b' ',
        b'\xc2\x96' : b'-',        # High hyphen
        b'\xc2\x97' : b'--',       # Double hyphen
        b'\xc2\x99' : b' ',
        b'\xc2\xa0' : b' ',
        b'\xc2\xa6' : b'|',        # Split vertical bar
        b'\xc2\xab' : b'<<',       # Double less than
        b'\xc2\xbb' : b'>>',       # Double greater than
        b'\xc2\xbc' : b'1/4',      # one quarter
        b'\xc2\xbd' : b'1/2',      # one half
        b'\xc2\xbe' : b'3/4',      # three quarters
        b'\xca\xbf' : b'\x27',     # c-single quote
        b'\xcc\xa8' : b'',         # modifier - under curve
        b'\xcc\xb1' : b''          # modifier - under line
    }
    # enclosing function (name assumed); text is UTF-8 encoded bytes
    def replace_bad_chars(text):
        def replace_chars(match):
            char = match.group(0)
            return chars[char]
        # escape each key so it is matched literally, then join into one alternation
        return re.sub(b'(' + b'|'.join(map(re.escape, chars)) + b')', replace_chars, text)
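
As a possible alternative to the regex table above, the bytes can be decoded first and the replacements applied with `str.translate`, which handles all of them in one pass over the string. This is only a sketch, assuming Python 3 and valid UTF-8 input; `TABLE` and `clean_text` are hypothetical names, and the mapping covers just a few of the sequences listed above, given as their decoded code points. Whether it is actually faster than the regex version would need to be measured on real data.

```python
# Hypothetical mapping from code point to replacement text.
# Each key is the decoded form of a byte sequence from the table above,
# e.g. the bytes \xc2\x85 decode to U+0085.
TABLE = {
    0x0085: "...",   # \xc2\x85
    0x0095: " ",     # \xc2\x95
    0x0099: " ",     # \xc2\x99
    0x00A0: " ",     # \xc2\xa0 (no-break space)
    0x00BD: "1/2",   # \xc2\xbd
    0x0331: None,    # \xcc\xb1 (combining underline); None deletes it
}

def clean_text(raw):
    """Decode UTF-8 bytes and replace the mapped code points."""
    return raw.decode("utf-8").translate(TABLE)

result = clean_text(b"Wait\xc2\x85 1\xc2\xa0\xc2\xbd cup")
print(result)  # Wait... 1 1/2 cup
```

Note that this returns `str` rather than bytes, so it fits best when the text is going to be decoded anyway.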
    
