Python: Replace typographical quotes, dashes, etc. with their ASCII counterparts

Backend | Unresolved | 5 | 1792
梦如初夏 2020-12-31 00:57

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy and paste it into my site's editor (simple textarea, n

5 Answers
  •  梦谈多话
    2020-12-31 01:57

    You can build on top of the unidecode package.

    This is pretty slow, since we first normalize the whole string to the composed form (NFC) and then run every character through unidecode to see what it turns into. If the result starts with a Latin letter, we keep the original NFC character. If not, we yield whatever ASCII replacement unidecode suggests. This leaves accented letters (and any other character that unidecode maps to a Latin letter) alone and converts the rest, such as typographic quotes and dashes.

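    import re
    import unicodedata

    import unidecode

    # Compiled once; matches unidecode output that starts with an ASCII letter.
    LATIN = re.compile('[a-zA-Z]+')

    def char_filter(string):
        # Work on the composed (NFC) form so that "é" is a single character.
        for char in unicodedata.normalize('NFC', string):
            decoded = unidecode.unidecode(char)
            if LATIN.match(decoded):
                # Decodes to a letter (e.g. an accented one): keep the original.
                yield char
            else:
                # Punctuation, symbols, etc.: use unidecode's ASCII form.
                yield decoded

    def clean_string(string):
        return "".join(char_filter(string))

    print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
    # prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé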
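
    If speed matters and only a known handful of characters needs replacing, a narrower approach might be a fixed translation table using just the standard library. This is a different technique from the unidecode filter above, not a drop-in equivalent: the table contents and the names PUNCTUATION_MAP and replace_typographic_punctuation are assumptions chosen for illustration, and any character not listed passes through unchanged.

```python
# Hypothetical stdlib-only sketch: map a fixed set of typographic characters
# to ASCII and leave everything else (including accented letters) untouched.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
})

def replace_typographic_punctuation(text):
    """Return text with the characters listed in PUNCTUATION_MAP replaced."""
    return text.translate(PUNCTUATION_MAP)

print(replace_typographic_punctuation("“Beyoncé”’s naïve papier–mâché…"))
# prints "Beyoncé"'s naïve papier-mâché...
```

    The trade-off is coverage: str.translate does a single pass with one lookup per character, but it only handles what the table lists, whereas unidecode covers a much wider range of characters.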
    
