Python: Replace typographical quotes, dashes, etc. with their ASCII counterparts


梦如初夏 2020-12-31 00:57

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy and paste it into my site's editor (simple textarea, n

5 answers

梦谈多话 (original poster)

2020-12-31 01:57

You can build on top of the unidecode package.

This is pretty slow, since we first normalize the whole string to the composed form (NFC) and then call unidecode on every single character to see what it turns it into. If the result starts with a Latin letter, we keep the original NFC character. If not, we yield whatever degarbling unidecode has suggested. This leaves accented letters (and any other character that transliterates to a letter) alone, but will convert everything else.

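import re
import unicodedata

import unidecode  # third-party package (published on PyPI as Unidecode)

# Matches when unidecode's transliteration starts with an ASCII letter.
LATIN = re.compile('[a-zA-Z]+')

def char_filter(string):
    # Normalize to NFC first so an accented letter is a single code point where possible.
    for char in unicodedata.normalize('NFC', string):
        decoded = unidecode.unidecode(char)
        if LATIN.match(decoded):
            # Transliterates to a letter: keep the original (possibly accented) character.
            yield char
        else:
            # Quotes, dashes, symbols, etc.: use unidecode's ASCII replacement.
            yield decoded

def clean_string(string):
    return "".join(char_filter(string))

print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé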
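
If speed matters and you only care about a known set of punctuation, a different approach (not the unidecode one above) is a fixed translation table with the standard library's str.translate. This is only a minimal sketch: the name PUNCTUATION_MAP, the function name, and the particular characters in the table are my own hand-picked assumptions, so extend the table with whatever your editors actually paste.

```python
# Hypothetical, hand-picked mapping; it is not exhaustive.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / typographic apostrophe
    "\u201a": "'",    # single low-9 quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
})

def replace_typographic_punctuation(text):
    """Replace only the mapped characters; accented letters are never touched."""
    return text.translate(PUNCTUATION_MAP)

print(replace_typographic_punctuation("vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
# prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
```

The trade-off is that anything missing from the table passes through unchanged, whereas the unidecode version tries to convert every character it knows about.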
