Python: Replace typographical quotes, dashes, etc. with their ASCII counterparts

梦如初夏 2020-12-31 00:57

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy & paste it into my site's editor (simple textarea, n

5 Answers
  • 2020-12-31 01:41

    You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). Note the Unicode-related part of the documentation, though: for Unicode strings the translation table takes a different form, mapping a Unicode ordinal to a Unicode string (usually a single character) or to None.

    It does require building a dict, but you have to capture the replacements somewhere anyway -- how would you do that without some kind of table or array? You could call str.replace() once per character instead, but that would be less efficient.
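
    For illustration, a minimal sketch of that mapping form (the example characters and replacements are my own, not taken from the question); a target can be a multi-character string, or None to delete the character:

    table = {
        ord(u"\u2018"): u"'",    # left single quote   -> '
        ord(u"\u2019"): u"'",    # right single quote  -> '
        ord(u"\u201c"): u'"',    # left double quote   -> "
        ord(u"\u201d"): u'"',    # right double quote  -> "
        ord(u"\u2013"): u"-",    # en dash             -> -
        ord(u"\u2014"): u"--",   # em dash             -> --
        ord(u"\u2026"): u"...",  # horizontal ellipsis -> ...
        ord(u"\u200b"): None,    # zero-width space    -> removed
    }
    print(u"\u201cHello\u201d \u2013 world\u2026".translate(table))
    # "Hello" - world...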

  • 2020-12-31 01:49

    What about this? It creates the translation table first, but honestly I don't think you can do this without one.

    # map each typographic character (by code point) to its ASCII counterpart
    transl_table = dict( [ (ord(x), ord(y)) for x,y in zip( u"‘’´“”–-",  u"'''\"\"--") ] )
    
    with open( "a.txt", "w", encoding = "utf-8" ) as f_out : 
        a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”   "
        print( " a_str = " + a_str, file = f_out )
    
        fixed_str = a_str.translate( transl_table )
        print( " fixed_str = " + fixed_str, file = f_out  )
    

    I wasn't able to print this to the console on Windows, so I had to write to a .txt file instead.
    The output in the a.txt file looks as follows:

    a_str = ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
    fixed_str = 'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"

    By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the differences in how the two versions handle Unicode strings.
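
    Roughly, an untested sketch of the Python 2 adjustments (assumptions: a source-encoding declaration for the literal characters, u'' literals, codecs.open in place of the encoding= argument of open(), and unicode.translate, which accepts the same ordinal -> ordinal/string/None mapping):

    # -*- coding: utf-8 -*-
    import codecs

    # same table as above; unicode.translate takes the same mapping form
    transl_table = dict( [ (ord(x), ord(y)) for x,y in zip( u"‘’´“”–-",  u"'''\"\"--") ] )

    a_str = u"‘nice single quotes’ “nice double quotes” long–-and–-short dashes"
    fixed_str = a_str.translate(transl_table)

    # codecs.open provides the encoding handling that Python 3's open() has built in
    with codecs.open("a.txt", "w", encoding="utf-8") as f_out:
        f_out.write(fixed_str + u"\n")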

  • 2020-12-31 01:49

    There is no "proper" solution here, because for an arbitrary Unicode character there is no defined "ASCII counterpart".

    For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to those names:

    #!/usr/bin/env python3
    
    import unicodedata
    
    def unicode_character_name(char):
        try:
            return unicodedata.name(char)
        except ValueError:
            return None
    
    # Generate all Unicode characters with their names
    all_unicode_characters = []
    for n in range(0x110000):       # all code points, Unicode planes 0-16
        char = chr(n)               # Python 3
        #char = unichr(n)           # Python 2
        name = unicode_character_name(char)
        if name:
            all_unicode_characters.append((char, name))
    
    # Find all Unicode quotation marks
    print (' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
    # " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 "                                                                     
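
    The same name lookup covers the hyphens and dashes mentioned above; a minimal continuation of the script (output omitted here, since the list is long):

    # Find all Unicode hyphens and dashes by name
    print(' '.join([char for char, name in all_unicode_characters
                    if 'HYPHEN' in name or 'DASH' in name]))

    Either way, once you see how many distinct characters turn up, you still have to decide, character by character, which ASCII replacement (if any) you want -- which is why some explicit mapping table is unavoidable.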
  • 2020-12-31 01:52

    This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html

    -S, --smart Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.” (Note: This option is significant only when the input format is markdown or textile. It is selected automatically when the input format is textile or the output format is latex or context.)

    It's Haskell, so you'd have to figure out the interface -- most likely by just shelling out to the pandoc binary.
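
    A minimal sketch of doing that from Python (assuming an older pandoc release that still accepts -S; pandoc 2.x replaced the flag with the "smart" extension syntax):

    import subprocess

    # Pipe text through pandoc and read back the converted result.
    result = subprocess.run(
        ["pandoc", "-S", "-f", "markdown", "-t", "markdown"],
        input='some text with -- dashes and "straight quotes"',
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)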

  • 2020-12-31 01:57

    You can build on top of the unidecode package.

    This is pretty slow, since we first normalize all the text to the composed form (NFC) and then check what unidecode turns each character into. If the transliteration is a Latin letter, we keep the original NFC character; if not, we yield whatever replacement unidecode suggests. This leaves accented letters alone but converts everything else.

    import unidecode
    import unicodedata
    import re
    
    def char_filter(string):
        latin = re.compile('[a-zA-Z]+')
        for char in unicodedata.normalize('NFC', string):
            decoded = unidecode.unidecode(char)
            if latin.match(decoded):
                # transliterates to a Latin letter: keep the original character
                yield char
            else:
                # punctuation, symbols, etc.: use unidecode's ASCII replacement
                yield decoded
    
    def clean_string(string):
        return "".join(char_filter(string))
    
    print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
    # prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
    