Python: Replace typographical quotes, dashes, etc. with their ASCII counterparts

梦如初夏 2020-12-31 00:57

On my website people can post news, and quite a few editors use MS Word and similar tools to write the text and then copy & paste it into my site's editor (simple textarea, n

5 Answers
  • 2020-12-31 01:41

    You can use the str.translate() method (http://docs.python.org/library/stdtypes.html#str.translate). Note the Unicode-related part of the documentation, though: for Unicode strings the translation table takes a different form, mapping a Unicode ordinal to a Unicode string (usually a single character) or to None.

    It does require building a dict, but you have to capture the replacements somewhere anyway -- how would you do that without some kind of table or array? You could call str.replace() once per character instead, but that would be less efficient.
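
    For illustration, a minimal sketch of that mapping form (the example characters and replacements are my own, not taken from the question); a target can be a multi-character string, or None to delete the character:

    table = {
        ord(u"\u2018"): u"'",    # left single quote   -> '
        ord(u"\u2019"): u"'",    # right single quote  -> '
        ord(u"\u201c"): u'"',    # left double quote   -> "
        ord(u"\u201d"): u'"',    # right double quote  -> "
        ord(u"\u2013"): u"-",    # en dash             -> -
        ord(u"\u2014"): u"--",   # em dash             -> --
        ord(u"\u2026"): u"...",  # horizontal ellipsis -> ...
        ord(u"\u200b"): None,    # zero-width space    -> removed
    }
    print(u"\u201cHello\u201d \u2013 world\u2026".translate(table))
    # "Hello" - world...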

  • 2020-12-31 01:49

    What about this? It creates the translation table first, but honestly I don't think you can do this without one.

    # map each typographic character (by code point) to its ASCII counterpart
    transl_table = dict( [ (ord(x), ord(y)) for x,y in zip( u"‘’´“”–-",  u"'''\"\"--") ] )
    
    with open( "a.txt", "w", encoding = "utf-8" ) as f_out : 
        a_str = u" ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”   "
        print( " a_str = " + a_str, file = f_out )
    
        fixed_str = a_str.translate( transl_table )
        print( " fixed_str = " + fixed_str, file = f_out  )
    

    I wasn't able to print this to the console on Windows, so I had to write to a .txt file instead.
    The output in the a.txt file looks as follows:

    a_str = ´funny single quotes´ long–-and–-short dashes ‘nice single quotes’ “nice double quotes”
    fixed_str = 'funny single quotes' long--and--short dashes 'nice single quotes' "nice double quotes"

    By the way, the code above works in Python 3. If you need it for Python 2, it might need some fixes due to the differences in how the two versions handle Unicode strings.
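
    Roughly, an untested sketch of the Python 2 adjustments (assumptions: a source-encoding declaration for the literal characters, u'' literals, codecs.open in place of the encoding= argument of open(), and unicode.translate, which accepts the same ordinal -> ordinal/string/None mapping):

    # -*- coding: utf-8 -*-
    import codecs

    # same table as above; unicode.translate takes the same mapping form
    transl_table = dict( [ (ord(x), ord(y)) for x,y in zip( u"‘’´“”–-",  u"'''\"\"--") ] )

    a_str = u"‘nice single quotes’ “nice double quotes” long–-and–-short dashes"
    fixed_str = a_str.translate(transl_table)

    # codecs.open provides the encoding handling that Python 3's open() has built in
    with codecs.open("a.txt", "w", encoding="utf-8") as f_out:
        f_out.write(fixed_str + u"\n")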

  • 2020-12-31 01:49

    There is no "proper" solution here, because for an arbitrary Unicode character there is no defined "ASCII counterpart".

    For example, take the seemingly easy characters that you might want to map to ASCII single and double quotes and hyphens. First, let's generate all the Unicode characters with their official names. Second, let's find all the quotation marks, hyphens and dashes according to those names:

    #!/usr/bin/env python3
    
    import unicodedata
    
    def unicode_character_name(char):
        try:
            return unicodedata.name(char)
        except ValueError:
            return None
    
    # Generate all Unicode characters with their names
    all_unicode_characters = []
    for n in range(0x110000):       # all code points, Unicode planes 0-16
        char = chr(n)               # Python 3
        #char = unichr(n)           # Python 2
        name = unicode_character_name(char)
        if name:
            all_unicode_characters.append((char, name))
    
    # Find all Unicode quotation marks
    print (' '.join([char for char, name in all_unicode_characters if 'QUOTATION MARK' in name]))
    # " « » ‘ ’ ‚ ‛ “ ” „ ‟ ‹ › ❛ ❜ ❝ ❞ ❟ ❠ ❮ ❯ ⹂ 〝 〞 〟 "                                                                     
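
    The same name lookup covers the hyphens and dashes mentioned above; a minimal continuation of the script (output omitted here, since the list is long):

    # Find all Unicode hyphens and dashes by name
    print(' '.join([char for char, name in all_unicode_characters
                    if 'HYPHEN' in name or 'DASH' in name]))

    Either way, once you see how many distinct characters turn up, you still have to decide, character by character, which ASCII replacement (if any) you want -- which is why some explicit mapping table is unavoidable.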
  • 2020-12-31 01:52

    This tool will normalize punctuation in markdown: http://johnmacfarlane.net/pandoc/README.html

    -S, --smart Produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.” (Note: This option is significant only when the input format is markdown or textile. It is selected automatically when the input format is textile or the output format is latex or context.)

    It's Haskell, so you'd have to figure out the interface -- most likely by just shelling out to the pandoc binary.
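
    A minimal sketch of doing that from Python (assuming an older pandoc release that still accepts -S; pandoc 2.x replaced the flag with the "smart" extension syntax):

    import subprocess

    # Pipe text through pandoc and read back the converted result.
    result = subprocess.run(
        ["pandoc", "-S", "-f", "markdown", "-t", "markdown"],
        input='some text with -- dashes and "straight quotes"',
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)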

  • 2020-12-31 01:57

    You can build on top of the unidecode package.

    This is pretty slow, since we first normalize all the text to the composed form (NFC) and then check what unidecode turns each character into. If the transliteration is a Latin letter, we keep the original NFC character; if not, we yield whatever replacement unidecode suggests. This leaves accented letters alone but converts everything else.

    import unidecode
    import unicodedata
    import re
    
    def char_filter(string):
        latin = re.compile('[a-zA-Z]+')
        for char in unicodedata.normalize('NFC', string):
            decoded = unidecode.unidecode(char)
            if latin.match(decoded):
                # transliterates to a Latin letter: keep the original character
                yield char
            else:
                # punctuation, symbols, etc.: use unidecode's ASCII replacement
                yield decoded
    
    def clean_string(string):
        return "".join(char_filter(string))
    
    print(clean_string(u"vis-à-vis “Beyoncé”’s naïve papier–mâché résumé"))
    # prints vis-à-vis "Beyoncé"'s naïve papier-mâché résumé
    