How to replace unicode characters by ascii characters in Python (perl script given)?

日久生厌 2020-12-05 01:15

I am trying to learn Python and couldn't figure out how to translate the following Perl script to Python:

#!/usr/bin/perl -w                     

use open          


        
5 Answers
  • 2020-12-05 01:26
    • Use the fileinput module to loop over standard input or a list of files,
    • decode the lines you read from UTF-8 to unicode objects
    • then map any unicode characters you desire with the translate method

    translit.py would look like this:

    #!/usr/bin/env python2.6
    # -*- coding: utf-8 -*-
    
    import fileinput
    
    # Keys are code points; values are replacement strings, and None deletes the character.
    table = {
              0xe4: u'ae',
              ord(u'ö'): u'oe',
              ord(u'ü'): u'ue',
              ord(u'ß'): None,
            }
    
    for line in fileinput.input():
        s = line.decode('utf8')
        print s.translate(table), 
    

    And you could use it like this:

    $ cat utf8.txt 
    sömé täßt
    sömé täßt
    sömé täßt
    
    $ ./translit.py utf8.txt 
    soemé taet
    soemé taet
    soemé taet
    
    • Update:

    In Python 3, strings are Unicode by default, so you don't need to decode the text or use the u'' prefix, even if it contains non-ASCII or non-Latin characters. The solution then looks as follows:

    line = 'Verhältnismäßigkeit, Möglichkeit'
    
    table = {
             ord('ä'): 'ae',
             ord('ö'): 'oe',
             ord('ü'): 'ue',
             ord('ß'): 'ss',
           }
    
    line.translate(table)
    # returns 'Verhaeltnismaessigkeit, Moeglichkeit'
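    To apply the same table to whole files in Python 3, something along these lines might work. This is only a sketch: the names GERMAN_TABLE, transliterate and transliterate_files are illustrative and not part of the original answer, and the named input files are assumed to be UTF-8 encoded.

```python
import fileinput
import sys

# str.maketrans accepts a dict keyed by single characters.
GERMAN_TABLE = str.maketrans({
    "ä": "ae",
    "ö": "oe",
    "ü": "ue",
    "ß": "ss",
})


def transliterate(text, table=GERMAN_TABLE):
    """Return text with every character found in table replaced."""
    return text.translate(table)


def transliterate_files(paths, out=sys.stdout):
    """Write the transliterated lines of the given files to out.

    Assumption: the named files are UTF-8. With an empty list of paths,
    fileinput falls back to standard input, which Python decodes using
    its own settings rather than the openhook.
    """
    hook = fileinput.hook_encoded("utf-8")
    with fileinput.input(files=paths, openhook=hook) as lines:
        for line in lines:
            out.write(transliterate(line))
```

    A script could then call transliterate_files(sys.argv[1:]) to mimic the command-line usage shown earlier.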
    
  • 2020-12-05 01:29

    You could try unidecode to convert Unicode into ASCII instead of writing regular expressions by hand. It is a Python port of the Text::Unidecode Perl module:

    #!/usr/bin/env python
    import fileinput
    import locale
    from contextlib import closing
    from unidecode import unidecode # $ pip install unidecode
    
    def toascii(files=None, encoding=None, bufsize=-1):
        if encoding is None:
            encoding = locale.getpreferredencoding(False)
        with closing(fileinput.FileInput(files=files, bufsize=bufsize)) as file:
            for line in file: 
                print unidecode(line.decode(encoding)),
    
    if __name__ == "__main__":
        import sys
        toascii(encoding=sys.argv.pop(1) if len(sys.argv) > 1 else None)
    

    It uses the FileInput class to avoid global state.

    Example:

    $ echo 'äöüß' | python toascii.py utf-8
    aouss
    
  • 2020-12-05 01:30

    I use translitcodec:

    >>> import translitcodec
    >>> print '\xe4'.decode('latin-1')
    ä
    >>> print '\xe4'.decode('latin-1').encode('translit/long').encode('ascii')
    ae
    >>> print '\xe4'.decode('latin-1').encode('translit/short').encode('ascii')
    a
    

    You can change the encoding passed to decode to whatever your input uses. You may want to wrap the chain in a small helper function to keep call sites short:

    def fancy2ascii(s):
        return s.decode('latin-1').encode('translit/long').encode('ascii')
    
  • 2020-12-05 01:39

    For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:

    >>> title = u"Klüft skräms inför på fédéral électoral große"
    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
    'Kluft skrams infor pa federal electoral groe'
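    In Python 3 the result of encode is a bytes object, so the same recipe needs a final decode to get a str back. A minimal sketch, where the helper name strip_accents is made up for illustration:

```python
import unicodedata


def strip_accents(text):
    """Drop diacritics via NFKD decomposition and return a plain ASCII str."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode(..., "ignore") discards everything outside ASCII: the combining
    # marks left over from decomposition, and also characters that have no
    # decomposition at all, such as "ß".
    return decomposed.encode("ascii", "ignore").decode("ascii")


title = "Klüft skräms inför på fédéral électoral große"
print(strip_accents(title))  # Kluft skrams infor pa federal electoral groe
```

    Note that "große" comes out as "groe", just as in the Python 2 session above, because "ß" is silently dropped rather than expanded to "ss".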
    
  • 2020-12-05 01:48

    Quick and dirty (Python 2):

    def make_ascii(string):
        return string.decode('utf-8').replace(u'ü', u'ue').replace(u'ö', u'oe').replace(u'ä', u'ae').replace(u'ß', u'ss').encode('ascii', 'ignore')
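    A rough Python 3 counterpart could look like the sketch below. It is a variation rather than a literal port: it takes a str instead of UTF-8 bytes, also maps the uppercase umlauts, and adds an NFKD step so that other accented letters keep their base letter instead of being dropped. The table name UMLAUTS is an assumption made for this example.

```python
import unicodedata

# German-specific expansions, applied before the generic accent stripping.
UMLAUTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def make_ascii(text):
    """Expand German umlauts, then strip any remaining diacritics."""
    # Compose first so that "a" + combining diaeresis is seen as "ä".
    text = unicodedata.normalize("NFC", text)
    text = text.translate(UMLAUTS)
    # Decompose what is left and drop everything outside ASCII.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")
```

    For example, make_ascii("sömé täßt") gives "soeme taesst".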
    