latin-1 to ascii

佛祖请我去吃肉 2020-11-30 01:46

I have a unicode string with accented Latin chars, e.g.

n = unicode('Wikipédia, le projet d’encyclopédie', 'utf-8')

I want to convert it to plain ASCII, i.e. something like "Wikipedia, le projet d'encyclopedie".

6 Answers
  • 2020-11-30 01:52

    Build a translation table (the unicode counterpart of maketrans), translate(), then convert to ASCII:

    # -*- coding: utf-8 -*-
    import unicodedata

    # unicode.translate() takes a dict keyed by code point; string.maketrans
    # only handles byte strings, so build the table as a dict instead.
    intab = u'áéí'   # extend as needed
    outtab = u'aei'  # as well as this one
    table = dict((ord(a), b) for a, b in zip(intab, outtab))

    text = u"Wikipédia, le projet d’encyclopédie".translate(table)

    try:
        fixed = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore')
        print fixed
    except Exception, errorInfo:
        print errorInfo
        print "Unable to convert the Unicode characters to ASCII"
        raise

    (from here)

  • 2020-11-30 01:58

    The unihandecode package describes itself as

    US-ASCII transliterations of Unicode text;
    an improved version of Python unidecode, which is itself a Python port of the Text::Unidecode Perl module by Sean M. Burke.

    pip install Unihandecode
    

    then in Python:

    import unihandecode
    print(unihandecode.unidecode(u'Wikipédia, le projet d’encyclopédie'))
    

    prints Wikipedia, le projet d'encyclopedie.

  • 2020-11-30 02:01

    Without measuring, I would expect that the .translate method of Unicode strings is the fastest solution. You should definitely make your own measurements, though.
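
    A minimal sketch of that approach (the character map below is only an
    illustration, not from this answer; extend it with whatever accented
    characters your data actually contains):

    # -*- coding: utf-8 -*-
    table = {ord(u'é'): u'e', ord(u'’'): u"'"}
    print u"Wikipédia, le projet d’encyclopédie".translate(table)
    # -> Wikipedia, le projet d'encyclopedie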

  • 2020-11-30 02:05

    So here are three approaches, more or less as given or suggested in other answers:

    # -*- coding: utf-8 -*-
    import codecs
    import unicodedata

    x = u"Wikipédia, le projet d’encyclopédie"

    # translation map: code point -> ASCII replacement
    xtd = {ord(u'’'): u"'", ord(u'é'): u'e', }

    # codec error handler: look up the offending character in xtd
    def asciify(error):
        return xtd[ord(error.object[error.start])], error.end

    codecs.register_error('asciify', asciify)

    def ae():
        return x.encode('ascii', 'asciify')

    def ud():
        return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore')

    def tr():
        return x.translate(xtd)

    if __name__ == '__main__':
        print 'or:', x
        print 'ae:', ae()
        print 'ud:', ud()
        print 'tr:', tr()

    Run as main, this emits:

    or: Wikipédia, le projet d’encyclopédie
    ae: Wikipedia, le projet d'encyclopedie
    ud: Wikipedia, le projet dencyclopedie
    tr: Wikipedia, le projet d'encyclopedie
    

    This shows clearly that the unicodedata-based approach, while it has the convenience of not needing a translation map xtd, can't translate all characters properly in an automated fashion: it works for the accented letters but not for the curly apostrophe (’), which the NFKD-plus-'ignore' combination simply drops. So it would also need some auxiliary step to deal explicitly with those characters (no doubt before what is now its body).
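
    A quick check (not from the original answer) confirms why: NFKD has no
    decomposition for the curly apostrophe, so the 'ignore' handler drops it
    outright:

    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', u'’')
    u'\u2019'
    >>> unicodedata.normalize('NFKD', u'’').encode('ASCII', 'ignore')
    ''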

    Performance is also interesting. On my laptop with Mac OS X 10.5 and system Python 2.5, quite repeatably:

    $ python -mtimeit -s'import a' 'a.ae()'
    100000 loops, best of 3: 7.5 usec per loop
    $ python -mtimeit -s'import a' 'a.ud()'
    100000 loops, best of 3: 3.66 usec per loop
    $ python -mtimeit -s'import a' 'a.tr()'
    10000 loops, best of 3: 21.4 usec per loop
    

    translate is surprisingly slow (relative to the other approaches). I believe the issue is that, in the translate case, the dict is looked up for every character (and most of them are not in it), whereas with the asciify approach the lookup happens only for the few characters that ARE there.

    So, for completeness, here's the "beefed-up unicodedata" approach:

    specstd = {ord(u'’'): u"'", }

    # error handler: substitute if the character is in specstd, otherwise drop it
    def specials(error):
        return specstd.get(ord(error.object[error.start]), u''), error.end

    codecs.register_error('specials', specials)

    def bu():
        return unicodedata.normalize('NFKD', x).encode('ASCII', 'specials')
    

    this gives the right output, BUT:

    $ python -mtimeit -s'import a' 'a.bu()'
    100000 loops, best of 3: 10.7 usec per loop
    

    ...speed isn't all that good any more. So, if speed matters, it's no doubt worth the trouble of making a complete xtd translation dict and using the asciify approach. When a few extra microseconds per translation are no big deal, one might want to consider the bu approach simply for its convenience (only needs a translation dict for, hopefully few, special characters that don't translate correctly with the underlying unicodedata idea).
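
    For the "complete xtd" route, here is a hedged sketch (an addition, not
    part of the answer above) of one way to pre-build the dict automatically:
    take the NFKD/ASCII fallback for every non-ASCII character that actually
    occurs in your data, with an explicit override map for the characters the
    unicodedata idea gets wrong.

    # Sketch only: build_xtd and overrides are assumed names, not from the answer.
    def build_xtd(text, overrides={u'’': u"'"}):
        xtd = {}
        for ch in set(text):
            if ord(ch) < 128:
                continue  # already ASCII, nothing to map
            if ch in overrides:
                xtd[ord(ch)] = overrides[ch]
            else:
                # fall back to the unicodedata idea for this one character
                folded = unicodedata.normalize('NFKD', ch).encode('ASCII', 'ignore')
                xtd[ord(ch)] = folded.decode('ASCII')
        return xtd

    # e.g. xtd = build_xtd(x), then use it with the asciify handler as above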

  • 2020-11-30 02:13

    The awesome unidecode module does this for you:

    >>> import unidecode
    >>> n = unicode('Wikipédia, le projet d’encyclopédie','utf-8')
    >>> unidecode.unidecode(n)
    "Wikipedia, le projet d'encyclopedie"
    
  • 2020-11-30 02:14

    The "correct" way to do this is to register your own error handler for unicode encoding/decoding, and in that error handler provide the replacements from è to e and ö to o, etc.

    Like so:

    # -*- coding: UTF-8 -*-
    import codecs

    # character -> replacement map (named char_map so it doesn't shadow the
    # built-in map)
    char_map = {u'é': u'e',
                u'’': u"'",
                # ETC
                }

    # encode error handler: substitute from char_map and resume after the
    # offending character
    def asciify(error):
        return char_map[error.object[error.start]], error.end

    codecs.register_error('asciify', asciify)

    test = u'Wikipédia, le projet d’encyclopédie'
    print test.encode('ascii', 'asciify')

    You might also find something in IBM's ICU library and its Python bindings, PyICU, which might be less work.
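
    For reference, a hedged sketch of what the PyICU route could look like,
    using ICU's transliteration transforms (the transform ID and calls below
    come from ICU's general transliteration API, not from this answer; check
    them against your PyICU version):

    # Assumes PyICU is installed (pip install PyICU).
    from icu import Transliterator

    # "Any-Latin; Latin-ASCII" is an ICU transform chain that strips accents
    # and folds punctuation such as the curly apostrophe to plain ASCII.
    to_ascii = Transliterator.createInstance('Any-Latin; Latin-ASCII')
    print to_ascii.transliterate(u'Wikipédia, le projet d’encyclopédie')
    # should print something like: Wikipedia, le projet d'encyclopedie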
