Handle wrongly encoded character in Python unicode string

暗喜 2020-12-06 10:01

I am dealing with unicode strings returned by the python-lastfm library.

I assume somewhere on the way, the library gets the encoding wrong and returns a unicode str

5 Answers
  • 2020-12-06 10:25

    You have to convert your unicode string into a byte string (a plain str in Python 2) using some encoding, e.g. UTF-8:

    some_unicode_string.encode('utf-8')
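
    As a small illustration of that call (the variable names are made up for this example), encoding turns the text into UTF-8 bytes and decoding turns those bytes back into the same text:

```python
# Illustrative sketch; `text`, `encoded`, and `decoded` are example names.
text = u"Gl\xfcck"                   # unicode text containing U+00FC (ü)

encoded = text.encode("utf-8")       # bytes: ü becomes the two bytes 0xC3 0xBC
decoded = encoded.decode("utf-8")    # back to the original unicode text

print(repr(encoded))
print(decoded == text)
```

    The encoded form is one byte longer than the text has characters, because ü needs two bytes in UTF-8.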
    

    Apart from that: this is a dupe of

    BeautifulSoup findall with class attribute- unicode encode error

    and at least ten other related questions on SO. Research first.

  • 2020-12-06 10:30

    Do not cast what you get from model fields to a string with str() if it is already a unicode string. (Oops, I completely missed that this is not Django-related.)

  • 2020-12-06 10:39

    At the beginning of your code, just after imports, add these 3 lines.

    import sys  # import sys package, if not already imported
    reload(sys)
    sys.setdefaultencoding('utf-8')
    

    This overrides the system default encoding (ascii) for the duration of your program.

    Edit: You shouldn't do this unless you are sure of the consequences; see the comment below. This post is also helpful: Dangers of sys.setdefaultencoding('utf-8')
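
    A less invasive option, sketched here on the assumption that the text ends up in a file (the path and variable names are hypothetical), is to leave the default encoding alone and name the encoding explicitly wherever text enters or leaves the program:

```python
import io
import os
import tempfile

text = u"Gl\xfcck"

# Hypothetical scratch file, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "words.txt")

# State the encoding explicitly where text leaves the program...
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# ...and again where it comes back in.
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()

print(round_tripped == text)
```

    io.open accepts the encoding argument in both Python 2.7 and Python 3, so nothing global has to be changed.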

  • 2020-12-06 10:44

    I stumbled upon this bug myself while processing a file containing German words; I was unaware that it had been encoded in UTF-8. The problem manifested itself when I started processing the words and some of them triggered the following error:

    # python
    Python 2.7.12 (default, Aug 22 2019, 16:36:40) 
    >>> utf8_word = u"Gl\xfcck"
    >>> print("Word read was: {}".format(utf8_word))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)
    

    I solved the error by calling the encode method on the string:

    >>> print("Word read was: {}".format(utf8_word.encode('utf-8')))
    Word read was: Glück
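
    Another way to avoid the implicit ASCII conversion is to make the format string itself unicode, so that no byte string takes part in the formatting step. A minimal sketch (the names message and payload are illustrative, not from the original session):

```python
utf8_word = u"Gl\xfcck"

# A unicode format string keeps the whole result as unicode text,
# so formatting never has to encode the word as ASCII.
message = u"Word read was: {}".format(utf8_word)

# Encode once, explicitly, at the point where bytes are required.
payload = message.encode("utf-8")

print(repr(payload))
```

    Printing message directly still depends on the terminal's encoding, so this moves the encoding decision to a single explicit place rather than removing it.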
    
  • 2020-12-06 10:47

    Your unicode string is fine:

    >>> import unicodedata
    >>> unicodedata.name(u"\xfc")
    'LATIN SMALL LETTER U WITH DIAERESIS'
    

    The problem you see at the interactive prompt is that the interpreter doesn't know which encoding to use when writing the string to your terminal, so it falls back to the "ascii" codec, which can only handle ASCII characters. It works fine on my machine because sys.stdout.encoding is "UTF-8" for me, most likely because my environment variable settings differ from yours:

    >>> print u'Gl\xfcck'
    Glück
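
    To see what your own interpreter has picked, you can inspect sys.stdout.encoding. The sketch below does that and adds a hypothetical helper, encode_for_stream, which is not part of any library and simply substitutes "?" for characters the stream cannot represent:

```python
import sys


def encode_for_stream(text, stream=None):
    """Encode text for a stream, replacing characters it cannot represent."""
    if stream is None:
        stream = sys.stdout
    # Redirected or piped streams may report no encoding at all.
    encoding = getattr(stream, "encoding", None) or "ascii"
    return text.encode(encoding, "replace")


print(sys.stdout.encoding)
print(repr(encode_for_stream(u"Gl\xfcck")))
```

    On a stream that reports "ascii" the ü is replaced by "?", while a UTF-8 stream gets the two-byte sequence for ü.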
    