Python UTF-8 Latin-1 displays wrong character

Submitted by 早过忘川 on 2020-01-30 02:40:30

Question


I'm writing a very small script that converts Latin-1 characters into Unicode (I'm a complete beginner in Python).

I tried a method like this:

def latin1_to_unicode(character):
    uni = character.decode('latin-1').encode("utf-8")
    return uni

It works fine for characters that are not specific to the latin-1 set, but if I try the following example:

print latin1_to_unicode('å')

It returns Ã¥ instead of å. Same goes for other letters like æ and ø.

Can anyone please explain why this is happening? Thanks

I have the # -*- coding: utf8 -*- declaration in my script, in case that matters.


Answer 1:


Your source code is encoded as UTF-8, but you are decoding the data as Latin-1. Don't do that; you are creating Mojibake.

Decode from UTF-8 instead, and don't encode again. print writes to sys.stdout, which is configured with your terminal or console codec (detected when Python starts).
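As a minimal sketch of that advice, here is the same idea in Python 3 syntax (where bytes and text are separate types) rather than the Python 2 used in the question. The function name decode_utf8_source_bytes is hypothetical, and the byte string is an assumed stand-in for what a UTF-8 source file or terminal would supply.

```python
def decode_utf8_source_bytes(data):
    """Decode UTF-8 bytes to text once; do not encode again (hypothetical helper)."""
    return data.decode("utf-8")


# Assumption: these are the two bytes a UTF-8 editor or terminal produces for the letter å.
raw = b"\xc3\xa5"

text = decode_utf8_source_bytes(raw)

# ascii() escapes non-ASCII characters, so this line prints safely on any console codec.
print(len(text), ascii(text))
```

Handing the decoded text straight to print leaves the encoding step to sys.stdout, so no explicit encode call is needed.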

My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:

>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥

You can see that the character uses two bytes. When you save your Python source with an editor configured to use UTF-8, Python reads those exact same bytes from disk into your bytestring.

Decoding those two bytes as Latin-1 produces two Unicode code points, one per byte: U+00C3 (Ã) and U+00A5 (¥).
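The effect can be reproduced with a short sketch, again in Python 3 syntax as an assumption (the session above is Python 2). It decodes the same two bytes both ways and looks up the official name of each resulting code point. The variable names are illustrative only.

```python
import unicodedata

# The two bytes shown in the session above: the UTF-8 encoding of å.
utf8_bytes = b"\xc3\xa5"

wrong = utf8_bytes.decode("latin-1")  # one code point per byte
right = utf8_bytes.decode("utf-8")    # both bytes form a single code point

for label, decoded in (("latin-1", wrong), ("utf-8", right)):
    names = [unicodedata.name(ch) for ch in decoded]
    print(label, len(decoded), names)

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so the mis-decoding can be undone by encoding back and decoding correctly.
repaired = wrong.encode("latin-1").decode("utf-8")
```

The round trip in the last line only recovers the text because Latin-1 decoding loses no bytes. It is a repair for data that is already mangled, not a substitute for decoding with the right codec in the first place.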

You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • Pragmatic Unicode by Ned Batchelder

  • The Python Unicode HOWTO



Source: https://stackoverflow.com/questions/28630080/python-utf-8-latin-1-displays-wrong-character
