Is u'string' the same as 'string'.decode('XXX')

Submitted by 拈花ヽ惹草 on 2020-01-02 03:25:08

Question


Although the title is a question, the short answer is apparently no; I've tried it in the shell. The real question is why. PS: string is some non-ASCII text (Chinese characters, for example) and XXX is the encoding that string is currently in.

>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4' while the second one is u'\u4e2d\u6587'

The example is above. I am using Windows with Simplified Chinese; the default encoding is GBK, and so is the Python shell's. Yet the two unicode objects are not equal.

UPDATES

>>> a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文

>>> b = u'中文'
>>> print b
ÖÐÎÄ

Answer 1:


Yes, str.decode() usually returns a unicode string, provided the codec can successfully decode the bytes. But the two values only represent the same text if the correct codec is used.

Your sample is not using the right codec; you have text that is GBK-encoded but was decoded as Latin-1:

>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'

The values are indeed not equal, because they are not the same text.

Again, it is important that you use the right codec; a different codec will result in very different results:

>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ

Here I decoded the GBK-encoded bytes as Latin-1, not GBK or UTF-8. The decode succeeds (Latin-1 can decode any byte sequence), but the resulting text is unreadable mojibake.
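By contrast, decoding the bytes with the codec that produced them gives back the original text, and the comparison from the question then succeeds; a quick Python 2 shell sketch:

>>> gbk_bytes = u'\u4e2d\u6587'.encode('gbk')
>>> gbk_bytes
'\xd6\xd0\xce\xc4'
>>> gbk_bytes.decode('gbk')        # matching codec: the round trip is lossless
u'\u4e2d\u6587'
>>> gbk_bytes.decode('gbk') == u'\u4e2d\u6587'
True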

Note also that pasting non-ASCII characters only works because the Python interpreter detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python asked the terminal what codec was used, it was able to decode those bytes back into the u'....' Unicode literal value. When printing the resulting unicode value, Python once more auto-encodes the data to fit my terminal encoding.

To see what codec Python detected, print sys.stdin.encoding:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
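When that detection is wrong, or when sys.stdin.encoding is None (for example, when input is piped rather than typed at a terminal), you can decode the bytes yourself instead of relying on the implicit literal handling. A minimal sketch, assuming a UTF-8 terminal; the 'utf-8' fallback here is an assumption, not something Python picks for you:

>>> import sys
>>> raw = '中文'                            # byte string exactly as the terminal sent it
>>> codec = sys.stdin.encoding or 'utf-8'   # assumed fallback if detection failed
>>> raw.decode(codec)
u'\u4e2d\u6587'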

Similar decisions have to be made when dealing with other sources of text. Reading string literals from a source file, for example, requires that you either use ASCII only (with escape codes for everything else) or give Python an explicit codec declaration at the top of the file.
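For example, a Python 2 source file that contains non-ASCII literals needs a PEP 263 coding declaration matching the encoding the file is actually saved in; a minimal sketch, assuming the file is saved as UTF-8:

# -*- coding: utf-8 -*-
# The declaration above tells Python 2 which codec to use when decoding
# the bytes of this file, including the u'...' literal below.
text = u'中文'
print repr(text)    # u'\u4e2d\u6587' when the file really is UTF-8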

I urge you to read:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

to gain a more complete understanding of how Unicode works and how Python handles it.




Answer 2:


Assuming Python 2.7, going by the title.

The answer is no, because when you call string.decode(XXX) you get a unicode object whose contents depend on the codec you pass as an argument.

When you use u'string', the codec is inferred from the shell's current encoding, or, in a source file, from the default (ASCII in Python 2) or from whatever # coding: utf-8 special comment you put at the top of the script.

To clarify: if codec XXX is guaranteed to always be the same codec used for the script's input (either the shell or the file), then both approaches behave pretty much the same.
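A small sketch illustrating that point, assuming the script is saved as UTF-8 so the source codec and the decode codec match:

# -*- coding: utf-8 -*-
print u'中文' == '中文'.decode('utf-8')    # True: byte string decoded with the same codec the file uses
print u'中文' == '中文'.decode('latin1')   # False: a different codec yields different text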

Hope this helps!



Source: https://stackoverflow.com/questions/20973745/is-ustring-the-same-as-string-decodexxx
