Convert an Arabic string to Unicode to use it

Question

I have a variable holding a value like x='مصطفى' and I want to convert it to the form u'مصطفى' so I can use it again in some functions. When I try u''+x, it always gives me an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 0: ordinal not in range(128)

Any help?

Answer 1:

You have to know what encoding those bytes are in, and then .decode(encoding) them to get a Unicode string. If you received them from some API, UTF-8 is a good guess. If you read the bytes from a file typed in Windows Notepad, it is more likely some Arabic code page (possibly cp1256).

PythonWin 2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 20:32:19) [MSC v.1500 32 bit (Intel)] on win32.
>>> x='مصطفى' # "Just bytes" in whatever encoding my console uses
>>> x         # Looks like UTF-8.
'\xd9\x85\xd8\xb5\xd8\xb7\xd9\x81\xd9\x89'
>>> x.decode('utf8')  # Success
u'\u0645\u0635\u0637\u0641\u0649'
>>> print(x.decode('utf8'))
مصطفى
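
When the encoding is not known in advance, one option is to try a short list of candidates in order. The following is only a sketch: the helper name decode_bytes and the candidate list ('utf-8', 'cp1256') are assumptions for illustration, not part of the original answer. UTF-8 is tried first because it is strict and rejects most bytes that are not really UTF-8, whereas a single-byte code page will decode most byte sequences without complaint and could silently produce the wrong text.

```python
# Hypothetical helper (not from the original answer): try candidate
# encodings in order and return the first successful decode.

def decode_bytes(raw, encodings=('utf-8', 'cp1256')):
    """Decode a byte string by trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate did not fit, try the next one
    raise ValueError('no candidate encoding matched: %r' % (encodings,))


# The name from the question, written with escapes so the source file's
# own encoding does not matter.
name = u'\u0645\u0635\u0637\u0641\u0649'

utf8_bytes = name.encode('utf-8')     # e.g. bytes received from an API
cp1256_bytes = name.encode('cp1256')  # e.g. bytes from an Arabic code page file

decoded_from_utf8 = decode_bytes(utf8_bytes)
decoded_from_cp1256 = decode_bytes(cp1256_bytes)
```

Both calls should give back the same Unicode string. A guess like this is a fallback only; if the source of the bytes documents its encoding, pass that encoding to .decode() directly.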

Answer 2:

Thanks, I solved it :)

The solution is to do this:

u''.encode('utf-8')+x

Answer 3:

There are two things.

First, the meaning of x='مصطفى' is ill-defined, and changes if you save your source file in another encoding. On the other hand, x=u'مصطفى'.encode('utf-8') unambiguously means “the bytes you get when you encode that text with UTF-8”.

Second, use either bytes ('abc' or b'abc') or unicode (u'abc'), but don't mix them. Mixing them in Python 2.x produces results that depend on where you execute the code. In Python 3.x it raises an error (for good reasons).

So given a byte string x, either:

# bytes
'' + x

or:

# unicode, so decode the byte string
u'' + x.decode('utf-8')
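
Both points can be checked with a small script. This is a sketch under assumptions: the variable names are illustrative, cp1256 is used only as an example of a second encoding, and the Python 2 behaviour described in the comments assumes the default ASCII codec is in effect.

```python
# The name from the question, written with escapes so the result does not
# depend on the encoding of this source file.
text = u'\u0645\u0635\u0637\u0641\u0649'

# Point 1: the same text becomes different bytes under different encodings,
# so a bare byte string is ambiguous unless its encoding is known.
x = text.encode('utf-8')
x_cp1256 = text.encode('cp1256')
same_bytes = (x == x_cp1256)

# Point 2: mixing unicode and bytes fails. Python 2 attempts an implicit
# ASCII decode (UnicodeDecodeError); Python 3 refuses outright (TypeError).
error_name = None
try:
    mixed = u'' + x
except (UnicodeDecodeError, TypeError) as exc:
    mixed = None
    error_name = type(exc).__name__

# Decoding explicitly first gives a unicode result on both versions.
result = u'' + x.decode('utf-8')
```

Here same_bytes should be False, the mixed concatenation should fail, and result should equal the original text.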

Source: https://stackoverflow.com/questions/37555473/unicode-arabic-string-to-user-it

Tags: python, python-2.7, unicode, decode