Unicode and Python issue (access to Unicode code charts)

Question

Yesterday I wrote the following function to convert an integer to Persian digits:

def integerToPersian(number):
    listedPersian = ['۰','۱','۲','۳','۴','۵','۶','۷','۸','۹']
    listedEnglish = ['0','1','2','3','4','5','6','7','8','9']    
    returnList = list()

    listedTmpString = list(str(number))

    for i in listedTmpString:
        returnList.append(listedPersian[listedEnglish.index(i)])

    return ''.join(returnList)

When you call it, for example integerToPersian(3455), it returns ۳۴۵۵, which is equivalent to 3455 in Persian and Arabic. This function is very useful when you read a number from a database, for example, and want to show it in a widget.

I downloaded the Unicode code charts from http://unicode.org because I need to write PersianToInteger('unicodeString'). As I understand it, the function should take UTF-8 as its parameter, and UTF-8 stores each of these characters in 2 bytes. Also, I'm a newbie in Python.

My questions are: how can I store 2 bytes? How does UTF-8 store characters? How can I convert a Unicode string to another format? How can I use the Unicode code charts?

Note: I found that the int() built-in function is supposed to handle this, but I couldn't get it to work. Maybe you can.

Answer 1:

You need to read the Python Unicode HOWTO for either Python 2.x or 3.x, as appropriate. But I can give you brief answers to your questions.

My questions are: how can I store 2 bytes? How does UTF-8 store characters? How can I convert a Unicode string to another format?

A unicode object holds characters; a bytes object holds bytes.

Note that in Python 2.x, str is the same thing as bytes; in 3.x, it's the same thing as unicode. And in both versions, a literal with neither a u nor a b prefix is a str. Since you didn't tell us whether you're using Python 2 or 3, I'll use explicit unicode and bytes, and u and b prefixes, everywhere.

You convert between them by picking an encoding (in this case, UTF-8) and using the encode and decode methods. For example:

>>> my_str = u'۰۱'
>>> my_bytes = b'\xdb\xb0\xdb\xb1'
>>> my_str.encode('utf-8') == my_bytes
True
>>> my_bytes.decode('utf-8') == my_str
True

If you have a UTF-8 bytes object, you should decode it to unicode as early as possible, and do all your work with it in Unicode. Then you don't have to worry about how many bytes something takes, just treat each character as a character. If you need UTF-8 output, encode back as late as possible.

(Very occasionally, the performance cost of decoding and encoding is too high, and you need to deal with UTF-8 directly. But unless that really is a bottleneck in your code, don't do it.)
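As a rough illustration of the decode-early, encode-late advice, here is a minimal sketch assuming Python 3. The name raw is hypothetical and simply stands in for bytes that arrived from a file, socket, or database driver:

```python
# Hypothetical input: UTF-8 bytes for the Persian digits of 3455.
raw = b'\xdb\xb3\xdb\xb4\xdb\xb5\xdb\xb5'

# Decode as early as possible.
text = raw.decode('utf-8')

# Each Persian digit takes two bytes in UTF-8 but is a single character.
byte_count = len(raw)    # 8
char_count = len(text)   # 4

# Do the actual work on characters, not bytes (here: reverse the digits).
reversed_text = text[::-1]

# Encode back as late as possible, only when bytes are needed for output.
out = reversed_text.encode('utf-8')
```

Reversing raw directly would have split the two-byte sequences and produced invalid UTF-8, which is the kind of problem that working in Unicode avoids.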

So, let's say you wanted to adapt your integerToPersian to take a UTF-8 English digit string instead of an integer, and to return a UTF-8 Persian digit string instead of a Unicode one. (I'm assuming Python 3 for the purposes of this example.) All you need to do is change str(number) to number.decode('utf-8'), and change return ''.join(returnList) to return ''.join(returnList).encode('utf-8'), and that's it.
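Applied to the function from the question, those two changes might look like the sketch below. It assumes Python 3, and the name digitsToPersianUtf8 is made up here to keep it distinct from the original integerToPersian:

```python
def digitsToPersianUtf8(number):
    # `number` is assumed to be a UTF-8 bytes object of ASCII digits, e.g. b'3455'.
    listedPersian = [u'۰', u'۱', u'۲', u'۳', u'۴', u'۵', u'۶', u'۷', u'۸', u'۹']
    listedEnglish = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9']
    returnList = list()

    # Changed: decode the incoming bytes instead of calling str(number).
    listedTmpString = list(number.decode('utf-8'))

    for i in listedTmpString:
        returnList.append(listedPersian[listedEnglish.index(i)])

    # Changed: encode the result back to UTF-8 on the way out.
    return u''.join(returnList).encode('utf-8')


result = digitsToPersianUtf8(b'3455')  # b'\xdb\xb3\xdb\xb4\xdb\xb5\xdb\xb5'
```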

How can I use the Unicode code charts?

Python already comes with the Unicode code charts (and the right ones to match your version of Python) compiled into the unicodedata module, so usually it's a lot easier to just use those than to try to use the charts yourself. For example:

>>> import unicodedata
>>> unicodedata.digit(u'۱')
1

… I need to write PersianToInteger('unicodeString')

You really shouldn't need to. Unless you're using a very old Python, int should do it for you. For example, in 2.6:

>>> int(u'۱۱')
11

If it's not working for you, unicodedata is the easiest solution:

>>> numeral = u'۱۱'
>>> [unicodedata.digit(ch) for ch in numeral]
[1, 1]
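The list of digit values still has to be combined into a single integer. One possible way to do that, sketched here with a hypothetical helper named digits_to_int, is to accumulate the value one decimal place at a time:

```python
import unicodedata


def digits_to_int(numeral):
    # Hypothetical helper: fold per-character digit values into one integer.
    # unicodedata.digit raises ValueError for a character with no digit value.
    value = 0
    for ch in numeral:
        value = value * 10 + unicodedata.digit(ch)
    return value


eleven = digits_to_int(u'۱۱')      # 11
number = digits_to_int(u'۳۴۵۵')    # 3455
```

Note that this sketch returns 0 for an empty string, whereas int raises ValueError in that case.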

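However, either of these will convert digits in any script to a number, not just Persian. And there's nothing in the Unicode charts that will directly tell you that a digit is Persian; the best you can do is parse the name. The Persian digits are named EXTENDED ARABIC-INDIC DIGIT ZERO through NINE, while the plain ARABIC-INDIC DIGIT names belong to the Arabic digits ٠ through ٩, so check for the longer name:

>>> all(u'EXTENDED ARABIC-INDIC DIGIT' in unicodedata.name(ch) for ch in numeral)
True
>>> all(u'EXTENDED ARABIC-INDIC DIGIT' in unicodedata.name(ch) for ch in u'123')
False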
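If the name check is needed in more than one place, it could be wrapped in a small helper. The sketch below is one way to do it, with a made-up function name. It matches on the prefix EXTENDED ARABIC-INDIC DIGIT, which is how the Persian digits U+06F0 through U+06F9 are named, so the Arabic digits U+0660 through U+0669 (named ARABIC-INDIC DIGIT ...) are rejected:

```python
import unicodedata


def is_persian_numeral(text):
    # Hypothetical helper: True only if `text` is non-empty and every
    # character is a Persian digit (U+06F0..U+06F9).
    # The second argument to unicodedata.name is returned for unnamed
    # characters instead of raising ValueError.
    return len(text) > 0 and all(
        unicodedata.name(ch, u'').startswith(u'EXTENDED ARABIC-INDIC DIGIT')
        for ch in text
    )


persian_ok = is_persian_numeral(u'۱۱')    # Persian digits
ascii_ok = is_persian_numeral(u'123')     # ASCII digits
arabic_ok = is_persian_numeral(u'١١')     # Arabic digits, U+0661
```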

If you really want to do things in either direction by mapping digits from one script to another, here's a better solution:

listedPersian = [u'۰', u'۱', u'۲', u'۳', u'۴', u'۵', u'۶', u'۷', u'۸', u'۹']
listedEnglish = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9']
# Build each lookup table once; a dict lookup replaces the list.index() search.
persianToEnglishMap = dict(zip(listedPersian, listedEnglish))
englishToPersianMap = dict(zip(listedEnglish, listedPersian))

def persianToNumber(persian_numeral):
    # Raises KeyError if any character is not a Persian digit.
    english_numeral = u''.join(persianToEnglishMap[digit] for digit in persian_numeral)
    return int(english_numeral)
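The opposite direction can use englishToPersianMap in the same way. This is only a sketch, assuming Python 3 and a non-negative integer; the function name numberToPersian is made up, and the maps are repeated so the block runs on its own:

```python
listedPersian = [u'۰', u'۱', u'۲', u'۳', u'۴', u'۵', u'۶', u'۷', u'۸', u'۹']
listedEnglish = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9']
englishToPersianMap = dict(zip(listedEnglish, listedPersian))


def numberToPersian(number):
    # Hypothetical counterpart to persianToNumber.
    # Assumes a non-negative integer: a '-' sign would raise KeyError.
    return u''.join(englishToPersianMap[digit] for digit in str(number))


persian = numberToPersian(3455)  # u'۳۴۵۵'
```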

Source: https://stackoverflow.com/questions/18707008/unicode-and-python-issue-access-to-unicde-code-charts

Tags

python

unicode

utf-8

unicode-string

python-unicode