An equivalent to string.ascii_letters for unicode strings in python 2.x?

Posted by 泄露秘密 on 2019-11-29 01:42:53

You can construct your own constant of Unicode upper and lower case letters with:

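import unicodedata as ud
# Every code point in the Basic Multilingual Plane (Python 2: unichr/xrange).
all_unicode = u''.join(unichr(i) for i in xrange(65536))
# Keep only uppercase (Lu) and lowercase (Ll) letters.
unicode_letters = u''.join(c for c in all_unicode
                           if ud.category(c) in ('Lu', 'Ll'))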

This makes a string 2153 characters long (narrow Unicode Python build). For membership tests such as letter in unicode_letters, it would be faster to use a set instead:

unicode_letters = set(unicode_letters)
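
A minimal sketch of the same idea that also runs on Python 3, where unichr and xrange no longer exist. The names BMP_LIMIT and build_letter_set are illustrative and not part of the answer above. The exact number of letters depends on the Unicode database bundled with the interpreter, so the size is deliberately not asserted here.

```python
import sys
import unicodedata as ud

# Python 2 has unichr/xrange; Python 3 folded them into chr/range.
if sys.version_info[0] >= 3:
    unichr = chr
    xrange = range

BMP_LIMIT = 0x10000  # same 65536 bound as the snippet above (assumed name)


def build_letter_set(categories=('Lu', 'Ll')):
    """Return a frozenset of BMP characters whose category is in `categories`."""
    return frozenset(
        c for c in (unichr(i) for i in xrange(BMP_LIMIT))
        if ud.category(c) in categories
    )


unicode_letters = build_letter_set()
```

With the frozenset, a test such as u'\u0444' in unicode_letters is a hash lookup rather than a scan through a string of a couple of thousand characters.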

There's no string, but you can check whether a character is a letter using the unicodedata module, in particular its category() function.

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'A')
'Lu'
>>> unicodedata.category(u'5')
'Nd'
>>> unicodedata.category(u'ф') # Cyrillic f.
'Ll'
>>> unicodedata.category(u'٢') # Arabic-indic numeral for 2.
'Nd'

Ll means "letter, lowercase". Lu means "letter, uppercase". Nd means "number, decimal digit".
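
If only a yes/no test is needed, the category check can be wrapped in a small helper instead of building any constant. This is a sketch: is_letter is a hypothetical name, and treating every category that starts with 'L' (Lu, Ll, Lt, Lm, Lo) as a letter is an assumption that is broader than a filter on Lu and Ll alone.

```python
import unicodedata


def is_letter(ch):
    """True if the single character `ch` is in any Unicode letter category (L*)."""
    return unicodedata.category(ch).startswith('L')


print(is_letter(u'a'))       # Latin small letter a
print(is_letter(u'\u0444'))  # Cyrillic small letter ef
print(is_letter(u'5'))       # digit five
print(is_letter(u'\u0662'))  # Arabic-Indic digit two
```

The first two calls print True and the last two print False, mirroring the categories shown in the session above.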

That would be a pretty massive constant. Unicode currently covers over 100,000 different characters. So the answer is no.

The question is why you would need it. There might be some other way of solving whatever your problem is, with the unicodedata module for example.

Update: You can download files with all Unicode code point names and other information from ftp://ftp.unicode.org/, and do loads of interesting stuff with that.

As mentioned in previous answers, the string would indeed be way too long. So, you have to target (a) specific language(s).
[EDIT: I realized it was the case for my original intended use, and for most uses, I guess. However, in the meantime, Mark Tolonen gave a good answer to the question as it was asked, so I chose his answer, although I used the following solution]

This is easily done with the "locale" module:

import locale
import string
code = 'fr_FR' ## Do NOT specify encoding (see below)
locale.setlocale(locale.LC_CTYPE, code)
encoding = locale.getlocale()[1]
letters = string.letters.decode(encoding)

with "letters" being a 117-character-long unicode string.

Apparently, string.letters is dependent on the default encoding for the selected language code, rather than on the language itself. Setting the locale to fr_FR or de_DE or es_ES will update string.letters to the same value (since they are all encoded in ISO8859-1 by default).

If you add an encoding to the language code (e.g. de_DE.UTF-8), the default encoding is still used for string.letters instead, so the rest of the above code would raise a UnicodeDecodeError.
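
To see why the result follows the encoding rather than the language, a similar string can be approximated without touching the locale at all: decode every byte value with the encoding and keep the characters Unicode classifies as letters. This is a swapped-in technique, not what string.letters does internally; letters_for_encoding is a hypothetical helper, and it assumes a single-byte encoding such as ISO8859-1.

```python
import unicodedata


def letters_for_encoding(encoding):
    """Return the letters representable as a single byte in `encoding`."""
    found = []
    for value in range(256):
        raw = bytes(bytearray([value]))  # one-byte string on Python 2 and 3
        try:
            ch = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is not a complete character in this encoding
        if len(ch) == 1 and unicodedata.category(ch).startswith('L'):
            found.append(ch)
    return u''.join(found)


latin1_letters = letters_for_encoding('iso8859-1')
print(len(latin1_letters))
```

For ISO8859-1 this yields 117 characters, the same length reported above. For 'utf-8' only the 52 ASCII letters survive, because no byte above 127 is a complete character on its own.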
