An equivalent to string.ascii_letters for Unicode strings in Python 2.x?

In the "string" module of the standard library,

string.ascii_letters ## Same as string.ascii_lowercase + string.ascii_uppercase

4 Answers

后悔当初

2020-12-16 00:30

That would be a pretty massive constant. Unicode currently covers over 100,000 different characters. So the answer is no: there is no such constant.

The question is why you would need it. There might be some other way of solving whatever your problem is, with the unicodedata module, for example.

Update: You can download files with all Unicode code point names and other information from ftp://ftp.unicode.org/, and do loads of interesting stuff with that.
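For simple lookups you may not need the downloaded files at all, since the unicodedata module ships with a copy of the character database. A minimal sketch, where the variable names are only illustrative:

```python
import unicodedata

# Look up the official name of a code point (U+00E9, e with acute accent).
name = unicodedata.name(u'\u00e9')

# Go the other way: find the character that carries a given name.
same_char = unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')

# The Unicode version the bundled database was built from.
db_version = unicodedata.unidata_version
```

Which characters are known depends on the Unicode version bundled with your Python, so results for recently added code points can differ between installations.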

自闭症患者

2020-12-16 00:36
You can construct your own constant of Unicode upper and lower case letters with:
```
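import unicodedata as ud

# Code points 0..65535, the range unichr() accepts on a narrow Python 2 build
all_unicode = ''.join(unichr(i) for i in xrange(65536))
# Keep only uppercase (Lu) and lowercase (Ll) letters
unicode_letters = ''.join(c for c in all_unicode
                          if ud.category(c) in ('Lu', 'Ll'))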
```
This produces a string 2153 characters long (on a narrow Unicode Python build). For membership tests such as letter in unicode_letters, it would be faster to use a set instead:
```
unicode_letters = set(unicode_letters)
```
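If the same code has to run on both Python 2 and Python 3, something like the following might work. It is only a sketch: build_letter_set, to_char and irange are names made up for this example, and the default limit of 0x10000 just mirrors the 65536 used above.

```python
import unicodedata as ud

# unichr/xrange exist only on Python 2; fall back to chr/range on Python 3.
try:
    to_char, irange = unichr, xrange
except NameError:
    to_char, irange = chr, range


def build_letter_set(limit=0x10000, categories=('Lu', 'Ll')):
    """Return a frozenset of the characters below `limit` whose
    Unicode general category is one of `categories`."""
    return frozenset(
        c for c in (to_char(i) for i in irange(limit))
        if ud.category(c) in categories
    )


unicode_letters = build_letter_set()
```

The exact size of the set depends on the Unicode version your Python was built with, so it will not necessarily match the 2153 quoted above.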
日久生厌

2020-12-16 00:47
As mentioned in the previous answers, such a string would indeed be far too long. So you have to target one or more specific languages.
[EDIT: I realized that this was the case for my original intended use, and for most uses, I guess. However, in the meantime, Mark Tolonen gave a good answer to the question as it was asked, so I chose his answer, although I used the following solution.]

This is easily done with the "locale" module:
```
import locale
import string
code = 'fr_FR' ## Do NOT specify encoding (see below)
locale.setlocale(locale.LC_CTYPE, code)
encoding = locale.getlocale()[1]
letters = string.letters.decode(encoding)
```
Here "letters" is a 117-character-long unicode string.

Apparently, string.letters is dependent on the default encoding for the selected language code, rather than on the language itself. Setting the locale to fr_FR or de_DE or es_ES will update string.letters to the same value (since they are all encoded in ISO8859-1 by default).

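If you add an encoding to the language code (for example de_DE.UTF-8), string.letters is apparently still built with the default encoding rather than the one you specified. Since locale.getlocale() then reports the encoding you asked for, the decode call in the code above would raise a UnicodeDecodeError.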
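The failure itself is easy to reproduce without touching the locale. This is only a sketch of the likely mechanism, assuming string.letters holds ISO8859-1 bytes as described above; the variable names are made up for the example:

```python
# The letter "e with acute accent" encoded as ISO8859-1 is the single byte 0xE9.
latin1_bytes = b'\xe9'

# Decoding with the matching codec works and yields U+00E9.
decoded = latin1_bytes.decode('iso8859-1')

# Decoding the same byte as UTF-8 fails: 0xE9 announces a three-byte
# sequence, but no continuation bytes follow it.
try:
    latin1_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
```

So the byte string and the codec passed to decode have to agree, which is why the language code is given without an encoding in the code above.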
臣服心动

2020-12-16 00:56
There's no string, but you can check whether a character is a letter using the unicodedata module, in particular its category() function.
```
>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'A')
'Lu'
>>> unicodedata.category(u'5')
'Nd'
>>> unicodedata.category(u'ф') # Cyrillic f.
'Ll'
>>> unicodedata.category(u'٢') # Arabic-Indic digit two.
'Nd'
```
Ll means "letter, lowercase". Lu means "letter, uppercase". Nd means "number, decimal digit".
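If all you need is a yes/no test, the category check can be wrapped in a small helper. A minimal sketch; is_letter and is_cased_letter are hypothetical names, not something the standard library provides:

```python
import unicodedata


def is_letter(ch):
    """True if the single character `ch` is in any letter category
    (Lu, Ll, Lt, Lm or Lo), i.e. its category starts with 'L'."""
    return unicodedata.category(ch).startswith('L')


def is_cased_letter(ch):
    """True only for uppercase (Lu) and lowercase (Ll) letters."""
    return unicodedata.category(ch) in ('Lu', 'Ll')
```

The two differ for scripts without case: a CJK ideograph such as u'\u4e2d' has category Lo, so is_letter accepts it while is_cased_letter does not.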