An equivalent to string.ascii_letters for unicode strings in python 2.x?

天命终不由人 2020-12-16 00:30

In the "string" module of the standard library,

string.ascii_letters ## Same as string.ascii_lowercase + string.ascii_uppercase

is

4 answers
  • 2020-12-16 00:30

    That would be a pretty massive constant. Unicode currently covers over 100,000 different characters. So the answer is no.

    The real question is why you need it. There may be another way to solve your underlying problem, for example with the unicodedata module.

    Update: You can download files with all Unicode code point names and other information from ftp://ftp.unicode.org/, and do loads of interesting stuff with that.
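
    As a rough illustration of what the unicodedata module already exposes without downloading anything, the sketch below looks up the general category and official name of a few characters. The describe() helper is a hypothetical name made up for this example, and the code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def describe(ch):
    """Return (code point, general category, official name) for one character.

    describe() is a hypothetical helper for illustration, not a stdlib function.
    """
    return (u'U+%04X' % ord(ch),
            unicodedata.category(ch),
            # The second argument is returned when the character has no name.
            unicodedata.name(ch, u'<unnamed>'))


# Latin small a, Cyrillic small ef, Arabic-Indic digit two.
for ch in (u'a', u'\u0444', u'\u0662'):
    print(describe(ch))
```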

  • 2020-12-16 00:36

    You can construct your own constant of Unicode upper and lower case letters with:

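    import unicodedata as ud

    # Python 2: every code point in the Basic Multilingual Plane (U+0000 to U+FFFF)
    all_unicode = u''.join(unichr(i) for i in xrange(65536))
    # Keep only uppercase (Lu) and lowercase (Ll) letters
    unicode_letters = u''.join(c for c in all_unicode
                               if ud.category(c) in ('Lu', 'Ll'))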
    

    This makes a string 2153 characters long (narrow Unicode Python build). For membership tests such as "letter in unicode_letters", it would be faster to use a set instead:

    unicode_letters = set(unicode_letters)
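
    The same idea can be sketched so that it builds the set directly and also runs on Python 3 and on wide builds. This is only a sketch under assumptions: the names LETTER_CATEGORIES and build_letter_set are made up for illustration, and the unichr fallback exists only because Python 3 has no unichr.

```python
import sys
import unicodedata

try:
    unichr              # Python 2
except NameError:
    unichr = chr        # Python 3: chr() already returns a Unicode character

# Same two categories as above; add 'Lt', 'Lm', 'Lo' to cover every letter category.
LETTER_CATEGORIES = frozenset(['Lu', 'Ll'])


def build_letter_set(categories=LETTER_CATEGORIES, limit=sys.maxunicode + 1):
    """Return a frozenset of all characters below `limit` in the given categories."""
    found = set()
    for i in range(limit):
        ch = unichr(i)
        if unicodedata.category(ch) in categories:
            found.add(ch)
    return frozenset(found)


# Built once; membership tests are then constant-time on average.
unicode_letters = build_letter_set()
```

    The result depends on the Unicode database shipped with the interpreter, so the number of letters found will vary between Python versions.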
    
  • 2020-12-16 00:47

    As mentioned in previous answers, the string would indeed be way too long. So, you have to target (a) specific language(s).
    [EDIT: I realized it was the case for my original intended use, and for most uses, I guess. However, in the meantime, Mark Tolonen gave a good answer to the question as it was asked, so I chose his answer, although I used the following solution]

    This is easily done with the "locale" module:

    import locale
    import string
    code = 'fr_FR' ## Do NOT specify encoding (see below)
    locale.setlocale(locale.LC_CTYPE, code)
    encoding = locale.getlocale()[1]
    letters = string.letters.decode(encoding)
    

    with "letters" being a 117-character-long unicode string.

    Apparently, string.letters is dependent on the default encoding for the selected language code, rather than on the language itself. Setting the locale to fr_FR or de_DE or es_ES will update string.letters to the same value (since they are all encoded in ISO8859-1 by default).

    If you add an encoding to the language code (de_DE.UTF-8), the default encoding will be used instead for string.letters. That would cause a UnicodeDecodeError if you used the rest of the above code.

  • 2020-12-16 00:56

    There's no string, but you can check whether a character is a letter using the unicodedata module, in particular its category() function.

    >>> import unicodedata
    >>> unicodedata.category(u'a')
    'Ll'
    >>> unicodedata.category(u'A')
    'Lu'
    >>> unicodedata.category(u'5')
    'Nd'
    >>> unicodedata.category(u'ф') # Cyrillic f.
    'Ll'
    >>> unicodedata.category(u'٢') # Arabic-indic numeral for 2.
    'Nd'
    

    Ll means "letter, lowercase". Lu means "letter, uppercase". Nd means "number, decimal digit".
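
    Since every letter category starts with "L" (Lu, Ll, Lt, Lm, Lo), the check can be wrapped in a small predicate. This is a minimal sketch: is_letter and letters_only are hypothetical helper names, not part of unicodedata.

```python
import unicodedata


def is_letter(ch):
    """True if the single character `ch` is in any Unicode letter category."""
    return unicodedata.category(ch).startswith('L')


def letters_only(text):
    """Return `text` with every non-letter character dropped."""
    return u''.join(ch for ch in text if is_letter(ch))
```

    For example, letters_only(u'a1\u0444-\u0662B') keeps the Latin and Cyrillic letters and drops the digits and the hyphen.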
