python-re: How do I match an alpha character

借酒劲吻你 2020-11-30 08:42

How can I match an alpha character with a regular expression? I want a character that is in \w but is not in \d. I want it unicode compatible tha

3 answers
  •  无人及你
    2020-11-30 08:58

    Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.

    Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

    (1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
    (2) digits => \d
    (3) underscore => _

    So what we don't want is anything in the character class [\W\d_] and consequently what we do want is anything in the character class [^\W\d_]

    Here's a simple example (Python 2.6).

    >>> import re
    >>> rx = re.compile(r"[^\W\d_]+", re.UNICODE)
    >>> rx.findall(u"abc_def,k9")
    [u'abc', u'def', u'k']
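
The Venn-diagram argument can also be checked mechanically. The following is a minimal sketch for Python 3 (an assumption; the session above is Python 2.6), where str patterns are Unicode-aware by default. The helper name in_w_minus_digits_and_underscore and the sampled code point range are illustrative choices, not part of the original answer.

```python
import re

# The character class derived above: not \W, not \d, not underscore.
LETTER = re.compile(r"[^\W\d_]")
WORD = re.compile(r"\w")
DIGIT = re.compile(r"\d")


def in_w_minus_digits_and_underscore(ch):
    """Spell out the set arithmetic: in \\w, but neither a \\d digit nor '_'."""
    return bool(WORD.fullmatch(ch)) and not DIGIT.fullmatch(ch) and ch != "_"


# Compare both definitions over a sample of code points (hypothetical range).
sample = [chr(cp) for cp in range(0x3100)]
mismatches = [
    ch for ch in sample
    if bool(LETTER.fullmatch(ch)) != in_w_minus_digits_and_underscore(ch)
]

print(len(mismatches))
```

If the two definitions agree on every sampled character, the list of mismatches is empty, which is what the derivation predicts.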
    

    Further exploration reveals a few quirks of this approach:

    >>> import unicodedata as ucd
    >>> allsorts = u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
    >>> for x in allsorts:
    ...     print repr(x), ucd.category(x), ucd.name(x)
    ...
    u'\u0473' Ll CYRILLIC SMALL LETTER FITA
    u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
    u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
    u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
    u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
    u'\u3020' So POSTAL MARK FACE
    u'\u3021' Nl HANGZHOU NUMERAL ONE
    >>> rx.findall(allsorts)
    [u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']
    

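    U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but it appears that Python interprets "digit" to mean "decimal digit" (category Nd), so it doesn't match \d. As a result it slips through [^\W\d_].

    U+24E8 (CIRCLED LATIN SMALL LETTER Y) is in category So (a symbol, not a letter), so it doesn't match \w and is therefore not matched by [^\W\d_] either.

    All CJK ideographs are classed as "letters" (category Lo) and thus match \w.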
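
The three observations above can be re-checked with a short script. This is a sketch assuming Python 3, where the u prefix and the re.UNICODE flag are no longer needed for str patterns; the describe helper and the report dictionary are hypothetical names introduced only for illustration.

```python
import re
import unicodedata

rx = re.compile(r"[^\W\d_]+")
allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"


def describe(ch):
    """Return (general category, matches \\w?, matches \\d?) for one character."""
    return (
        unicodedata.category(ch),
        bool(re.fullmatch(r"\w", ch)),
        bool(re.fullmatch(r"\d", ch)),
    )


# Map each code point label (e.g. "U+3021") to its classification.
report = {"U+%04X" % ord(ch): describe(ch) for ch in allsorts}
for label, row in report.items():
    print(label, row)

# Runs of characters accepted by the "letters only" class.
matches = rx.findall(allsorts)
print(matches)
```

The rows for U+3021, U+24E8 and U+4E0A are the ones that correspond to the three points listed above.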

    Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
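
If the quirks do matter, one alternative that stays within the standard library is to skip re for this test and classify characters by Unicode general category. Note that this is a different technique from the regex above, shown only as a sketch: it assumes Python 3 and that "letter" means any category starting with L (Lu, Ll, Lt, Lm, Lo). The names is_letter and find_letter_runs are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    """True if the character's Unicode general category is one of the L* categories."""
    return unicodedata.category(ch).startswith("L")


def find_letter_runs(text):
    """Return maximal runs of letters, similar in spirit to rx.findall(text)."""
    return ["".join(group) for keep, group in groupby(text, key=is_letter) if keep]


print(find_letter_runs("abc_def,k9"))
print(find_letter_runs("\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"))
```

Under this definition U+3021 (category Nl) is no longer accepted, while CJK ideographs (category Lo) still are.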
