Python remove anything that is not a letter or number

甜味超标 2020-12-24 01:40

I'm having a little trouble with Python regular expressions.

What is a good way to remove all characters in a string that are not letters or numbers?

Thanks

7 answers
  • 2020-12-24 02:27

    [\w] matches any alphanumeric character or an underscore.

    [\W] matches anything that is not (alphanumeric or underscore), that is, characters that are neither alphanumeric nor an underscore. Underscores therefore survive a substitution on [\W] alone.

    You need [\W_] to remove ALL non-alphanumerics.
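
A minimal sketch of the difference, written for Python 3. The sample string and variable names are made up for illustration:

```python
import re

s = "snake_case: 42%!"

# \W alone leaves underscores in place, because "_" counts as a word character.
only_W = re.sub(r"\W", "", s)

# Adding "_" to the character class removes underscores as well.
W_and_underscore = re.sub(r"[\W_]", "", s)

print(only_W)            # snake_case42
print(W_and_underscore)  # snakecase42
```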

    With re.sub(), matching whole runs with [\W_]+ is much more efficient than matching one character at a time, because it reduces the number of (expensive) substitutions.
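
One way to see the reduction is to count substitutions with re.subn(), which returns the new string together with the number of replacements made. This is only an illustrative sketch in Python 3 with a made-up input; it shows the count, not a timing measurement, and the actual speed-up depends on the data:

```python
import re

s = "a--__--b!!!c"

# One substitution per offending character.
out_single, n_single = re.subn(r"[\W_]", "", s)

# One substitution per run of offending characters.
out_runs, n_runs = re.subn(r"[\W_]+", "", s)

print(out_single == out_runs)  # True
print(n_single, n_runs)        # 9 2
```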

    Now all you need is to define alphanumerics. The snippets below use Python 2 syntax (ur'' literals, xrange):

    str object, only ASCII A-Za-z0-9:

        re.sub(r'[\W_]+', '', s)
    

    str object, only locale-defined alphanumerics:

        re.sub(r'[\W_]+', '', s, flags=re.LOCALE)
    

    unicode object, all alphanumerics:

        re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)
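
The three calls above are Python 2. A rough Python 3 equivalent might look like the sketch below, assuming the usual Python 3 behaviour: str patterns match Unicode alphanumerics by default, re.ASCII restricts \w to ASCII, and re.LOCALE is only accepted with bytes patterns. The sample string is borrowed from this answer's Unicode example; the variable names are illustrative.

```python
import re

s = "a_b A_Z \xff \u0404"

# ASCII letters and digits only: re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_only = re.sub(r"[\W_]+", "", s, flags=re.ASCII)

# All Unicode alphanumerics: the default for str patterns in Python 3.
unicode_all = re.sub(r"[\W_]+", "", s)

# bytes input needs a bytes pattern; \w is ASCII-only for bytes by default.
bytes_ascii = re.sub(rb"[\W_]+", b"", b"a_b A_Z \xff")

print(ascii_only)          # abAZ
print(ascii(unicode_all))  # 'abAZ\xff\u0404'
print(bytes_ascii)         # b'abAZ'
```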
    

    Examples for str object:

    >>> import re, locale
    >>> sall = ''.join(chr(i) for i in xrange(256))
    >>> len(sall)
    256
    >>> re.sub('[\W_]+', '', sall)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    >>> locale.setlocale(locale.LC_ALL, '')
    'English_Australia.1252'
    >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
    x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
    xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
    xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
    xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
    # above output wrapped at column 80
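
The session above is Python 2, where chr() yields single bytes. A hedged Python 3 port of the first step is sketched here; in Python 3 chr() yields Unicode characters, so re.ASCII is needed to get the same 62-character result, and the default keeps additional Latin-1 letters and digits:

```python
import re
import string

# Every code point from 0 to 255 (a Python 3 str, so these are Unicode characters).
sall = "".join(chr(i) for i in range(256))

# ASCII-only definition of alphanumeric.
ascii_kept = re.sub(r"[\W_]+", "", sall, flags=re.ASCII)

# Default Unicode definition of alphanumeric.
unicode_kept = re.sub(r"[\W_]+", "", sall)

print(ascii_kept == string.digits + string.ascii_uppercase + string.ascii_lowercase)  # True
print(len(ascii_kept))                      # 62
print(len(unicode_kept) > len(ascii_kept))  # True
```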
    

    Unicode example:

    >>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
    u'abAZ\xff\u0404'
    