Python, remove all non-alphabet chars from string

时光说笑 2020-11-30 21:08

I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet characters strewn about in the data. I have found this post, Stripping everything b

6 answers
  •  佛祖请我去吃肉
    2020-11-30 21:28

    It is advisable to use the PyPI regex module if you plan to match specific Unicode property classes. This library has also proven to be more stable, especially when handling large texts, and yields consistent results across various Python versions. All you need to do is keep it up to date.

    If you install it (using pip install regex or pip3 install regex), you may use

    import regex
    print(regex.sub(r'\P{L}+', '', 'ABCŁąć1-2!Абв3§4“5def”'))
    # => ABCŁąćАбвdef
    

    to remove all chunks of 1 or more characters other than Unicode letters from text. See an online Python demo. You may also use "".join(regex.findall(r'\p{L}+', 'ABCŁąć1-2!Абв3§4“5def”')) to get the same result.

    In Python re, in order to match any Unicode letter, one may use the [^\W\d_] construct (Match any unicode letter?).
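
    As a minimal sketch of how that construct behaves: \w covers letters, digits, and the underscore, so a negated class that excludes non-word characters (\W), digits (\d), and _ leaves essentially the letters. The sample string and the names LETTER and letters_only are illustrative assumptions, not part of the original answer.

```python
import re

# Negated class: anything that is not a non-word char, not a digit, not "_".
LETTER = re.compile(r'[^\W\d_]')

# Illustrative sample mixing Latin and Cyrillic letters with digits and punctuation.
sample = 'ABCŁąć1-2!Абв3§4_5def'

# Collect every single-letter match and glue the matches back together.
letters_only = ''.join(LETTER.findall(sample))
print(letters_only)  # => ABCŁąćАбвdef
```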

    So, to remove all non-letter characters, you may either match all letters and join the results:

    import re
    result = "".join(re.findall(r'[^\W\d_]', text))
    

    Or, remove all chars other than those matched with [^\W\d_]:

    # Pass DOTALL via flags=; the fourth positional argument of re.sub is count.
    result = re.sub(r'([^\W\d_])|.', r'\1', text, flags=re.DOTALL)
    

    See the regex demo online. However, you may get inconsistent results across various Python versions because the Unicode standard is evolving, and the set of chars matched with \w will depend on the Python version. Using the PyPI regex library is highly recommended to get consistent results.
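
    For the word count use case from the question, the removal step might be wired into a mapper roughly like this. It is only a sketch under stated assumptions: clean_word, map_words, and reduce_counts are hypothetical names, lowercasing and whitespace splitting are choices made for illustration, and a real MapReduce job would read from its framework's input rather than from a Python string.

```python
import re
from collections import Counter

# Complement of [^\W\d_]: runs of non-word chars, digits, or underscores.
NON_LETTER = re.compile(r'[\W\d_]+')

def clean_word(token):
    """Hypothetical helper: drop non-letter characters and lowercase the rest."""
    return NON_LETTER.sub('', token).lower()

def map_words(line):
    """Hypothetical mapper: yield (word, 1) for every non-empty cleaned token."""
    for token in line.split():
        word = clean_word(token)
        if word:  # skip tokens that contained no letters at all
            yield word, 1

def reduce_counts(pairs):
    """Hypothetical reducer: sum the counts emitted for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

counts = reduce_counts(map_words("Hello, world! hello... 123 w0rld"))
print(counts)
```

    Note that a token such as w0rld becomes wrld here, because the digit is removed from inside the word; whether that is the desired behavior depends on the data.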
