I am writing a Python MapReduce word count program. The problem is that there are many non-alphabetic characters strewn about in the data. I have found this post: Stripping everything b
It is advisable to use the PyPI regex module if you plan to match specific Unicode property classes. This library has also proven to be more stable, especially when handling large texts, and yields consistent results across various Python versions. All you need to do is keep it up to date.
If you install it (using pip install regex or pip3 install regex), you may use
import regex
print(regex.sub(r'\P{L}+', '', 'ABCŁąć1-2!Абв3§4“5def”'))
# => ABCŁąćАбвdef
to remove all chunks of 1 or more characters other than Unicode letters from text. See an online Python demo. You may also use "".join(regex.findall(r'\p{L}+', 'ABCŁąć1-2!Абв3§4“5def”')) to get the same result.
In Python re, in order to match any Unicode letter, one may use the [^\W\d_] construct (Match any unicode letter?).
So, to remove all non-letter characters, you may either match all letters and join the results:
import re
result = "".join(re.findall(r'[^\W\d_]', text))
Or, remove all chars other than those matched with [^\W\d_]:
result = re.sub(r'([^\W\d_])|.', r'\1', text, flags=re.DOTALL)
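As a quick sanity check, the two re approaches above can be compared on the sample string from the regex example. This is only a sketch: the variable names are illustrative, and flags is passed by keyword because the fourth positional parameter of re.sub is count, not flags.

```python
import re

text = 'ABCŁąć1-2!Абв3§4“5def”'

# Approach 1: collect every letter and join them back together.
joined = "".join(re.findall(r'[^\W\d_]', text))

# Approach 2: keep group 1 (a letter) and drop any other character.
# An unmatched group 1 is replaced with an empty string (Python 3.5+).
substituted = re.sub(r'([^\W\d_])|.', r'\1', text, flags=re.DOTALL)

print(joined)                 # ABCŁąćАбвdef
print(joined == substituted)  # True
```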
See the regex demo online. However, you may get inconsistent results across various Python versions because the Unicode standard is evolving, and the set of chars matched with \w will depend on the Python version. Using the PyPI regex library is highly recommended to get consistent results.
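To tie this back to the word count in the question, here is a minimal sketch that uses the standard-library [^\W\d_]+ pattern to pull runs of letters out of each line as words. The map_words and reduce_counts helpers and the sample lines are hypothetical and not part of any MapReduce framework; lowercasing the words is an assumption about how they should be grouped.

```python
import re
from collections import Counter

# One or more consecutive Unicode letters (stdlib `re` idiom from above).
LETTERS = re.compile(r'[^\W\d_]+')

def map_words(line):
    """Hypothetical mapper: yield (word, 1) for each run of letters."""
    for word in LETTERS.findall(line):
        yield word.lower(), 1

def reduce_counts(pairs):
    """Hypothetical reducer: sum the counts emitted for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

# Made-up sample input, for illustration only.
lines = ["Hello, world! 123", "hello again -- Wörld?"]
counts = reduce_counts(pair for line in lines for pair in map_words(line))
print(counts["hello"])  # 2
```

Note that this splits on every non-letter, so a word such as "don't" would be counted as "don" and "t"; whether that is acceptable depends on the data.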