I am writing a Python MapReduce word count program. The problem is that there are many non-alphabetic characters strewn about in the data. I have found this post: Stripping everything b
You can use the re.sub() function to remove these characters:
>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'
re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)
"[^a-zA-Z]+" - look for any group of one or more characters that are NOT a-zA-Z.
"" - replace the matched characters with "" (the empty string), which deletes them.
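For the word count case in the question, the same pattern can be applied to each token before counting. This is only a rough sketch: the names clean_words and count_words are made up for illustration, lowercasing the words is my assumption (drop it if case matters), and the counting is done in-process with a Counter rather than through any particular MapReduce framework.

```python
import re
from collections import Counter

# Compile once; matches any run of characters that are not ASCII letters.
NON_LETTERS = re.compile(r"[^a-zA-Z]+")

def clean_words(line):
    """Yield letters-only words from one line (hypothetical helper)."""
    for token in line.split():
        # Lowercasing is an assumption, so "Hello" and "hello" count together.
        word = NON_LETTERS.sub("", token).lower()
        if word:  # skip tokens that contained no letters at all, e.g. "123"
            yield word

def count_words(lines):
    """Count cleaned words across an iterable of lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(clean_words(line))
    return counts
```

In a streaming mapper you would presumably call clean_words on each input line and emit one (word, 1) pair per result instead of updating a Counter.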