I am writing a python MapReduce word count program. Problem is that there are many non-alphabet chars strewn about in the data, I have found this post Stripping everything b
The fastest method is regex
#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)
""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)
#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
2.6002226710006653 Method 1 Regex
5.739747313000407 Method 2 Filter + Join
6.540099570000166 Method 3 Join