I am trying to improve a script that scans files for malicious code. We have a list of regex patterns in a file, one pattern per line. These regexes are for grep as
Let me also show you how to do this in Ray, an open-source framework for writing parallel Python applications. The advantage of this approach is that it is fast and easy to write and extend (say you want to pass a lot of data between the tasks or do some stateful accumulation), and the same code can run on a cluster or in the cloud without modification. Ray is also very efficient at using all the cores of a single machine (even very large machines with 100+ cores) and at transferring data between tasks.
import os
import re

import ray

ray.init()

patterns_file = os.path.expanduser("~/patterns")
topdir = os.path.expanduser("~/folder")

# Combine all patterns into a single alternation so each file
# needs only one scan per line.
with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
regex = re.compile(s_pat)

@ray.remote
def match(pattern, fname):
    # Return the file name on the first matching line, else None.
    with open(fname, 'rt', errors='ignore') as f:
        for line in f:
            if pattern.search(line):
                return fname
    return None

# Put the compiled regex in the object store once, instead of
# shipping a copy with every task.
regex_ref = ray.put(regex)

futures = []
for dirpath, dirnames, filenames in os.walk(topdir):
    for fname in filenames:
        pathname = os.path.join(dirpath, fname)
        futures.append(match.remote(regex_ref, pathname))

matched = [r for r in ray.get(futures) if r is not None]
print("matched files", matched)
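If you want to check the combined-pattern logic on its own, without Ray, the same matching step can be sketched with plain `re`. The pattern list and sample lines here are hypothetical stand-ins for the contents of `~/patterns` and a scanned file:

```python
import re

# Hypothetical pattern list, standing in for the lines of ~/patterns.
patterns = [r"eval\(", r"base64_decode", r"system\("]

# Same construction as the script: join the patterns into one alternation.
regex = re.compile(r'(?:{})'.format('|'.join(patterns)))

lines = [
    'print("hello")',
    'eval(user_input)',
    'x = base64_decode(blob)',
]

# Keep only the lines the combined regex matches.
matched = [ln for ln in lines if regex.search(ln)]
print(matched)  # the two suspicious lines
```

This is also a convenient place to unit-test the pattern file before handing it to the parallel scanner.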
More information, including how to run this on a cluster or in the cloud, is available in the Ray documentation.
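As a rough sketch of the cluster case: the only change to the script is how `ray.init()` is called, since the tasks themselves are location-independent. The address value below is an assumption and depends on how your cluster was started; this fragment is not runnable without an existing cluster:

```python
import ray

# Connect to an already-running Ray cluster instead of starting a
# local instance. "auto" picks up the cluster this node belongs to;
# see the Ray docs for starting head and worker nodes.
ray.init(address="auto")
```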