I am trying to improve a script that scans files for malicious code. We have a list of regex patterns in a file, one pattern per line. These regexes are for grep as
Let me also show you how to do this in Ray, an open-source framework for writing parallel Python applications. The advantage of this approach is that it is fast and easy to write and extend (say you want to pass a lot of data between the tasks or do some stateful accumulation), and the same code can run on a cluster or in the cloud without modification. Ray is also very efficient at using all the cores of a single machine (even very large machines with 100+ cores) and at transferring data between tasks.
import os
import re

import ray

ray.init()

patterns_file = os.path.expanduser("~/patterns")
topdir = os.path.expanduser("~/folder")

# Combine all patterns into a single alternation so each file
# needs only one scan per line.
with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
regex = re.compile(s_pat)

@ray.remote
def match(pattern, fname):
    # Return the file name on the first matching line, else None.
    with open(fname, 'rt', errors='ignore') as f:
        for line in f:
            if pattern.search(line):
                return fname
    return None

# Put the compiled regex in the object store once, instead of
# shipping a copy with every task.
regex_ref = ray.put(regex)

futures = []
for dirpath, dirnames, filenames in os.walk(topdir):
    for fname in filenames:
        pathname = os.path.join(dirpath, fname)
        futures.append(match.remote(regex_ref, pathname))

matched = [r for r in ray.get(futures) if r is not None]
print("matched files", matched)
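If you want to check the combined-pattern logic on its own, without Ray, the same matching step can be sketched with plain `re`. The pattern list and sample lines here are hypothetical stand-ins for the contents of `~/patterns` and a scanned file:

```python
import re

# Hypothetical pattern list, standing in for the lines of ~/patterns.
patterns = [r"eval\(", r"base64_decode", r"system\("]

# Same construction as the script: join the patterns into one alternation.
regex = re.compile(r'(?:{})'.format('|'.join(patterns)))

lines = [
    'print("hello")',
    'eval(user_input)',
    'x = base64_decode(blob)',
]

# Keep only the lines the combined regex matches.
matched = [ln for ln in lines if regex.search(ln)]
print(matched)  # the two suspicious lines
```

This is also a convenient place to unit-test the pattern file before handing it to the parallel scanner.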
More information, including how to run this on a cluster or in the cloud, is available in the Ray documentation.
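As a rough sketch of the cluster case: the only change to the script is how `ray.init()` is called, since the tasks themselves are location-independent. The address value below is an assumption and depends on how your cluster was started; this fragment is not runnable without an existing cluster:

```python
import ray

# Connect to an already-running Ray cluster instead of starting a
# local instance. "auto" picks up the cluster this node belongs to;
# see the Ray docs for starting head and worker nodes.
ray.init(address="auto")
```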