Parallel file matching, Python

Backend · Open · 4 answers · 1161 views
渐次进展 2020-12-15 02:02

I am trying to improve a script that scans files for malicious code. We have a list of regex patterns in a file, one pattern per line. These regexes are for grep, as
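For reference, a minimal sequential version of such a scan might look like the sketch below. The paths and helper names here are placeholders for illustration, not taken from the original script:

```python
import os
import re

# Placeholder paths -- substitute the real locations.
patterns_file = os.path.expanduser("~/patterns")
topdir = os.path.expanduser("~/folder")

def load_combined_pattern(path):
    """Join all patterns into one alternation so each line is scanned once."""
    with open(path) as f:
        alternation = '|'.join(line.strip() for line in f if line.strip())
    return re.compile(r'(?:{})'.format(alternation))

def scan_tree(regex, topdir):
    """Return the paths of files containing at least one matching line."""
    matched = []
    for dirpath, _dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            # errors="ignore" avoids crashing on files with odd encodings.
            with open(pathname, errors="ignore") as f:
                if any(regex.search(line) for line in f):
                    matched.append(pathname)
    return matched
```

This scans every file on a single core, which is the bottleneck the parallel answers below try to remove.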

4 Answers
  •  独厮守ぢ
    2020-12-15 02:20

    Let me also show you how to do this in Ray, an open-source framework for writing parallel Python applications. The advantages of this approach are that it is fast, easy to write and extend (say you want to pass a lot of data between tasks or do some stateful accumulation), and that it can run on a cluster or in the cloud without modification. It is also efficient at using all the cores on a single machine (even very large machines with 100+ cores) and at transferring data between tasks.

    import os
    import re
    
    import ray
    
    ray.init()
    
    patterns_file = os.path.expanduser("~/patterns")
    topdir = os.path.expanduser("~/folder")
    
    # Combine all patterns into a single alternation so each line is scanned once.
    with open(patterns_file) as f:
        s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
    
    regex = re.compile(s_pat)
    
    @ray.remote
    def match(pattern, fname):
        """Return [fname] if any line matches, else []."""
        with open(fname, 'rt', errors='ignore') as f:
            for line in f:
                if pattern.search(line):
                    return [fname]  # stop at the first hit; no duplicate entries
        return []
    
    # Launch one task per file; each .remote() call returns a future immediately.
    futures = []
    for dirpath, dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            futures.append(match.remote(regex, pathname))
    
    # ray.get blocks until all tasks finish; flatten the per-file lists.
    matched = [fname for result in ray.get(futures) for fname in result]
    print("matched files", matched)
    

    More information, including how to run this on a cluster or in the cloud, is available in the documentation.
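If pulling in Ray is not an option, the same fan-out pattern can be sketched with the standard library alone. This is an alternative I am adding for comparison, not part of the answer above; threads work well when the scan is I/O-bound, and `concurrent.futures.ProcessPoolExecutor` can be swapped in when the regex work is CPU-bound:

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor

def scan_file(regex, pathname):
    """Return pathname if any line matches, else None."""
    with open(pathname, errors="ignore") as f:
        for line in f:
            if regex.search(line):
                return pathname
    return None

def parallel_scan(regex, topdir, max_workers=None):
    """Fan the per-file scans out over a thread pool and collect the hits."""
    paths = [os.path.join(dirpath, fname)
             for dirpath, _dirnames, filenames in os.walk(topdir)
             for fname in filenames]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: scan_file(regex, p), paths)
    return [p for p in results if p is not None]
```

Like the Ray version, this returns each matching file once; unlike Ray, it cannot scale past a single machine.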
