Parallel file matching, Python

渐次进展 2020-12-15 02:02

I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regexes are for grep as

4 Answers
  • 2020-12-15 02:16

    I'm a bit confused as to how your Python script ended up being faster than your find/grep combo. If you want to use grep in a way somewhat similar to what's suggested by Ron Smith in his answer, you can do something like

    find -type f | xargs -d \\n -P 8 -n 100 grep --file=/root/patterns
    

    to launch grep processes, each of which will handle 100 files before exiting, keeping up to 8 such processes active at any one time. Having each one process 100 files should make its startup overhead negligible.

    Note: the -d \\n option to xargs is a GNU extension which won't work on all POSIX-ish systems. It specifies that the *d*elimiter between filenames is a newline. Although technically filenames can contain newlines, in practice nobody does this and keeps their jobs. For compatibility with non-GNU xargs, you need to add the -print0 option to find and use -0 instead of -d \\n with xargs. This arranges for the null byte \0 (hex 0x00) to be used as the delimiter by both find and xargs.

    You could also take the approach of first counting the number of files to be grepped

    NUMFILES="$(find -type f | wc -l)";
    

    and then using that number to get an even split among the 8 processes (assuming bash as shell)

    find -type f | xargs -d \\n -P 8 -n $(($NUMFILES / 8 + 1)) grep --file=/root/patterns
    

    I think this might work better because the disk I/O of find won't be interfering with the disk I/O of the various greps. I suppose it depends in part on how large the files are, and whether they are stored contiguously — with small files, the disk will be seeking a lot anyway, so it won't matter as much. Note also that, especially if you have a decent amount of RAM, subsequent runs of such a command will be faster because some of the files will be saved in your memory cache.

    Of course, you can parameterize the 8 to make it easier to experiment with different numbers of concurrent processes.

    As ed. mentions in the comments, it's quite possible that the performance of this approach will still be less impressive than that of a single-process grep -r. I guess it depends on the relative speed of your disk [array], the number of processors in your system, etc.

  • 2020-12-15 02:20

    Let me also show you how to do this in Ray, which is an open-source framework for writing parallel Python applications. The advantage of this approach is that it is fast, easy to write, and easy to extend (say you want to pass a lot of data between the tasks or do some stateful accumulation; see the sketch after the example below), and it can also be run on a cluster or in the cloud without modification. It is also very efficient at utilizing all cores on a single machine (even very large machines with 100 or more cores) and at transferring data between tasks.

    import os
    import re
    
    import ray
    
    ray.init()
    
    patterns_file = os.path.expanduser("~/patterns")
    topdir = os.path.expanduser("~/folder")
    
    # Build one alternation like (?:foo|bar|baz) from the pattern file.
    with open(patterns_file) as f:
        s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
    
    regex = re.compile(s_pat)
    
    @ray.remote
    def match(pattern, fname):
        """Return fname if any line in the file matches, else None."""
        with open(fname, 'rt') as f:
            for line in f:
                if pattern.search(line):
                    return fname
        return None
    
    futures = []
    for dirpath, dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            futures.append(match.remote(regex, pathname))
    
    # ray.get blocks until every task has finished; drop non-matching files.
    matched = [r for r in ray.get(futures) if r is not None]
    print("matched files", matched)

    More information, including how to run this on a cluster or in the cloud, is available in the documentation.
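
    As a side note on the stateful-accumulation point mentioned above: in Ray that is typically done with an actor. Here is a minimal, hypothetical sketch (the MatchCounter class and its methods are illustrative names, not part of the example above):

    import ray
    
    ray.init()
    
    # Hypothetical example of stateful accumulation with a Ray actor.
    @ray.remote
    class MatchCounter:
        """Actor that keeps a running total in its own process."""
        def __init__(self):
            self.count = 0
    
        def add(self, n):
            self.count += n
    
        def total(self):
            return self.count
    
    counter = MatchCounter.remote()
    # Method calls run on the actor's process in submission order.
    counter.add.remote(2)
    counter.add.remote(3)
    print(ray.get(counter.total.remote()))  # prints 5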

  • 2020-12-15 02:38

    If you are willing to upgrade to Python 3.2 or later, you can take advantage of concurrent.futures.ProcessPoolExecutor. I think it will improve performance over the popen method you attempted, because it pre-creates a pool of processes, whereas your popen method creates a new process every time. You could write your own code to do the same thing for an earlier version if you can't move to 3.2 for some reason.
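
    To make that concrete, here is a minimal sketch of what that could look like, not a definitive implementation; it reuses the ~/patterns and ~/folder paths from the Ray answer purely for illustration, and the file_matches/all_files helpers are just illustrative names:

    import concurrent.futures
    import os
    import re
    
    # Placeholder paths, matching the other answers; adjust to your setup.
    patterns_file = os.path.expanduser("~/patterns")
    topdir = os.path.expanduser("~/folder")
    
    # Combine the patterns into a single alternation, as in the other answers.
    with open(patterns_file) as f:
        s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
    pat = re.compile(s_pat)
    
    def file_matches(fname):
        """Return fname if any line matches the combined pattern, else None."""
        try:
            with open(fname, 'rt', errors='ignore') as f:
                for line in f:
                    if pat.search(line):
                        return fname
        except OSError:
            pass
        return None
    
    def all_files(topdir):
        """Yield the full pathname of every file under topdir."""
        for dirpath, dirnames, filenames in os.walk(topdir):
            for fname in filenames:
                yield os.path.join(dirpath, fname)
    
    if __name__ == '__main__':
        # The worker processes are created once and reused for every file,
        # unlike a popen-per-file approach.
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for result in executor.map(file_matches, all_files(topdir)):
                if result is not None:
                    print(result)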

  • 2020-12-15 02:39

    I think that, rather than using the threading module, you should be using the multiprocessing module for your Python solution. Python threads can run afoul of the GIL; the GIL is not a problem if you simply have multiple Python processes going.

    I think that for what you are doing, a pool of worker processes is just what you want. By default, the pool creates one process for each core in your system. Just call the .map() method with the function that does the checking and the list of filenames to check.

    http://docs.python.org/library/multiprocessing.html

    If this is not faster than your threading implementation, then I don't think the GIL is your problem.

    EDIT: Okay, I'm adding a working Python program. This uses a pool of worker processes to open each file and search for the pattern in each. When a worker finds a filename that matches, it simply prints it (to standard output) so you can redirect the output of this script into a file and you have your list of files.

    EDIT: I think this version is slightly easier to read and easier to understand.

    I timed this, searching through the files in /usr/include on my computer. It completes the search in about half a second. Using find piped through xargs to run as few grep processes as possible, it takes about 0.05 seconds, about a 10x speedup. But I hate the baroque weird language you must use to get find to work properly, and I like the Python version. And perhaps on really big directories the disparity would be smaller, as part of the half-second for Python must have been startup time. And maybe half a second is fast enough for most purposes!

    import multiprocessing as mp
    import os
    import re
    import sys
    
    from stat import S_ISREG
    
    
    # uncomment these if you really want a hard-coded $HOME/patterns file
    #home = os.environ.get('HOME')
    #patterns_file = os.path.join(home, 'patterns')
    
    target = sys.argv[1]
    size_limit = int(sys.argv[2])
    assert size_limit >= 0
    patterns_file = sys.argv[3]
    
    
    # build s_pat as string like:  (?:foo|bar|baz)
    # This will match any of the sub-patterns foo, bar, or baz
    # but the '?:' means Python won't bother to build a "match group".
    with open(patterns_file) as f:
        s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
    
    # pre-compile pattern for speed
    pat = re.compile(s_pat)
    
    
    def walk_files(topdir):
        """yield up full pathname for each file in tree under topdir"""
        for dirpath, dirnames, filenames in os.walk(topdir):
            for fname in filenames:
                pathname = os.path.join(dirpath, fname)
                yield pathname
    
    def files_to_search(topdir):
        """yield up full pathname for only files we want to search"""
        for fname in walk_files(topdir):
            try:
                # if it is a regular file and big enough, we want to search it
                sr = os.stat(fname)
                if S_ISREG(sr.st_mode) and sr.st_size >= size_limit:
                    yield fname
            except OSError:
                pass
    
    def worker_search_fn(fname):
        with open(fname, 'rt') as f:
            # read one line at a time from file
            for line in f:
                if re.search(pat, line):
                    # found a match! print filename to stdout
                    print(fname)
                    # stop reading file; just return
                    return
    
    # Create the pool only in the main process (keeps this safe on platforms
    # that use the "spawn" start method).
    if __name__ == '__main__':
        mp.Pool().map(worker_search_fn, files_to_search(target))
    