Parallel file matching, Python

渐次进展 2020-12-15 02:02

I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern per line. These regexes are for grep as

4 Answers
  •  难免孤独
    2020-12-15 02:16

    I'm a bit confused as to how your Python script ended up being faster than your find/grep combo. If you want to use grep in a way somewhat similar to what's suggested by Ron Smith in his answer, you can do something like

    find -type f | xargs -d \\n -P 8 -n 100 grep --file=/root/patterns
    

    to launch grep processes which will each handle 100 files before exiting, keeping up to 8 such processes active at any one time. Batching 100 files per invocation should make each process's startup overhead negligible.

    Note: the -d \\n option to xargs is a GNU extension which won't work on all POSIX-ish systems. It specifies that the *d*elimiter between filenames is a newline. Although technically filenames can contain newlines, in practice nobody does this and keeps their job. For compatibility with non-GNU xargs, add the -print0 option to find and use -0 instead of -d \\n with xargs; this arranges for the null byte \0 (hex 0x00) to be used as the delimiter by both find and xargs.
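    For non-GNU systems, a minimal sketch of that null-delimited variant might look like the following (the explicit . is added because POSIX find requires a starting path, and note that -P is itself an extension, albeit one both GNU and BSD xargs support):

    find . -type f -print0 | xargs -0 -P 8 -n 100 grep --file=/root/patterns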

    You could also take the approach of first counting the number of files to be grepped

    NUMFILES="$(find -type f | wc -l)"
    

    and then using that number to get an even split among the 8 processes (assuming bash as shell)

    find -type f | xargs -d \\n -P 8 -n $(($NUMFILES / 8 + 1)) grep --file=/root/patterns
    

    I think this might work better because the disk I/O of find won't be interfering with the disk I/O of the various greps. I suppose it depends in part on how large the files are, and whether they are stored contiguously — with small files, the disk will be seeking a lot anyway, so it won't matter as much. Note also that, especially if you have a decent amount of RAM, subsequent runs of such a command will be faster because some of the files will be saved in your memory cache.

    Of course, you can parameterize the 8 to make it easier to experiment with different numbers of concurrent processes.
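    For instance, a minimal sketch of such a parameterization (assuming bash, and assuming GNU coreutils' nproc is available to supply a CPU-count default) might be:

    # Number of concurrent greps: first script argument, defaulting to the CPU count.
    NPROCS="${1:-$(nproc)}"
    # Count the files once, then split them evenly across the grep processes.
    NUMFILES="$(find -type f | wc -l)"
    find -type f | xargs -d \\n -P "$NPROCS" -n $((NUMFILES / NPROCS + 1)) grep --file=/root/patterns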

    As ed. mentions in the comments, it's quite possible that the performance of this approach will still be less impressive than that of a single-process grep -r. I guess it depends on the relative speed of your disk [array], the number of processors in your system, etc.
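    For reference, the single-process baseline being compared against would be along the lines of:

    grep -r --file=/root/patterns .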
