GNU parallel: assign one thread for each node (directories and sub* directories) of an entire tree from a start directory

Question

I would like to take full advantage of the parallel command on macOS (there seem to be 2 versions, GNU's and Ole Tange's, but I am not sure).

With the following command:

parallel -j8  find {} ::: *

I get a big speedup if I am in a directory containing 8 subdirectories. But if all of these subdirectories are small except for one, I end up with a single thread working on the one "big" directory.

Is there a way to keep parallelizing inside this "big" directory? I mean, can the one remaining thread be helped by the other threads (those that have already finished the small subdirectories)?

Ideally, parallel would "switch automatically" once find has finished all the small subdirectories in the command above. Maybe I am asking too much?
Another possible optimization, if it exists, for an ordinary directory tree: is there a way, similar to make -j8 for example, to assign each thread to a sub-(sub-(sub-...)) directory so that, once a thread has finished exploring its directory, it moves on to another sub-(sub-(sub-...)) directory? (Keep in mind that I mostly want to use this optimization with the find command.)

Of course, the total number of running threads should never exceed the number given to parallel (parallel -j8 in my example above): if the tree has more nodes (1 node = 1 directory) than threads, the thread count stays capped at that number.

I know that parallelizing in a recursive context is tricky, but maybe I could gain a significant factor when searching for a file in a big tree?

That is why I mention make -j8: I don't know how it is implemented, but it makes me think the same could be done with the parallel/find combination shown at the beginning of my post.

Finally, I would like your advice on these 2 questions and, more generally, on what is and is not currently possible among these optimizations for finding a file more quickly with the classic find command.

UPDATE 1: As @OleTange said, I don't know in advance the directory structure of what I want gupdatedb to index, so it is difficult to know the maxdepth beforehand. Your solution is interesting, but the first execution of find is not multithreaded: it does not use the parallel command. I am a little surprised that a multithreaded version of gupdatedb does not exist: on paper it is feasible, but actually coding it into the GNU gupdatedb script on macOS 10.15 is more difficult.

If anyone has other suggestions, I will gladly take them!

Answer 1:

If you are going to parallelize find you need to be sure that your disk can deliver data.

For magnetic drives you will rarely see a speedup. For RAID, network drives and SSD sometimes, and for NVMe often.
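A rough way to check whether a given disk benefits is simply to time both variants on the same tree. The sketch below assumes bash and GNU Parallel; compare_find is a name made up for this example, and the parallel run skips the top two levels, so treat the numbers as an indication only.

```shell
#!/bin/bash
# Sketch: time a serial and a parallel walk of the same tree.
# compare_find is a hypothetical helper name, not part of find or parallel.
compare_find() {
  (
    cd "$1" || exit 1
    echo "serial:"
    time find . > /dev/null
    echo "parallel:"
    # One find job per entry at depth 2 (hidden entries are skipped by the glob).
    time parallel find ::: */* > /dev/null
  )
}
```

Call it as compare_find /path/to/tree and run it more than once: the second run often benefits from file-system metadata that the first run left in the cache, which can hide the effect of the disk itself.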

The simplest way to parallelize find is to use */*:

parallel find ::: */*

Or */*/*:

parallel find ::: */*/*

This will search in sub-sub dirs and in sub-sub-sub dirs.

They will not search the top dirs, but that can be done by running a single additional find with the appropriate -maxdepth.
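As a sketch of that combination for the */* case: one serial find with -maxdepth 1 prints the start directory and its direct children, and the parallel jobs cover everything from depth 2 down. The demo tree and its file names are invented for illustration, and the example assumes there are no hidden entries in the first two levels, because the globs skip dotfiles (in bash, shopt -s dotglob changes that).

```shell
#!/bin/bash
# Sketch only: the demo tree below is made up so the example is self-contained.
demo=$(mktemp -d)
mkdir -p "$demo/a/x/deep" "$demo/b/y"
touch "$demo/a/x/deep/f1" "$demo/b/y/f2" "$demo/b/f3"
cd "$demo" || exit 1

# Serial part: "." and its direct children (depth 0 and 1).
# Parallel part: one find job per */* entry (depth 2 and below).
# The ./ prefix keeps paths in the same form that "find ." prints.
combined=$( { find . -maxdepth 1; parallel find ./{} ::: */*; } | sort )

# Reference: a plain serial find over the whole tree.
serial=$(find . | sort)

if [ "$combined" = "$serial" ]; then
  echo "same entries"
else
  echo "entries differ"
fi
```

For the */*/* variant the serial part would need -maxdepth 2 instead, so that the depth-2 entries are not lost.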

The above solution assumes you know something about the directory structure, so it is not a general solution.

I have never heard of a general solution. It would involve a breadth first search that would start some workers in parallel. I can see how it could be programmed, but I have never seen it.

If I were to implement it, it would be something like this (lightly tested):

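#!/bin/bash

# mktemp replaces tempfile, which is generally only available on Debian-based systems.
tmp=$(mktemp)
# Print the direct children of "$1"; prints nothing if "$1" is not a directory.
myfind() {
  find "$1" -mindepth 1 -maxdepth 1
}
export -f myfind
# First level: entries directly under the start directory.
myfind . | tee "$tmp"
# Each pass expands the previous level in parallel, until a level comes back empty.
while [ -s "$tmp" ] ; do
    tmp2=$(mktemp)
    # --lb (--line-buffer) passes output through line by line while jobs run.
    parallel --lb myfind < "$tmp" | tee "$tmp2"
    mv "$tmp2" "$tmp"
done
rm "$tmp"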

(PS: I have reason to believe the parallel written by Ole Tange and GNU Parallel are one and the same).
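If you are unsure which parallel is installed, one simple check is to look at the first line of the --version output, which GNU Parallel starts with "GNU parallel". This is only a sketch: it assumes that any other program named parallel does not print that banner, and the flavor variable is just a name chosen for the example.

```shell
#!/bin/bash
# Sketch: detect GNU Parallel from the first line of its --version output.
first_line=$(parallel --version 2>/dev/null | head -n 1)
case "$first_line" in
  "GNU parallel"*) flavor="GNU Parallel" ;;
  *)               flavor="not GNU Parallel, or parallel is not installed" ;;
esac
echo "$flavor"
```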

Source: https://stackoverflow.com/questions/63332050/gnu-parallel-assign-one-thread-for-each-node-directories-and-sub-directories

Tags

multithreading

parallel-processing

tree

find

gnu-parallel