Question
Context
I need to optimize deduplication using 'sort -u' and my linux machine has an old implementation of 'sort' command (i.e. 5.97) that has not '--parallel' option. Although 'sort' implements parallelizable algorithms (e.g. merge-sort), I need to make such parallelization explicit. Therefore, I make it by hand via 'xargs' command that outperforms ~2.5X w.r.t. to the single 'sort -u' method ... when it works fine.
Here is the intuition of what I am doing.
I am running a bash script that splits an input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The resulting parts are passed to the 'xargs' command in order to perform parallel deduplication via the sortu.sh script (details at the end). sortu.sh wraps the invocation of 'sort -u' and prints the resulting file name (e.g. "sortu.sh file.txt.part1" prints "file.txt.part1.sorted"). The resulting sorted parts are then passed to 'sort --merge -u', which merges/deduplicates them on the assumption that each part is already sorted.
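The split/parallel-sort/merge pipeline described above can be sketched end-to-end as follows. This is a minimal illustration, not the asker's actual script: the sample data, the 2-line part size, and the inline `sh -c` worker (standing in for sortu.sh) are all illustrative.

```shell
#!/bin/bash
# End-to-end sketch of: split -> parallel 'sort -u' -> 'sort --merge -u'.
set -e

# Sample input with duplicates (illustrative)
printf '%s\n' banana apple cherry apple banana date > file.txt

# 1. Split into parts of at most 2 lines each (file.txt.part.aa, .ab, ...)
split -l 2 file.txt file.txt.part.

# 2. Sort/deduplicate each part in parallel, up to 4 jobs at once;
#    the inline 'sh -c' worker stands in for sortu.sh
ls file.txt.part.?? | xargs --max-args=1 --max-procs=4 \
    sh -c 'sort -u "$1" > "$1.sorted"' _

# 3. Merge the pre-sorted parts, deduplicating across part boundaries
sort --merge -u file.txt.part.??.sorted > file.txt.sorted
```

Step 3 relies on each part already being sorted; `sort --merge` then only interleaves them, which is cheap compared to a full sort of the whole file.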
The problem I am experiencing is in the parallelization via 'xargs'. Here is a simplified version of my code:
AVAILABLE_CORES=4
PARTS="file.txt.part1
file.txt.part2
file.txt.part3
file.txt.part4"
SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
    --max-procs=$AVAILABLE_CORES \
    bash sortu.sh \
)
...
#More code for merging the resulting parts $SORTED_PARTS
...
The expected result is a list of the sorted parts in the variable SORTED_PARTS:
echo "$SORTED_PARTS"
file.txt.part1.sorted
file.txt.part2.sorted
file.txt.part3.sorted
file.txt.part4.sorted
Symptom
Nevertheless, a sorted part is sometimes missing; for instance, file.txt.part2.sorted:
echo "$SORTED_PARTS"
file.txt.part1.sorted
file.txt.part3.sorted
file.txt.part4.sorted
This symptom is non-deterministic both in its occurrence (i.e. an execution for the same file.txt succeeds one time and fails another) and in which file goes missing (i.e. it is not always the same sorted part).
Problem
I have a race condition: all the sortu.sh instances finish, and 'xargs' sends EOF before their stdout is flushed.
Question
Is there a way to ensure stdout is flushed before 'xargs' sends EOF?
Constraints
I can use neither the 'parallel' command nor the '--parallel' option of the 'sort' command.
sortu.sh code
#!/bin/bash
SORTED="$1.sorted"
sort -u "$1" > "$SORTED"
echo "$SORTED"
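Since each '.sorted' name is derived mechanically from its part's name, one defensive variant (a sketch, not the asker's final fix; the inline `sh -c` worker stands in for sortu.sh) is to run the workers purely for their side effect and reconstruct the result list in the parent, so nothing depends on capturing the workers' stdout at all:

```shell
#!/bin/bash
# Sketch: derive the sorted-part names instead of capturing xargs stdout.
printf 'b\na\nb\n' > file.txt.part1
printf 'c\na\n'    > file.txt.part2
PARTS="file.txt.part1
file.txt.part2"

# Run the workers only for their side effect (creating the .sorted files);
# the inline 'sh -c' worker stands in for sortu.sh
echo "$PARTS" | xargs --max-args=1 --max-procs=4 \
    sh -c 'sort -u "$1" > "$1.sorted"' _

# Reconstruct the list deterministically in the parent
SORTED_PARTS=$(echo "$PARTS" | sed 's/$/.sorted/')
echo "$SORTED_PARTS"
```

Because xargs only returns once all its children have exited, the `.sorted` files are guaranteed to exist by the time the list is built, sidestepping any stdout interleaving or flushing concerns.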
Answer 1:
The approach below doesn't write intermediate contents out to disk at all, and parallelizes the split process, the sort processes, and the merge, performing all of them at once.
This version has been backported to bash 3.2; a version built for newer releases of bash wouldn't need eval.
#!/bin/bash
nprocs=5   # maybe call the nproc command instead?
fd_min=10  # on bash 4.1, can use automatic FD allocation instead

# create a temporary directory; delete on exit
tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX")
trap 'rm -rf "$tempdir"' 0

# close extra FDs and clear traps, before optionally executing another tool.
#
# Doing this in subshells ensures that only the main process holds write handles on the
# individual sorts, so that they exit when those handles are closed.
cloexec() {
    local fifo_fd
    for ((fifo_fd=fd_min; fifo_fd < (fd_min+nprocs); fifo_fd++)); do
        : "Closing fd $fifo_fd"
        # in modern bash, just: exec {fifo_fd}>&-
        eval "exec ${fifo_fd}>&-"
    done
    if (( $# )); then
        trap - 0
        exec "$@"
    fi
}

# For each parallel process:
# - Run a sort -u invocation reading from an FD and writing to a FIFO
# - Add the FIFO's name to a merge sort command
merge_cmd=(sort --merge -u)
for ((i=0; i<nprocs; i++)); do
    mkfifo "$tempdir/fifo.$i"          # create FIFO
    merge_cmd+=( "$tempdir/fifo.$i" )  # add to sort command line
    fifo_fd=$((fd_min+i))
    : "Opening FD $fifo_fd for sort to $tempdir/fifo.$i"
    # in modern bash: exec {fifo_fd}> >(cloexec; exec sort -u >"$tempdir/fifo.$i")
    printf -v exec_str 'exec %q> >(cloexec; exec sort -u >%q)' "$fifo_fd" "$tempdir/fifo.$i"
    eval "$exec_str"
done

# Run the big merge sort recombining output from all the FIFOs
cloexec "${merge_cmd[@]}" &
merge_pid=$!

# Split input stream out to all the individual sort processes...
awk -v "nprocs=$nprocs" \
    -v "fd_min=$fd_min" \
    '{ print $0 >("/dev/fd/" (fd_min + (NR % nprocs))) }'

# ...when done, close handles on the FIFOs, so their sort invocations exit
cloexec

# ...and wait for the merge sort to exit
wait "$merge_pid"
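For reference, here is a sketch of how the same FIFO pattern looks on bash >= 4.1, where the automatic FD allocation mentioned in the comments above (`exec {fd}> ...`) removes the need for eval. The worker count, sample data, and file names here are illustrative, and the `close_fds` helper plays the role of the answer's `cloexec`:

```shell
#!/bin/bash
# Sketch (assumes bash >= 4.1): round-robin input across parallel sort -u
# workers feeding FIFOs, merged by a single 'sort --merge -u'.
set -e
nprocs=2
tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX")
trap 'rm -rf "$tempdir"' 0

merge_cmd=(sort --merge -u)
fds=()   # write-end FDs feeding each worker's stdin

# Close every worker FD opened so far; run inside subshells so that only
# the main process keeps write handles on the workers' pipes.
close_fds() {
    local f
    for f in "${fds[@]}"; do
        exec {f}>&-
    done
}

for ((i=0; i<nprocs; i++)); do
    mkfifo "$tempdir/fifo.$i"
    merge_cmd+=( "$tempdir/fifo.$i" )
    # automatic FD allocation: bash stores the new FD number in $fd
    exec {fd}> >(close_fds; exec sort -u > "$tempdir/fifo.$i")
    fds+=( "$fd" )
done

# Merge the workers' outputs; the subshell drops its inherited write ends.
( close_fds; exec "${merge_cmd[@]}" ) > out.sorted &
merge_pid=$!

# Sample input with duplicates (illustrative)
printf '%s\n' banana apple banana cherry apple > input.txt

# Round-robin the input lines across the workers' /dev/fd entries
awk -v fdlist="${fds[*]}" \
    'BEGIN { n = split(fdlist, fd, " ") }
     { print $0 > ("/dev/fd/" fd[(NR % n) + 1]) }' < input.txt

close_fds          # workers now see EOF on stdin and finish
wait "$merge_pid"
```

The shutdown ordering is the crux: only after every copy of a worker's write end is closed does that worker's `sort -u` see EOF, flush its output into its FIFO, and let the merge complete.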
Source: https://stackoverflow.com/questions/31926950/explicit-sort-parallelization-via-xargs-incomplete-results-from-xargs-max-p