Explicit sort parallelization via xargs — Incomplete results from xargs --max-procs

Submitted by 一笑奈何 on 2019-12-13 19:41:52

Question


Context

I need to optimize deduplication using 'sort -u', but my Linux machine has an old implementation of the 'sort' command (version 5.97) that lacks the '--parallel' option. Although 'sort' implements parallelizable algorithms (e.g. merge sort), I need to make the parallelization explicit. I therefore do it by hand via the 'xargs' command, which outperforms the single 'sort -u' approach by roughly 2.5x ... when it works.

Here is the intuition behind what I am doing.

I run a bash script that splits an input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The parts are passed to the 'xargs' command, which performs parallel deduplication via the sortu.sh script (shown at the end). sortu.sh wraps the invocation of 'sort -u' and prints the resulting file name (e.g. "sortu.sh file.txt.part1" outputs "file.txt.part1.sorted"). The sorted parts are then passed to 'sort --merge -u', which merges and deduplicates them under the assumption that each part is already sorted.

The problem I am experiencing is in the parallelization via 'xargs'. Here is a simplified version of my code:

 AVAILABLE_CORES=4
 PARTS="file.txt.part1
 file.txt.part2
 file.txt.part3
 file.txt.part4"

 SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
                                      --max-procs=$AVAILABLE_CORES \
                                      bash sortu.sh \
               )
 ...
 #More code for merging the resulting parts $SORTED_PARTS
 ...

The expected result is a list of sorted parts in the variable SORTED_PARTS:

 echo "$SORTED_PARTS"
 file.txt.part1.sorted
 file.txt.part2.sorted
 file.txt.part3.sorted
 file.txt.part4.sorted

Symptom

However, sometimes a sorted part is missing. For instance, file.txt.part2.sorted:

 echo "$SORTED_PARTS"
 file.txt.part1.sorted
 file.txt.part3.sorted
 file.txt.part4.sorted

This symptom is non-deterministic both in its occurrence (an execution for the same file.txt succeeds one time and fails another) and in which file goes missing (it is not always the same sorted part).

Problem

I have a race condition: all the sortu.sh instances finish, and 'xargs' sends EOF before their stdout is flushed.

Question

Is there a way to ensure that stdout is flushed before 'xargs' sends EOF?

Constraints

I can use neither the parallel command nor the "--parallel" option of the sort command.
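One way to sidestep the symptom entirely (a sketch, not part of the original question) is to stop collecting file names from the children's stdout altogether: since each sorted part's name is derived deterministically from its input name, the parent can compute the list itself and use xargs only for the sorting work, so no child output can be lost:

```shell
#!/bin/bash
# Sketch: xargs only does the sorting; the parent derives the output names.
# Assumes the same file.txt.partN / sortu.sh naming as in the question.
AVAILABLE_CORES=4
PARTS="file.txt.part1
file.txt.part2
file.txt.part3
file.txt.part4"

# Children still write the .sorted files, but their stdout is discarded...
echo "$PARTS" | xargs --max-args=1 \
                      --max-procs="$AVAILABLE_CORES" \
                      bash sortu.sh > /dev/null

# ...and the parent computes the sorted-part names deterministically.
SORTED_PARTS=$(echo "$PARTS" | sed 's/$/.sorted/')
echo "$SORTED_PARTS"
```

Because xargs is waited on before sed runs, every .sorted file exists by the time the list is used.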

sortu.sh code

 #!/bin/bash

 SORTED="$1.sorted"          # output name derived from the input name
 sort -u "$1" > "$SORTED"    # quoting guards against whitespace in names
 echo "$SORTED"

Answer 1:


The approach below doesn't write intermediate contents to disk at all, and it parallelizes the split process, the sort processes, and the merge, performing all of them at once.

This version has been backported to bash 3.2; a version built for newer releases of bash wouldn't need eval.

#!/bin/bash

nprocs=5  # maybe call the nproc command instead?
fd_min=10 # on bash 4.1, can use automatic FD allocation instead

# create a temporary directory; delete on exit
tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX")
trap 'rm -rf "$tempdir"' 0

# close extra FDs and clear traps, before optionally executing another tool.
#
# Doing this in subshells ensures that only the main process holds write handles on the
# individual sorts, so that they exit when those handles are closed.
cloexec() {
    local fifo_fd
    for ((fifo_fd=fd_min; fifo_fd < (fd_min+nprocs); fifo_fd++)); do
        : "Closing fd $fifo_fd"
        # in modern bash; just: exec {fifo_fd}>&-
        eval "exec ${fifo_fd}>&-"
    done
    if (( $# )); then
        trap - 0
        exec "$@"
    fi
}

# For each parallel process:
# - Run a sort -u invocation reading from an FD and writing to a FIFO
# - Add the FIFO's name to a merge sort command
merge_cmd=(sort --merge -u)
for ((i=0; i<nprocs; i++)); do
  mkfifo "$tempdir/fifo.$i"               # create FIFO
  merge_cmd+=( "$tempdir/fifo.$i" )       # add to sort command line
  fifo_fd=$((fd_min+i))
  : "Opening FD $fifo_fd for sort to $tempdir/fifo.$i"
  # in modern bash: exec {fifo_fd}> >(cloexec; exec sort -u >"$tempdir/fifo.$i")
  printf -v exec_str 'exec %q> >(cloexec; exec sort -u >%q)' "$fifo_fd" "$tempdir/fifo.$i"
  eval "$exec_str"
done

# Run the big merge sort recombining output from all the FIFOs
cloexec "${merge_cmd[@]}" &
merge_pid=$!

# Split input stream out to all the individual sort processes...
awk -v "nprocs=$nprocs" \
    -v "fd_min=$fd_min" \
  '{ print $0 >("/dev/fd/" (fd_min + (NR % nprocs))) }'

# ...when done, close handles on the FIFOs, so their sort invocations exit
cloexec

# ...and wait for the merge sort to exit
wait "$merge_pid"
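Assuming the script above is saved as psort.sh (a name chosen here for illustration), it reads unsorted lines on stdin and writes the merged, deduplicated result on stdout:

```shell
# Hypothetical usage; 'psort.sh' is an illustrative name for the script above.
chmod +x psort.sh
./psort.sh < file.txt > file.txt.sorted
```

Since the splitting, sorting, and merging all run concurrently, no intermediate part files ever touch the disk.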


Source: https://stackoverflow.com/questions/31926950/explicit-sort-parallelization-via-xargs-incomplete-results-from-xargs-max-p
