Change text in argument for xargs (or GNU Parallel)

Question

I have a program that I can run in two ways: single-end or paired-end mode. Here's the syntax:

program <output-directory-name> <input1> [input2]

The output directory and at least one input are required. To run this on three files, say samples A, B, and C, I would use something like find with xargs or parallel:

user@host:~/single$ ls
sampleA.txt  sampleB.txt  sampleC.txt

user@host:~/single$ find . -name "sample*" | xargs -I {} echo program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt

user@host:~/single$ find . -name "sample*" | parallel --dry-run program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt

But when I want to run the program in "paired-end" mode, I need to give it two inputs. These are related files, but they can't simply be concatenated - you have to run the program with both as inputs. Files are named sensibly, e.g., sampleA_1.txt and sampleA_2.txt.

I want to be able to create this easily on the command line with something like xargs (or preferably parallel):

user@host:~/paired$ ls
sampleA_1.txt  sampleB_1.txt  sampleC_1.txt
sampleA_2.txt  sampleB_2.txt  sampleC_2.txt

user@host:~/paired$ find . -name "sample*_1.txt" | sed/awk? | parallel ?
program ./sampleA-out ./sampleA_1.txt ./sampleA_2.txt
program ./sampleB-out ./sampleB_1.txt ./sampleB_2.txt
program ./sampleC-out ./sampleC_1.txt ./sampleC_2.txt

Ideally, the command would strip off the _1.txt to create the output directory name (sampleA-out, etc), but I really need to be able to take that argument and change the _1 to a _2 for the second input.

I know this is dead simple with a script - I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner.

Thanks in advance.

Answer 1:

"I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner."

Perl has one-liners, too, just as sed and awk do. You can write:

find . -name "sample*_1.txt" | perl -pe 's/_1\.txt$//' | parallel program {}-out {}_1.txt {}_2.txt

(The -e flag means "the next argument is the program text"; the -p flag means "run the program in a loop: for each line of input, set $_ to that line, run the program, then print $_".)
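If Perl is not to hand, the same suffix stripping can be sketched with the shell's own ${var%suffix} expansion. This is only an illustration under assumptions: paired_args is a made-up helper name, the sample paths are fed in with printf rather than find, and echo is used so nothing is executed.

```shell
# paired_args is a hypothetical helper name, not part of xargs or parallel.
# Given a path ending in _1.txt, print the output directory and both inputs.
paired_args() {
    stem=${1%_1.txt}    # drop the trailing _1.txt, e.g. ./sampleA
    printf '%s-out %s_1.txt %s_2.txt\n' "$stem" "$stem" "$stem"
}

# Dry run on two sample paths: echo prints each command line
# instead of executing it.
printf '%s\n' ./sampleA_1.txt ./sampleB_1.txt |
while IFS= read -r first; do
    echo "program $(paired_args "$first")"
done
```

To use it on real files, replace the printf with find . -name "sample*_1.txt" and drop the echo once the printed commands look right.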

Answer 2:

With sed and xargs you could do something like this:

find . -name "sample*_1.txt" | sed -n 's/_1\..*$//;h;s/$/-out/p;g;s/$/_1.txt/p;g;s/$/_2.txt/p' | xargs -L 3 echo program

That is, sed emits three lines per sample (the output directory, the first input, and the second input), and xargs -L 3 joins each group of three lines into one command line.
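For readers who find the one-line sed script dense, here is the same idea spread over several lines with comments. It is a sketch, not a drop-in replacement: expand_pairs is a hypothetical function name, the suffix is written as -out to match the naming asked for in the question, and the input comes from printf so the example runs without any sample files.

```shell
# expand_pairs is a hypothetical name. It reads *_1.txt paths on stdin and
# prints three lines per path: output directory, first input, second input.
expand_pairs() {
    sed -n '
        # strip "_1.<ext>" so only the sample stem is left in pattern space
        s/_1\..*$//
        # save the stem in the hold space
        h
        # append -out and print the output directory name
        s/$/-out/p
        # restore the stem, then print the first input
        g
        s/$/_1.txt/p
        # restore the stem again, then print the second input
        g
        s/$/_2.txt/p
    '
}

# xargs -L 3 turns every three lines into one command line (dry run via echo).
printf '%s\n' ./sampleA_1.txt ./sampleB_1.txt | expand_pairs | xargs -L 3 echo program
```

The hold space is what lets the script reuse the stem three times: each g discards the previous suffix before the next one is appended.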

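Answer 3:

Assuming every sample has exactly two files in the directory, and assuming find lists them so that each pair is adjacent (you can ensure this by piping the output of find through sort), xargs -L 2 might do the job. It tells xargs to put two consecutive input lines on each command line it executes. (The number must follow -L as shown; with GNU xargs, -l 2 would treat the 2 as the command to run.) This supplies only the two inputs, so the output directory name still has to be generated separately.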
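A minimal sketch of this pairing idea, extended so the output directory is also produced. The assumptions are spelled out: pair_commands is a hypothetical helper name, file names contain no whitespace, every sample has both files, and xargs -n 2 (two arguments per command) is used together with a small inline sh script that derives the directory name from the first file. The echo keeps it a dry run.

```shell
# pair_commands is a hypothetical helper name. It reads file names on stdin,
# sorts them so each _1 file is directly followed by its _2 mate, and passes
# them to the inline script two at a time as $1 and $2.
pair_commands() {
    LC_ALL=C sort |
        xargs -n 2 sh -c 'echo program "${1%_1.txt}-out" "$1" "$2"' sh
}

# Sample input in scrambled order to show that sort restores the pairing.
printf '%s\n' ./sampleB_2.txt ./sampleA_1.txt ./sampleA_2.txt ./sampleB_1.txt |
    pair_commands
# prints:
# program ./sampleA-out ./sampleA_1.txt ./sampleA_2.txt
# program ./sampleB-out ./sampleB_1.txt ./sampleB_2.txt
```

If a mate is missing, every later pair shifts by one, which is why this approach depends on the directory being complete.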

Answer 4:

A shorter version:

parallel --xapply program {1.}.out {1} {2} :::: <(ls *_1.txt) <(ls *_2.txt)

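This only works if every _1.txt has a matching _2.txt and vice versa. Note also that {1.} removes only the extension of the first input, so the output name becomes sampleA_1.out rather than sampleA-out.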
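Because the two file lists are matched purely by position, it may be worth checking that the pairs are complete before running anything. The following is one possible pre-flight check, not part of GNU Parallel: check_pairs is a hypothetical helper name, and the sample*_1.txt / sample*_2.txt naming is taken from the question.

```shell
# check_pairs is a hypothetical helper: it reports any sample*_1.txt or
# sample*_2.txt in the given directory that lacks its mate, and returns
# non-zero if at least one file is unpaired.
check_pairs() {
    dir=$1
    status=0
    for f in "$dir"/sample*_1.txt; do
        [ -e "$f" ] || continue                  # glob matched nothing
        mate=${f%_1.txt}_2.txt
        [ -e "$mate" ] || { echo "missing $mate" >&2; status=1; }
    done
    for f in "$dir"/sample*_2.txt; do
        [ -e "$f" ] || continue                  # glob matched nothing
        mate=${f%_2.txt}_1.txt
        [ -e "$mate" ] || { echo "missing $mate" >&2; status=1; }
    done
    return "$status"
}

# Only go on to build the command lines when every file has a partner.
check_pairs . && echo "all pairs complete"
```

Running the check first turns a silent mispairing into an explicit error message naming the missing file.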

Source: https://stackoverflow.com/questions/9688659/change-text-in-argument-for-xargs-or-gnu-parallel

Tags

bash

sed

awk

parallel-processing

xargs