问题
I have a program that I can run in two ways: single-end or paired-end mode. Here's the syntax:
program <output-directory-name> <input1> [input2]
Where the output directory and at least one input is required. If I wanted to run this on three files, say, sample A, B, and C, I would use something like find with xargs or parallel:
user@host:~/single$ ls
sampleA.txt sampleB.txt sampleC.txt
user@host:~/single$ find . -name "sample*" | xargs -i echo program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt
user@host:~/single$ find . -name "sample*" | parallel --dry-run program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt
But when I want to run the program in "paired-end" mode, I need to give it two inputs. These are related files, but they can't simply be concatenated - you have to run the program with both as inputs. Files are named sensibly, e.g., sampleA_1.txt and sampleA_2.txt.
I want to be able to create this easily on the command line with something like xargs (or preferably parallel):
user@host:~/paired$ ls
sampleA_1.txt sampleB_1.txt sampleC_1.txt
sampleA_2.txt sampleB_2.txt sampleC_2.txt
user@host:~/paired$ find . -name "sample*_1.txt" | sed/awk? | parallel ?
program ./sampleA-out ./sampleA_1.txt ./sampleA_2.txt
program ./sampleB-out ./sampleB_1.txt ./sampleB_2.txt
program ./sampleC-out ./sampleC_1.txt ./sampleC_2.txt
Ideally, the command would strip off the _1.txt to create the output directory name (sampleA-out, etc), but I really need to be able to take that argument and change the _1 to a _2 for the second input.
I know this is dead simple with a script - I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner.
Thanks in advance.
回答1:
I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner.
Perl has one-liners, too, just as sed
and awk
do. You can write:
find . -name "sample*_1.txt" | perl -pe 's/_1\.txt$//' | parallel program {}-out {}_1.txt {}_2.txt
(The -e
flag means "the next argument is the program text"; the -p
flag means "the program should be run in loop; for each line of input, set $_
to that line, then run the program, then print $_
".)
回答2:
With sed
and xargs
you could do something like this:
find . -name "sample*_1.txt" | sed -n 's/_1\..*$//;h;s/$/_out/p;g;s/$/_1.txt/p;g;s/$/_2.txt/p' | xargs -L 3 echo program
I.e.: sed
creates the three arguments and xargs -L 3
composes commands lines with three arguments.
回答3:
Assuming you always have exactly 2 files in your directory for each pair and assuming they get sorted the right way by find
(this you can ensure by piping results of find
through sort
), maybe xargs -l 2
would do the job. This tells xargs
to place 2 consecutive incoming parameters on each command line it executes.
回答4:
A shorter version:
parallel --xapply program {1.}.out {1} {2} :::: <(ls *_1.txt) <(ls *_2.txt)
but this only works if every _1.txt has a matching _2.txt and vice versa.
来源:https://stackoverflow.com/questions/9688659/change-text-in-argument-for-xargs-or-gnu-parallel