Question
Assume a text file (file1) that contains m lines of alphabetic strings S (S_1, S_2, ..., S_m). Each S is preceded by a short alphanumeric string that acts as a barcode (here: foo1, bar7, baz3). The alphabetic strings S are all identical in length. Each S is separated from its preceding barcode by a single space.
$ cat file1
foo1 abcdefghijklmnopqrstuvwxyz
bar7 abcdefghijklmnopqrstuvwxyz
baz3 abcdefghijklmnopqrstuvwxyz
Assume a second file (file2) that contains n specifications of column ranges R (R_1, R_2, ..., R_n). The column ranges are on a single line and separated by whitespace. Each R_x is shorter than S. The combined length of the ranges (i.e., R_1 + R_2 + ... + R_n) is also shorter than S. None of the ranges overlap or are subsets of one another.
$ cat file2
2-11 14-19 23-24
Following this excellent answer, I understand that I can extract the first range (i.e., R_1) of all S via the following awk command, while keeping the barcodes correctly assigned:
awk 'NR==FNR{start=$1;lgth=$2-$1+1;next} {print $1, substr($2,start,lgth)}' FS='-' file2 FS=' ' file1
However, I am uncertain how to extend the awk code to loop over all other ranges (here: R_2 and R_3) and append each to the growing matrix.
$ sought_outcome
foo1 bcdefghijknopqrswx
bar7 bcdefghijknopqrswx
baz3 bcdefghijknopqrswx
Edit: For better understanding, here is the sought output illustrated such that the concatenation points are emphasized by whitespaces:
2-11 14-19 23-24
foo1 bcdefghijk nopqrs wx
bar7 bcdefghijk nopqrs wx
baz3 bcdefghijk nopqrs wx
Answer 1:
awk to the rescue! Note that this performs no validation checks. (Here, range and file stand for the question's file2 and file1.)
$ awk 'NR==FNR { printf "%s", "key"
                 for (i=1; i<=NF; i++) {
                     split($i, x, "-")
                     start[i] = x[1]
                     end[i]   = x[2]
                     printf "%s", FS $i
                 }
                 print ""
                 n = NF
                 next
               }
               { printf "%s", $1
                 # iterate by index: "for (i in start)" has no guaranteed order
                 for (i=1; i<=n; i++)
                     printf "%s", FS substr($2, start[i], end[i]-start[i]+1)
                 print ""
               }' range file |
  column -t
key   2-11        14-19   23-24
foo1  bcdefghijk  nopqrs  wx
bar7  bcdefghijk  nopqrs  wx
baz3  bcdefghijk  nopqrs  wx
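Since the script above does no validation, a small pre-check of the range line could be run first. The following is only a sketch: the function name check_ranges is made up for illustration, it assumes the ranges are listed left to right as in the question, and it does not compare the ranges against the length of S.

```sh
# Hypothetical helper (not part of the original answer): sanity-check the
# range line before using it. Reads the named file, or stdin if none is given.
check_ranges() {
    awk '{
        prev_end = 0
        for (i = 1; i <= NF; i++) {
            # Each field must look like START-END, digits only.
            if ($i !~ /^[0-9]+-[0-9]+$/) { print "bad format: " $i; exit 1 }
            split($i, x, "-")
            s = x[1] + 0; e = x[2] + 0      # force numeric comparison
            if (s < 1 || s > e)  { print "bad bounds: " $i; exit 1 }
            # Assumes ranges are listed left to right, as in the question.
            if (s <= prev_end)   { print "overlap or out of order: " $i; exit 1 }
            prev_end = e
        }
    }' "$@"
}

printf '2-11 14-19 23-24\n' | check_ranges && echo "ranges look fine"
```

With the question's files this could be used as `check_ranges file2 && awk ... file2 file1`, so the extraction only runs when the range line passes.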
Alternatively, without the header and without separating the ranges:
$ awk 'NR==FNR { for (i=1; i<=NF; i++) {
                     split($i, x, "-"); start[i]=x[1]; end[i]=x[2]
                 }
                 n=NF; next
               }
               { printf "%s", $1 FS
                 for (i=1; i<=n; i++)
                     printf "%s", substr($2, start[i], end[i]-start[i]+1)
                 print ""
               }' range file |
  column -t
foo1  bcdefghijknopqrswx
bar7  bcdefghijknopqrswx
baz3  bcdefghijknopqrswx
UPDATE: However, this is perhaps easier with cut/paste:
$ paste -d' ' <(cut -d' ' -f1 file) <(cut -d' ' -f2 file | cut -c$(tr ' ' ',' <range))
foo1 bcdefghijknopqrswx
bar7 bcdefghijknopqrswx
baz3 bcdefghijknopqrswx
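To see why the cut/paste one-liner works, it may help to look at the intermediate value: tr turns the space-separated ranges into the comma-separated list that cut -c expects. A minimal sketch, using an inline string in place of the range file (the variable names are only for illustration):

```sh
ranges='2-11 14-19 23-24'

# Turn "2-11 14-19 23-24" into "2-11,14-19,23-24", the list syntax of cut -c.
list=$(printf '%s\n' "$ranges" | tr ' ' ',')
echo "$list"

# Apply the list to one sample string from the question.
result=$(printf '%s\n' abcdefghijklmnopqrstuvwxyz | cut -c"$list")
echo "$result"
```

Note that cut always emits the selected characters in their original order, regardless of the order in the list. That matches the question here because the ranges are ascending and do not overlap; reordered ranges would need the awk approach.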
Answer 2:
What I came up with turned out to be almost exactly the same as @karakfa's second script, but I find the way he formats his code extremely hard to read, so I figured I'd post this anyway:
$ cat tst.awk
NR==FNR {
    for (i=1; i<=NF; i++) {
        split($i,range,/-/)
        beg[i] = range[1]
        end[i] = range[2]
    }
    numRanges = NF
    next
}
{
    printf "%s%s", $1, OFS
    for (i=1; i<=numRanges; i++) {
        printf "%s", substr($2,beg[i],(end[i]-beg[i])+1)
    }
    print ""
}
$ awk -f tst.awk file2 file1
foo1 bcdefghijknopqrswx
bar7 bcdefghijknopqrswx
baz3 bcdefghijknopqrswx
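If the spaced-out form from the question's edit is wanted as well, the same loop can take an optional separator. The sketch below is a variation, not the script above: the wrapper name extract and the variables sep and ranges are assumptions, and the ranges are passed as a string rather than read from a first file.

```sh
# Hypothetical wrapper: extract SEP "RANGES" [DATA_FILE...]
# SEP is placed between extracted pieces; pass '' to concatenate them.
extract() {
    sep=$1; ranges=$2; shift 2
    awk -v sep="$sep" -v ranges="$ranges" '
    BEGIN {
        # Parse "2-11 14-19 ..." once into start positions and lengths.
        n = split(ranges, spec, " ")
        for (i = 1; i <= n; i++) {
            split(spec[i], r, "-")
            beg[i] = r[1]
            len[i] = r[2] - r[1] + 1
        }
    }
    {
        out = ""
        for (i = 1; i <= n; i++)
            out = out (i > 1 ? sep : "") substr($2, beg[i], len[i])
        print $1, out
    }' "$@"
}

printf 'foo1 abcdefghijklmnopqrstuvwxyz\n' | extract ' ' '2-11 14-19 23-24'
```

With the question's files, `extract '' "$(cat file2)" file1` should give the concatenated output shown above, and `extract ' ' "$(cat file2)" file1` the version with the concatenation points visible.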
Source: https://stackoverflow.com/questions/47022297/extracting-column-ranges-and-reconstituting-matrix-via-awk