Question
This must surely be a trivial task with awk or otherwise, but it's left me scratching my head this morning. I have a file with a format similar to this:
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> AIQLTGK 8 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR 2 genes ADUm.2146,ADUm.5750
I would like to print a line for each distinct value of the peptides in column 2, meaning the above input would become:
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
This is what I've tried so far, but clearly neither does what I need:
awk '{print $2}' file | sort | uniq
# Prints only the peptides...
awk '{print $0, "\t", $1}' file |sort | uniq -u -f 4
# Altogether omits peptides which are not unique...
One last thing: it will need to treat peptides which are substrings of other peptides as distinct values (e.g. VSSILED and VSSILEDKILSR). Thanks :)
Answer 1:
One way using awk:
awk '!array[$2]++' file.txt
Results:
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
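The one-liner above is terse, so here is a rough expanded sketch of what it appears to do: the array counts how often each value of field 2 has been seen, and a line is printed only when that count is still zero. The function name first_per_peptide and the array name seen are made up for this illustration; the sample data is the input from the question.

```shell
# Print each input line only the first time its second field appears.
# first_per_peptide is a hypothetical helper name used for this sketch.
first_per_peptide() {
    awk '{
        if (!($2 in seen)) {   # true only for a peptide not met before
            print              # emit the whole line
        }
        seen[$2] = 1           # remember the peptide as an exact string key
    }' "$@"
}

# Sample input copied from the question.
result=$(first_per_peptide <<'EOF'
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> AIQLTGK 8 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR 2 genes ADUm.2146,ADUm.5750
EOF
)
printf '%s\n' "$result"
```

Because awk array subscripts are compared as whole strings, a peptide that is a substring of another (VSSILED versus VSSILEDKILSR) gets its own key and is treated as distinct. The input order is preserved, and the first line seen for each peptide is the one kept.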
Answer 2:
Just use sort:
sort -k 2,2 -u file
The -u removes duplicate entries (as you wanted), and the -k 2,2 makes just field 2 the sorting field (and so ignores the rest when checking for duplicates). Note that the output comes back ordered by field 2 rather than in the original input order.
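As a quick check of how this behaves, the sketch below runs the same command on the sample input from the question. Setting LC_ALL=C is an assumption added here so that the ordering is byte-wise and reproducible; the variable name sorted is only for illustration.

```shell
# Run sort with field 2 as the only key and -u to keep one line per key.
# LC_ALL=C is an assumption of this sketch: it makes the ordering byte-wise.
sorted=$(LC_ALL=C sort -k 2,2 -u <<'EOF'
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> AIQLTGK 8 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR 2 genes ADUm.2146,ADUm.5750
EOF
)
printf '%s\n' "$sorted"
```

This gives one line per peptide, with VSSILEDKILSR now ahead of VSSILEDKTT because the lines are sorted. Which of the duplicate lines survives is best not relied on, so this approach fits when any representative line per peptide will do.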
Answer 3:
I would use Perl for this:
perl -nae 'print unless exists $seen{$F[1]}; undef $seen{$F[1]}' < input.txt
The -n switch processes the input line by line, and the -a switch splits each line on whitespace into the @F array, so $F[1] is the second field.
Answer 4:
awk '{if($2==temp){next;}else{print}temp=$2}' your_file
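Tested below. Note that this compares each line only with the line before it, so only consecutive duplicates are dropped; the AIQLTGK 10 line survives because it does not directly follow another AIQLTGK line: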
> awk '{if($2==temp){next;}else{print}temp=$2}' temp
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
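If the previous-line comparison is the approach you want, one possible workaround is to group equal peptides together first. The sketch below assumes a sort that supports the -s (stable) option, which GNU and BSD sort provide but POSIX does not require; dedupe_adjacent and grouped are hypothetical names, and LC_ALL=C is set only to make the ordering reproducible.

```shell
# Keep a line only when its second field differs from the previous line.
# dedupe_adjacent is a hypothetical helper name used for this sketch.
dedupe_adjacent() {
    awk 'NR == 1 || $2 != prev { print } { prev = $2 }' "$@"
}

# Stable-sort on field 2 so equal peptides become adjacent while keeping
# their original relative order, then drop the adjacent repeats.
grouped=$(LC_ALL=C sort -s -k 2,2 <<'EOF' | dedupe_adjacent
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> AIQLTGK 8 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR 2 genes ADUm.2146,ADUm.5750
EOF
)
printf '%s\n' "$grouped"
```

With the stable sort, the first occurrence of each peptide in the input is the one that gets printed, at the cost of the output being ordered by peptide instead of by input position.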
Source: https://stackoverflow.com/questions/12052633/output-whole-line-once-for-each-unique-value-of-a-column-bash