Question
I often need to find a particular sequence in a FASTA file and print it. For those who don't know, FASTA is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple: a line with the sequence name preceded by a '>', followed by the lines of the sequence itself, which run until the next '>'. For example:
>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG
The way I'm currently getting the sequence I need is to use grep with -A, so I'll do
grep -A 10 sequence_name filename.fa
and then if I don't see the start of the next sequence in the file, I'll change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.
It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?
Answer 1:
Using > as the record separator:
awk -v seq="sequence2" -v RS='>' '$1 == seq {print RS $0}' file
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
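As a small variation on the command above, here is a sketch that wraps it in a shell function. The name fasta_get and its argument order are made up for illustration and are not part of the original answer. It swaps print for printf, so the newline already sitting at the end of each record is not followed by a second one, and it relies on $1 == seq being an exact string comparison, so asking for sequence1 does not also pick up sequence10.

```shell
# Hypothetical helper (name and argument order are assumptions, not from the answer).
# Usage: fasta_get ID FILE
fasta_get() {
    # RS='>' makes every FASTA entry one awk record; with the default field
    # splitting, $1 is then the ID (the first word of the header line).
    # printf re-adds the '>' that the record separator consumed and avoids
    # the extra blank line that print would append.
    awk -v seq="$1" -v RS='>' '$1 == seq { printf ">%s", $0 }' "$2"
}
```

Called as fasta_get sequence2 filename.fa, this should print the header and sequence lines of that one record and nothing if the ID is absent.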
Answer 2:
Like this maybe:
awk '/^>sequence1/{p++;print;next} /^>/{p=0} p' file
So, when a line matches >sequence1, set a flag (p) to start printing, print this line and move on to the next. On subsequent lines, if the line starts with >, reset the p flag to stop printing. In general, print if the flag p is set.
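The flag logic described above can also be sketched with an exact comparison on the header instead of a regex. This is a variation, not the answer's own command: the function name fasta_get_flag is hypothetical, and it assumes the ID is the first whitespace-delimited word after '>' on the header line. Comparing $1 against '>' plus the ID means that asking for sequence1 will not also match >sequence10.

```shell
# Hypothetical helper (name and argument order are assumptions, not from the answer).
# Usage: fasta_get_flag ID FILE
fasta_get_flag() {
    awk -v id="$1" '
        # On every header line, turn printing on only if the ID matches exactly,
        # and off otherwise.
        /^>/ { p = ($1 == (">" id)) }
        # Print the current line while the flag is set.
        p
    ' "$2"
}
```

For example, fasta_get_flag sequence1 filename.fa should print the sequence1 header (including any description after the ID) and its sequence lines, stopping at the next header.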
Or, improving a little on your grep solution, use this to cut off the -A (after) context:
grep -A 999999 "sequence1" file | awk 'NR>1 && /^>/{exit} 1'
So, that prints the matching line and up to 999999 lines after it, and pipes them into awk. Awk then looks for a > at the start of any line after line 1, and exits if it finds one. Until then, the 1 causes awk to do its standard thing, which is to print the current line.
Answer 3:
Using sed only:
sed -n '/>sequence3/,/>/ p' file | sed '${/>/d;}'
Answer 4:
perl -0076 -lane 'print join("\n",@F) if $F[0]=~/sequence2/' file
Source: https://stackoverflow.com/questions/26144692/printing-a-sequence-from-a-fasta-file