Question
In bash (4.3.46(1)) I have some multi-line, so-called FASTA records, where each record starts with a line containing >name, followed by lines of DNA sequence ([AGCTNacgtn]). Here are three records:
>chr1
AGCTACTTTT
AGGGNGGTNN
>chr2
TTGNACACCC
TGGGGGAGTA
>chr3
TGACGTGGGT
TCGGGTTTTT
How do I use grep in bash to get the second record? In other languages one might use:
>chr2\n([AGCTNagctn]*\n)*
In bash I tried to use the ideas from here (among other SO posts). This did not work:
grep -zo '>chr2[AGCTNacgtn]+' file
Result should be:
>chr2
TTGNACACCC
TGGGGGAGTA
SOLUTION
On my system this was the solution (almost Cyrus' answer below, i.e. without the pipe to a second grep .):
grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file
Answer 1:
With GNU grep:
grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file | grep .
Output:
>chr2
TTGNACACCC
TGGGGGAGTA
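Some background on why this works where the attempt in the question did not: without -P (or -E), GNU grep treats + as a literal character, and the pattern contained no \n to cross line boundaries. With -z the whole file is read as one NUL-terminated record, so \n can be matched, but the match is also printed with a trailing NUL byte. The following is a small sketch, assuming GNU grep built with PCRE support, that strips that byte with tr instead of a second grep; the file name sample.fa is only illustrative.

```shell
# Recreate the three sample records from the question (illustrative file name).
printf '>chr1\nAGCTACTTTT\nAGGGNGGTNN\n>chr2\nTTGNACACCC\nTGGGGGAGTA\n>chr3\nTGACGTGGGT\nTCGGGTTTTT\n' > sample.fa

# -z: read the file as one NUL-terminated record, so \n may appear in the pattern
# -P: Perl-compatible regular expressions
# -o: print only the matched part
# tr -d '\0' removes the NUL byte that grep -z appends to the match.
grep -Pzo '>chr2\n[AGCTNacgtn\n]+' sample.fa | tr -d '\0'
```

The character class includes \n, so the match runs across sequence lines and stops at the next > because > is not in the class.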
Answer 2:
You can use awk with a custom RS (a regular expression as RS is supported by GNU awk, but is not guaranteed by POSIX awk):
awk -v n=2 -v RS='(^|\n)>' 'NR==n+1{print ">" $0}' file
>chr2
TTGNACACCC
TGGGGGAGTA
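If the record is easier to identify by name than by position, a line-based variant avoids changing RS altogether and should run in any POSIX awk. This is a minimal sketch, assuming each header line is exactly >name with nothing after it; fasta_record and sample.fa are illustrative names, not part of the answer above.

```shell
# Recreate the three sample records from the question (illustrative file name).
printf '>chr1\nAGCTACTTTT\nAGGGNGGTNN\n>chr2\nTTGNACACCC\nTGGGGGAGTA\n>chr3\nTGACGTGGGT\nTCGGGTTTTT\n' > sample.fa

# fasta_record NAME FILE
# Prints the header and sequence lines of the record whose header is ">NAME".
fasta_record() {
  awk -v name="$1" '
    /^>/ { keep = ($0 == (">" name)) }  # on every header line, switch printing on or off
    keep                                # print the line while inside the wanted record
  ' "$2"
}

fasta_record chr2 sample.fa
```

Because the header is compared as a whole line, chr1 does not accidentally match chr10, and records with any number of sequence lines are handled.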
Answer 3:
You could install the FAST Perl package. It contains many utilities that can be used directly from the shell for working with FASTA files, such as fashead and fastail (and many more).
After installing it, this is as simple as:
fashead -n2 fastafile | fastail -n1
Output:
>chr2
TTGNA.....
Or even simpler:
fasgrep chr2 fastafile
with the same output.
Answer 4:
Try this (it works here because every record has exactly two sequence lines):
grep 'chr2' -A 2 file
>chr2
TTGNACACCC
TGGGGGAGTA
Answer 5:
The best tool for working with multi-line records is awk.
In your case:
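awk 'BEGIN{RS=">"} NR==3 {printf "%s%s", RS, $0}' input.txt
input.txt
>chr1
AGCTACTTTT
AGGGNGGTNN
>chr2
TTGNACACCC
TGGGGGAGTA
>chr3
TGACGTGGGT
TCGGGTTTTT
Explanation:
BEGIN{RS=">"}
Initially set the record separator to ">".
NR==3
Select the second FASTA record. The file starts with ">", so awk's record #1 is the empty text before the first ">", chr1 is record #2, and chr2 is record #3.
{printf "%s%s", RS, $0}
Print the record with the record separator that awk stripped put back in front. printf is used instead of print because the record already ends with a newline.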
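When RS is a single character and the input begins with that character, awk yields an empty first record, so awk's record numbers run one ahead of the FASTA records. A short sketch to check the numbering on a given awk; list_records is a hypothetical helper name and the sequences are shortened sample data.

```shell
# list_records: print awk's record number and the first field (the record name)
# for input split on ">". Reads files given as arguments, or stdin if none.
list_records() {
  awk 'BEGIN { RS = ">" } { printf "%d:%s\n", NR, $1 }' "$@"
}

printf '>chr1\nAGCT\n>chr2\nTTGN\n>chr3\nTGAC\n' | list_records
# 1:
# 2:chr1
# 3:chr2
# 4:chr3
```

The empty line for record 1 is the text before the first ">", which is why the second FASTA record has NR equal to 3.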
Source: https://stackoverflow.com/questions/43398350/grep-bash-multi-line-pattern