extract sequences from multifasta file by ID in file using awk

流过昼夜 提交于 2019-12-01 15:50:18

问题


I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.

FASTA file seq.fasta:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT

IDs file id.txt:

7P58X:01332:11636
7P58X:01334:11613

I want to get the fasta file with only those sequences matching the IDs in the id.txt file:

>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC

I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:

(1)

awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta

this code works well for the multiline sequences but IDs have to be inserted separately to the code.

(2)

awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta

this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.

I guess that the good thing would be to modify the RS variable in the code (2) but all of my attempts failed so far. Can, please, anybody help me with that?


回答1:


$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC



回答2:


Following awk may help you on same.

awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt  fasta_file


来源:https://stackoverflow.com/questions/49731386/extract-sequences-from-multifasta-file-by-id-in-file-using-awk

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!