How to extract FASTA sequences from a file using sequence IDs in a different file?

Submitted by 空扰寡人 on 2019-12-08 11:29:38

Question


I have two files:

sequence.fasta - a big file with multiple FASTA sequences

ids.txt - consisting of sequence IDs in a tab-delimited format.

I want to extract the sequences in sequence.fasta whose IDs are listed in ids.txt and write them to another file.

A sample of sequence.fasta

>AUP4056.1
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLI

>XIM5213.2
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKD

A sample of ids.txt

AUP4056.1 GUP5213.2 ARD5364.5 HAE6893.7
JIK6023.5 YUP7086.9

I need output as follows

>AUP4056.1
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLI

>GUP5213.2
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKDKLFKKQLTFNDRTAKVHIVNDCLFIVSPSSFEL
YLQEKGESYDEECINNLQYEFQALGLHRKRIIKNDTINFWRCKVIGPKKESFLVGYLVPNTRLFFGDKIL
INNRHLLLEE

I have tried a Perl one-liner, but it is not working: it gives neither an error nor any output.

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt sequence.fasta

Could anybody help me correct this code, or suggest another Perl script?


Answer 1:


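The problem here is that one-liners are very hard to follow, understand and untangle. In this case, the one-liner stores each whole line of ids.txt as a single hash key, so with several tab-separated IDs per line no FASTA header ever matches.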

So write it out 'long hand':

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $id_file, '<', 'ids.txt' ) or die $!;
#use split here, to split any lines on whitespace. 
chomp ( my @ids = map { split } <$id_file> );
close ( $id_file );

my %sequences;

open ( my $input, '<', 'sequence.fasta' ) or die $!;
{
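   local $/ = '';    # paragraph mode: assumes records are separated by blank lines, as in the sample

   while ( <$input> ) {
      my ( $id, $sequence ) = m/>\s*(\S+)\n(.*)/ms;
      next unless defined $id;    # skip any chunk that is not a FASTA record
      $sequences{$id} = $sequence;
   }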
}

foreach my $id (@ids) {
   if ( $sequences{$id} ) {
      print ">$id\n";
      print "$sequences{$id}\n";
   }
}

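If you want to read the filenames from @ARGV instead of hard-coding them, unpack them like this and pass $ids_file and $sequence_file to the two open calls: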

my ( $ids_file, $sequence_file ) = @ARGV; 

I wouldn't try and compress this back into a one liner - you probably could, but it'll be quite hard to understand when you come back to it.
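
As a variation on the script above, here is a minimal sketch that streams sequence.fasta line by line instead of loading every record into a hash, so it should not depend on blank lines between records. The names read_ids and filter_fasta and the script name extract.pl are made up for illustration. Note that records come out in FASTA-file order rather than ids.txt order.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read whitespace-separated IDs from a filehandle into a lookup hash.
# (read_ids is a hypothetical helper name, not part of the original answer.)
sub read_ids {
    my ($fh) = @_;
    my %wanted;
    while ( my $line = <$fh> ) {
        # split ' ' ignores leading whitespace and the trailing newline
        $wanted{$_} = 1 for split ' ', $line;
    }
    return \%wanted;
}

# Copy only the wanted records from $in to $out, one line at a time.
# (filter_fasta is likewise a hypothetical name.)
sub filter_fasta {
    my ( $in, $out, $wanted ) = @_;
    my $keep = 0;
    while ( my $line = <$in> ) {
        # A header line decides whether the lines that follow are kept.
        if ( $line =~ /^>\s*(\S+)/ ) {
            $keep = exists $wanted->{$1};
        }
        print {$out} $line if $keep;
    }
    return;
}

# Assumed invocation: perl extract.pl ids.txt sequence.fasta > out.fasta
if ( @ARGV == 2 ) {
    my ( $ids_file, $sequence_file ) = @ARGV;

    open( my $id_fh, '<', $ids_file ) or die "Cannot open $ids_file: $!";
    my $wanted = read_ids($id_fh);
    close($id_fh);

    open( my $fa_fh, '<', $sequence_file ) or die "Cannot open $sequence_file: $!";
    filter_fasta( $fa_fh, \*STDOUT, $wanted );
    close($fa_fh);
}
```

Only the ID hash is held in memory, which may matter if sequence.fasta is very large.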




Answer 2:


If a one liner is what you want - which your post in fact suggests - this is what you could do:

perl -pe '$i=$1if/^>(\S+)/;map$i{$_}++,split;$i{$i}or$_=""' ids.txt sequence.fasta
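
To make the one-liner above easier to follow, here is a rough long-hand sketch of what it does to each input line. The function name process_line and the driver loop are invented for illustration; the logic is meant to mirror the three statements of the one-liner. One thing it makes visible: every whitespace-separated token of every line, FASTA lines included, ends up in the hash, so a header that carries extra words after the ID could mark unintended IDs as wanted.

```perl
use strict;
use warnings;

# Long-hand equivalent of the -p one-liner: returns the text that would be
# printed for one input line. $seen (hash ref) and $current (scalar ref)
# carry state between calls. process_line is a hypothetical name.
sub process_line {
    my ( $line, $seen, $current ) = @_;

    # $i=$1 if /^>(\S+)/ : remember the ID of the record we are inside
    $$current = $1 if $line =~ /^>(\S+)/;

    # map $i{$_}++, split : count every whitespace-separated token
    $seen->{$_}++ for split ' ', $line;

    # $i{$i} or $_="" : blank the line unless the current ID has been seen
    return ( defined $$current && $seen->{$$current} ) ? $line : '';
}

# Driver mirroring "perl -p": list ids.txt first, then the FASTA file.
if (@ARGV) {
    my %seen;
    my $current;
    while ( my $line = <> ) {
        print process_line( $line, \%seen, \$current );
    }
}
```

Because the ID file is read first, its lines are all blanked (no header has been seen yet) while its tokens are recorded for the FASTA pass that follows.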


Source: https://stackoverflow.com/questions/49487007/how-to-extract-fasta-sequences-from-a-file-using-sequence-ids-in-adifferent-file
