Find multiple matches of this and that nucleotide sequence

Submitted by 六月ゝ 毕业季﹏ on 2019-12-11 12:09:35

Question


I want to find every occurrence of ATG...TAG or ATG...TAA. I have tried the following:

#!/usr/bin/perl
use warnings;
use strict; 

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(?=(ATG\w+?TAG|ATG\w+?TAA))/g){
    print "$1\n";           
} 

which gives:

ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAAATGAAAAATAG
ATGAAAAATAG

I want:

ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAA
ATGAAAAATAG

What am I doing wrong?


Answer 1:


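You are actually very close. Your pattern spells out two alternatives, ATG...TAG and ATG...TAA, when I think you really only want a single pattern that accepts either ending. Because ATG\w+?TAG is tried first at each ATG, the lazy \w+? keeps expanding past an earlier TAA until it reaches a TAG. I also don't think you need the lookahead.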

#!/usr/bin/perl
use warnings;
use strict;

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(ATG\w+?TA[AG])/g){
    print "$1\n";
}

# output
# ATGCCCCCCCCCCCCCTAG
# ATGAAAAAAAAAATAA
# ATGAAAAATAG

Piece by piece:

ATG matches a literal ATG

\w+? matches one or more word characters, as few as possible (lazy)

TA[AG] matches a literal TAA or TAG
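
To make the effect of each piece concrete, here is a minimal sketch that runs three variants of the pattern against the sequence from the question: the lazy character-class pattern from this answer, a greedy version with \w+ instead of \w+?, and the alternation from the question. The variable names (@lazy, @greedy, @alternation) are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = 'ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC';

# Lazy \w+?: each match stops at the first TAA or TAG after its ATG.
my @lazy = $seq =~ /(ATG\w+?TA[AG])/g;

# Greedy \w+: the match runs from the first ATG to the last TAA or TAG.
my @greedy = $seq =~ /(ATG\w+TA[AG])/g;

# Alternation from the question: ATG...TAG is tried first at every ATG,
# so an earlier TAA is skipped whenever a TAG follows later on.
my @alternation = $seq =~ /(?=(ATG\w+?TAG|ATG\w+?TAA))/g;

print scalar(@lazy),        " lazy match(es):\n";
print "  $_\n" for @lazy;
print scalar(@greedy),      " greedy match(es):\n";
print "  $_\n" for @greedy;
print scalar(@alternation), " alternation match(es):\n";
print "  $_\n" for @alternation;
```

On this input the lazy pattern should report three separate stretches, the greedy one a single long stretch, and the alternation the over-long second match shown in the question.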




Answer 2:


/(ATG\w+?TA[AG])/ works and is a bit more elegant than what FlyingFrog proposed ;-)

-bash-3.2$ perl
my $string = 'ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC';
my @matches = $string =~ /(ATG\w+?TA[AG])/g;
use Data::Dumper;
print Dumper \@matches;
$VAR1 = [
          'ATGCCCCCCCCCCCCCTAG',
          'ATGAAAAAAAAAATAA',
          'ATGAAAAATAG'
        ];



Answer 3:


At each ATG, your code tries the two alternatives in the order they are written, so it reports the ATG...TAG stretch whenever a TAG follows and falls back to ATG...TAA only when none does. If you removed all the TAGs from your sequence, you would find the stretches that end in TAA. By making two capture groups (one for ATG...TAG and one for ATG...TAA) you get both stretches for every ATG that is followed by both a TAG and a TAA.

#!/usr/bin/perl
use warnings;
use strict; 

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(?=(ATG\w+?TAG))(?=(ATG\w+?TAA))/g){ # two capture groups; both lookaheads must match at the same ATG
    print "$1\n";
    print "$2\n";           
} 

Output:

ATGCCCCCCCCCCCCCTAG
ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAA
ATGAAAAAAAAAATAAATGAAAAATAG
ATGAAAAAAAAAATAA

---- OR: ----

#!/usr/bin/perl
use warnings;
use strict; 

my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');

while($file =~ /(?=(ATG\w+?TA[AG]))/g){ 
    print "$1\n";
} 

Output:

ATGCCCCCCCCCCCCCTAG
ATGAAAAAAAAAATAA
ATGAAAAATAG

Depending on what exactly you're after...
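
One case where the choice matters is overlapping candidates. The sketch below uses a made-up sequence (not from the question) in which a second ATG sits inside the first stretch, and compares the consuming pattern with the lookahead version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sequence: a second ATG is nested inside the first candidate.
my $seq = 'ATGAATGCCTAA';

# Consuming match: the next search resumes after the previous match ends,
# so the nested ATG is never used as a starting point.
my @consuming = $seq =~ /(ATG\w+?TA[AG])/g;

# Lookahead match: zero-width, so every ATG position is examined.
my @overlapping = $seq =~ /(?=(ATG\w+?TA[AG]))/g;

print "consuming:   @consuming\n";    # ATGAATGCCTAA
print "overlapping: @overlapping\n";  # ATGAATGCCTAA ATGCCTAA
```

On the sequence from the question both forms give the same three matches, because no ATG starts inside an earlier match there.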



Source: https://stackoverflow.com/questions/18593776/find-multiple-matches-of-this-and-that-nucleotide-sequence
