Find all patterns in a multifasta file, including overlapping motifs

Submitted by 对着背影说爱祢 on 2021-02-18 19:01:54

Question


I have a multifasta file that looks like this:

>NP_001002156.1
MKTAVDRRKLDLLYSRYKDPQDENKIGVDGIQQFCDDLMLDPASVSVLIVAWKFRAATQCEFSRQEFLDG
MTDLGCDSPEKLKSLLPRLEQELKDSGKFRDFYRFTFSFAKSPGQKCLDLEMAVAYWNLILSGRFKFLGL
WNTFLLEHHKKSIPKDTWNLLLDFGNMIADDMSNYAEEGAWPVLIDDFVEFARPIVTAENLQTL
>NP_957070.2
MAKDAGLKETNGEIKLFINQSPGKAAGVLQLLTVHPASITTVKQILPKTLTVTGAHVLPHMVVSTPQRPT
IPVLLTSPHTPTAQTQQESSPWSSGHCRRADKSGKGLRHFSMKVCEKVQKKVVTSYNEVADELVQEFSSA
DHSSISPNDAVSSCHVYDQKNIRRRVYDALNVLMAMNIISKDKKEIKWIGFPTNSAQECEDLKAERQRRQ
ERIKQKQSQLQELIVQQIAFKNLVQRNREVEQQSKRSPSANTIIQLPFIIINTSKKTIIDCSISNDKFEY
LFNFDSMFEIHDDVEVLKRLGLALGLESGRCSAEQMKIATSLVSKALQPYVTEMAQGSVNQPMDFSHVAA
ERRASSSTSSRVETPTSLMEEDEEDEEEDYEEEDD
>NP_123456.1
MALLLLLGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
...

Although there is a great Python script for motif searches in a multifasta file (https://www.biostars.org/p/14305/), when the pattern "[KHR]{3}" is used it returns only a list of motifs, with many empty results:

>NP_001002156.1
:['RRK']
>NP_001002156.1
:[]
>NP_001002156.1
:['HHK']
>NP_957070.2
:[]
>NP_957070.2
:['RRR']
...

and some motifs (e.g. HKK) were missed within the same sequence.

Here I found another Python script:

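# coding: utf-8
import re

pattern = "[KHR]{3}"
with open('seq.fasta') as fh:
    fh.readline()  # skip the first header line
    seq = ""
    for line in fh:
        seq += line.strip()  # note: every later line (headers included) is appended too
rgx = re.compile(pattern)
result = rgx.search(seq)  # search() returns only the first match, or None
if result is None:
    print("no match found")
else:
    patternfound = result.group()
    span = result.span()
    # start of the 10-residue context before the match, clamped at 0
    leftpos = span[0] - 10
    if leftpos < 0:
        leftpos = 0
    print(seq[leftpos:span[0]].lower() + patternfound + seq[span[1]:span[1]+10].lower())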

It returns only the first matched motif, shown in context (the 10 amino acids before and the 10 amino acids after the match), and only for the first sequence in the file. For the first FASTA sequence, NP_001002156.1, the script returns:

mktavdRRKldllysrykd

but it lacks the header ">NP_001002156.1", and the other two motifs in context are omitted:

glwntfllehHHKksipkdtwnl
lwntfllehhHKKsipkdtwnll

I want a script that returns every matched motif, with its position and its context, for each sequence in the multifasta file, and presents the results as follows:

>NP_001002156.1_matchnumber_1_(7~9)
mktavdrRRKldllysrykd
>NP_001002156.1_matchnumber_2_(148~150) 
glwntfllehHHKksipkdtwnl
>NP_001002156.1_matchnumber_3_(149~151)
lwntfllehhHKKsipkdtwnll
>NP_957070.2_matchnumber_1_(163~165)
chvydqknirRRRvydalnvlma
>NP_123456.1
no match found

Note: The position is that of the matched pattern, not of the context.

Could anyone help me? Thanks in advance.


Answer 1:


The "motif" here is any three-long combination of [HKR] characters; motifs may overlap.

The overlapping is resolved by using a "lookahead" in the regex; see details below. Neither of the resources linked or shown in the question seems to handle that, and I don't see how they would catch overlapping motifs.

use warnings;
use strict;
use feature 'say';

my $file = shift || die "Usage: $0 fasta-file\n";    
open my $fh, '<', $file or die "Can't open $file: $!";

my ($seq, $seq_name);
while (<$fh>) {
    chomp;
    if (/^>(.*)/) {
        # Process the previous assembled sequence
        if ($seq) {
            proc_seq($seq_name, $seq);
            $seq = ''; 
        }
        $seq_name = $1; 
        next;
    }   
    $seq .= $_; 
}
# Process the last one    
proc_seq($seq_name, $seq);

sub proc_seq {
    my ($seq_name, $seq, $multiline) = @_; 

    # Build output in the loop, as motifs are found. By default, print all
    # output for one seq_name in one line. To print each motif on its own
    # line instead, invoke this sub with a true third argument (1 will do).
    my $output = ">$seq_name";

    my $cnt = 0;
    while ($seq =~ /([HKR])(?=([HKR]{2}))/g) { 
        ++$cnt;
        my $motif = $1 . $2; 
        my $pos = pos($seq);
        my $pre_context = ($pos >= 11) 
            ? substr($seq, $pos-11, 10) 
            : substr($seq, 0,       $pos-1);
        my $post_context = substr $seq, $pos+2, 10;

        $output .= " n$cnt($pos~" . ($pos+2) . ") ";
        $output .= "\n"  if $multiline;
        $output .= lc($pre_context) . $motif . lc($post_context);
    } 
    say ($cnt > 0  ? $output  : $output . ' no match found');
}

Note on the regex: we need a lookahead for the second and third character in order to be able to catch the overlapping motifs as well.

An example. There is HHKK in the first sequence, with overlapping motifs HHK and HKK. If the regex matches HHK using /[HKR]{3}/, then the regex engine's position in the string is after the first K, as it "consumed" HHK. All it can see next is a single K, so there is no [HKR]{3} left to match there and it misses the second motif.

So, instead, I match only one letter and do a "lookahead" for the next two. After matching H (and "seeing" that HK indeed follows), only that one letter is consumed: the engine has moved past the first H only and is positioned before the second H for the next match. It can then match HKK in the same manner, and so it keeps matching even multiply overlapping motifs.
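
The same contrast can be reproduced in Python, the language of the scripts in the question. This is only a minimal sketch using the standard re module: the helper names consuming_matches and overlapping_matches are made up for illustration, and the variant shown wraps the whole three-letter motif in a zero-width lookahead (nothing is consumed at all), rather than consuming one letter and looking ahead for two as the Perl regex does. Positions are reported 1-based.

```python
import re

def consuming_matches(seq, motif="[HKR]{3}"):
    """Plain matching: each hit consumes its letters, so overlaps are lost."""
    return [(m.start() + 1, m.group()) for m in re.finditer(motif, seq)]

def overlapping_matches(seq, motif="[HKR]{3}"):
    """Wrap the motif in a zero-width lookahead so nothing is consumed;
    the engine then retries from the very next letter."""
    rgx = re.compile("(?=(" + motif + "))")
    return [(m.start() + 1, m.group(1)) for m in rgx.finditer(seq)]

fragment = "LLEHHKKSIP"  # the HHKK stretch of NP_001002156.1 with a few neighbours
print(consuming_matches(fragment))    # [(4, 'HHK')]
print(overlapping_matches(fragment))  # [(4, 'HHK'), (5, 'HKK')]
```

The first call reports only HHK, because the scan resumes after the consumed letters; the second also reports HKK, starting one letter later.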

This identifies everything indicated in the desired output (which has a typo); note the change in the requirements in the comment, to print all motifs for one sequence on one line. So it prints

>NP_001002156.1 n1(7~9) mktavdRRKldllysrykd n2(148~150) lglwntflleHHKksipkdtwnl n3(149~151) glwntfllehHKKsipkdtwnll
>NP_957070.2 n1(163~165) schvydqkniRRRvydalnvlma
>NP_bogus_with_no_motifs  no match found

with all motifs for the same sequence name on one line, as wanted. I've added a bogus entry with no motifs to the input, to test the "no match found" addition; it produced the last line of the output above.


There is still an option to print each motif on a separate line, as originally wanted: invoke the proc_seq function with an additional, true third argument, like

proc_seq($seq_name, $seq, 1)

and then it'll print

>NP_001002156.1 n1(7~9) 
mktavdRRKldllysrykd n2(148~150) 
lglwntflleHHKksipkdtwnl n3(149~151) 
glwntfllehHKKsipkdtwnll
>NP_957070.2 n1(163~165) 
schvydqkniRRRvydalnvlma
>NP_bogus_with_no_motifs  no match found
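
For readers who would rather stay in Python, as in the question, the same approach can be sketched roughly as follows. This is not the Perl script above but a loose port under stated assumptions: the function names read_fasta and motif_report and the flank parameter are hypothetical, the header format copies the layout the question asked for, and the demo records are made-up input rather than a real file.

```python
import re

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)   # emit the previous record
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)           # emit the last record

def motif_report(name, seq, motif="[HKR]{3}", flank=10):
    """Return output lines for one sequence, overlapping hits included."""
    rgx = re.compile("(?=(" + motif + "))")   # zero-width lookahead
    out = []
    for n, m in enumerate(rgx.finditer(seq), start=1):
        start, hit = m.start(), m.group(1)    # start is 0-based
        end = start + len(hit)                # 0-based exclusive = 1-based last
        pre = seq[max(0, start - flank):start].lower()
        post = seq[end:end + flank].lower()
        out.append(">%s_matchnumber_%d_(%d~%d)" % (name, n, start + 1, end))
        out.append(pre + hit + post)
    return out or [">" + name, "no match found"]

# Made-up demo input; the first record is the start of NP_001002156.1.
demo = """>NP_demo_1
MKTAVDRRKLDLLYSRYKDPQ
>NP_demo_2
MALLLLLGGGG
""".splitlines()

for name, seq in read_fasta(demo):
    print("\n".join(motif_report(name, seq)))
```

To run it on a real file, pass an open file handle to read_fasta instead of the demo list, since the function only needs an iterable of lines.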


Source: https://stackoverflow.com/questions/54140487/find-all-patterns-in-a-multifasta-file-including-overlapping-motifs
