Improving code design of DNA alignment degapping

This is a question regarding a more efficient code design:

Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.

# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.

# Desired output
align = {"seq1":"ATGCATGC", 
         "seq2":"ATGC",
         "seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,1], "gene2":[2,3]},
         "seq3":{"gene1":[0,1], "gene2":[2,3]}}

Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.

1  measure length of any aligned gene  # take any seq, since all seqs aligned
2  list_lengths = list of gene lengths  # order is important
3  for seq in alignment
4      outseq = ""
5      for each num in range(0, length(seq))  # weird for-loop is intentional
6          if seq[num] == "-"
7              current_gene = gene whose start/stop positions include num
8              subtract 1 from length of current_gene
9              subtract 1 from lengths of all genes following current_gene in list_lengths
10         else
11             append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable

Can anyone give me suggestions/examples for a shorter, more direct code design?

This answer handles your updated annos dictionary from the comment to cdlanes answer. That answer leaves the annos dictionary with the incorrect index of [2,1] for seq2 gene2. My proposed solution will remove the gene entry from the dictionary if the sequence contains ALL gaps in that region. Also to note, if a gene contains only one letter in the final align, then anno[geneX] will have equal indices for start and stop --> See seq3 gene1 from your commented annos.

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}


annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

import re
for name,anno in annos.items():
    # indices of gaps removed usinig re
    removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]

    # removes gaps from align dictionary
    align[name] = re.sub(r'-', '', align[name])

    build_dna = ''
    for gene,inds in anno.items():

        start_ind = len(build_dna)+1

        #generator to sum the num '-' removed from gene
        num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])

        # build the de-gapped string
        build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")

        end_ind = len(build_dna)

        if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
            del annos[name][gene] #remove the gene entry
            continue
        #update the values in the annos dictionary
        annos[name][gene][0] = start_ind-1
        annos[name][gene][1] = end_ind-1

Results:

In [3]: annos
Out[3]:  {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
          'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
          'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}

Results from the 3 gene annos above. Just replace the annos variable:

In [5]: annos3
Out[5]:  {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
          'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
          'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}

The following matches the output of example program for both test cases:

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}

(START, STOP) = (0, 1)

for alignment, sequence in align.items():
    new_sequence = ''
    gap = 0

    for position, codon in enumerate(sequence):
        if '-' == codon:
            for gene in annos[alignment].values():
                if gene[START] > (position - gap):
                    gene[START] -= 1
                if gene[STOP] >= (position - gap):
                    gene[STOP] -= 1
            gap += 1
        else:
            new_sequence += codon

    align[alignment] = new_sequence

The result of running this:

python3 -i test.py
>>> align
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'}
>>> 
>>> annos
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
>>>

I hope this is still elegant, direct, short and Pythonic enough!

My own solution to the above question is neither elegant nor Pythonic, but arrives at the desired output. Any recommendations for improvement are highly welcome!

import collections
import operator
# measure length of any aligned gene  # take any seq, since all seqs aligned
align_len = len(align.itervalues().next())
# initialize output
align_out, annos_out = {}, {}
# loop through annos
for seqname, anno in annos.items():
# operate on ordered sequence lengths instead on ranges
    ordseqlens = collections.OrderedDict()
# generate ordered sequence lengths
    for k,v in sorted(anno.items(), key=operator.itemgetter(1)):
        ordseqlens[k] = v[1]-v[0]+1
# start (and later append to) sequence output
    align_out[seqname] = ""
# generate R-style for-loop
    for pos in range(0, len(align[seqname])):
        if align[seqname][pos] == "-":
            try:
                current_gene = next(key for key, a in anno.items() if a[0] <= pos <= a[1])
            except StopIteration:
                print("No annotation provided for position", pos, "in sequence", seqname)
# subtract 1 from lengths of current_gene
            ordseqlens[current_gene] = ordseqlens[current_gene]-1
# append nucleotide unless a gap
        else:
            align_out[seqname] += align[seqname][pos]
# convert modified ordered sequence lengths back into start/stop positions
    summ = 0
    tmp_dict = {}
    for k,v in ordseqlens.items():
        tmp_dict[k] = [summ, v+summ-1]
        summ = v+summ
# save start/stop positions to correct annos
    annos_out[seqname] = tmp_dict

The output of this code is:

>>> align_out
{'seq3': 'ACAC',
 'seq2': 'ATGC',
 'seq1': 'ATGCATGC'}

>>> annos_out
{'seq3': {'gene1': [0, 1], 'gene2': [2, 3]},
 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
 'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}}

So, I think that the approach of trying to break each sequence up into genes and then remove the dashes is resulting in a lot of unnecessary book-keeping. Instead, it might be easier to look at the dashes directly and then update all of the indices based on their relative positions. Here's a function I wrote that appears to be operating correctly:

from copy import copy

def rewriteGenes(align, annos):
    alignments = copy(align)
    annotations = copy(annos)

    for sequence, alignment in alignments.items():
        while alignment.find('-') > -1:
            index = alignment.find('-')
            for gene, (start, end) in annotations[sequence].items():
                if index < start: 
                    annotations[sequence][gene][0] -= 1
                if index <= end: 
                    annotations[sequence][gene][1] -= 1
            alignment = alignment[:index] + alignment[index+1:]
        alignments[sequence] = alignment

    return (alignments, annotations)

This iterates over the dashes in each alignment and updates the gene indices as they are removed.

Note that this produces a gene with indices [2,1] for the following test case:

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
         "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
         "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

This is necessary because the way your indices are setup do not otherwise allow for empty genes. For example, the indices [2,2] would be the sequence of length 1 starting at index 2.

来源：https://stackoverflow.com/questions/34816513/improving-code-design-of-dna-alignment-degapping

标签

python

algorithm

optimization

dna-sequence