This is a question regarding a more efficient code design:
Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.
# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.
# Desired output
align = {"seq1":"ATGCATGC",
"seq2":"ATGC",
"seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,1], "gene2":[2,3]},
"seq3":{"gene1":[0,1], "gene2":[2,3]}}
Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.
1 measure length of any aligned gene # take any seq, since all seqs aligned
2 list_lengths = list of gene lengths # order is important
3 for seq in alignment
4 outseq = ""
5 for each num in range(0, length(seq)) # weird for-loop is intentional
6 if seq[num] == "-"
7 current_gene = gene whose start/stop positions include num
8 subtract 1 from length of current_gene
9 subtract 1 from lengths of all genes following current_gene in list_lengths
10 else
11 append seq[num] to outseq
12 append outseq to new variable
13 convert gene lengths into start/stop positions and append ordered to new variable
Can anyone give me suggestions/examples for a shorter, more direct code design?
This answer handles your updated annos
dictionary from the comment to cdlanes answer. That answer leaves the annos
dictionary with the incorrect index of [2,1] for seq2
gene2
. My proposed solution will remove the gene
entry from the dictionary if the sequence contains ALL gaps in that region. Also to note, if a gene contains only one letter in the final align
, then anno[geneX]
will have equal indices for start and stop --> See seq3
gene1
from your commented annos
.
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}
import re
for name,anno in annos.items():
# indices of gaps removed usinig re
removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]
# removes gaps from align dictionary
align[name] = re.sub(r'-', '', align[name])
build_dna = ''
for gene,inds in anno.items():
start_ind = len(build_dna)+1
#generator to sum the num '-' removed from gene
num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])
# build the de-gapped string
build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")
end_ind = len(build_dna)
if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
del annos[name][gene] #remove the gene entry
continue
#update the values in the annos dictionary
annos[name][gene][0] = start_ind-1
annos[name][gene][1] = end_ind-1
Results:
In [3]: annos
Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
Results from the 3 gene annos
above. Just replace the annos
variable:
In [5]: annos3
Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}
The following matches the output of example program for both test cases:
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
(START, STOP) = (0, 1)
for alignment, sequence in align.items():
new_sequence = ''
gap = 0
for position, codon in enumerate(sequence):
if '-' == codon:
for gene in annos[alignment].values():
if gene[START] > (position - gap):
gene[START] -= 1
if gene[STOP] >= (position - gap):
gene[STOP] -= 1
gap += 1
else:
new_sequence += codon
align[alignment] = new_sequence
The result of running this:
python3 -i test.py
>>> align
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'}
>>>
>>> annos
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
>>>
I hope this is still elegant, direct, short and Pythonic enough!
My own solution to the above question is neither elegant nor Pythonic, but arrives at the desired output. Any recommendations for improvement are highly welcome!
import collections
import operator
# measure length of any aligned gene # take any seq, since all seqs aligned
align_len = len(align.itervalues().next())
# initialize output
align_out, annos_out = {}, {}
# loop through annos
for seqname, anno in annos.items():
# operate on ordered sequence lengths instead on ranges
ordseqlens = collections.OrderedDict()
# generate ordered sequence lengths
for k,v in sorted(anno.items(), key=operator.itemgetter(1)):
ordseqlens[k] = v[1]-v[0]+1
# start (and later append to) sequence output
align_out[seqname] = ""
# generate R-style for-loop
for pos in range(0, len(align[seqname])):
if align[seqname][pos] == "-":
try:
current_gene = next(key for key, a in anno.items() if a[0] <= pos <= a[1])
except StopIteration:
print("No annotation provided for position", pos, "in sequence", seqname)
# subtract 1 from lengths of current_gene
ordseqlens[current_gene] = ordseqlens[current_gene]-1
# append nucleotide unless a gap
else:
align_out[seqname] += align[seqname][pos]
# convert modified ordered sequence lengths back into start/stop positions
summ = 0
tmp_dict = {}
for k,v in ordseqlens.items():
tmp_dict[k] = [summ, v+summ-1]
summ = v+summ
# save start/stop positions to correct annos
annos_out[seqname] = tmp_dict
The output of this code is:
>>> align_out
{'seq3': 'ACAC',
'seq2': 'ATGC',
'seq1': 'ATGCATGC'}
>>> annos_out
{'seq3': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}}
So, I think that the approach of trying to break each sequence up into genes and then remove the dashes is resulting in a lot of unnecessary book-keeping. Instead, it might be easier to look at the dashes directly and then update all of the indices based on their relative positions. Here's a function I wrote that appears to be operating correctly:
from copy import copy
def rewriteGenes(align, annos):
alignments = copy(align)
annotations = copy(annos)
for sequence, alignment in alignments.items():
while alignment.find('-') > -1:
index = alignment.find('-')
for gene, (start, end) in annotations[sequence].items():
if index < start:
annotations[sequence][gene][0] -= 1
if index <= end:
annotations[sequence][gene][1] -= 1
alignment = alignment[:index] + alignment[index+1:]
alignments[sequence] = alignment
return (alignments, annotations)
This iterates over the dashes in each alignment and updates the gene indices as they are removed.
Note that this produces a gene with indices [2,1]
for the following test case:
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}
This is necessary because the way your indices are setup do not otherwise allow for empty genes. For example, the indices [2,2]
would be the sequence of length 1 starting at index 2.
来源:https://stackoverflow.com/questions/34816513/improving-code-design-of-dna-alignment-degapping