dna-sequence

Fast algorithms for finding unique sets in two very long sequences of text

橙三吉。 submitted on 2019-12-22 01:49:23
Question: I need to compare the DNA sequences of the X and Y chromosomes and find patterns (around 50-75 base pairs long) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days; I need a few hours or less). Are there any algorithms or programs particularly suited to this kind of comparison? Again, speed is the key here. One of the reasons I put this on SO was to get perspective from people outside the
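
One way to frame the comparison described above is as a set difference over fixed-length substrings (k-mers). The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the inputs are toy strings rather than real chromosomes, reverse complements are ignored, and plain Python sets would likely need a more compact encoding to hold chromosome-scale data in memory.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(target, background, k):
    """Return the k-mers that occur in target but nowhere in background."""
    return kmers(target, k) - kmers(background, k)


# Toy stand-ins for the Y and X chromosome sequences (hypothetical data).
y_seq = "ACGTACGTTTGA"
x_seq = "ACGTACGTAAAC"
print(sorted(unique_kmers(y_seq, x_seq, 4)))
```

Because each sequence is scanned once and set lookups take constant time on average, the work grows roughly linearly with the combined sequence length.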

Fast algorithms for finding unique sets in two very long sequences of text

余生长醉 submitted on 2019-12-22 01:47:10
Question: I need to compare the DNA sequences of the X and Y chromosomes and find patterns (around 50-75 base pairs long) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days; I need a few hours or less). Are there any algorithms or programs particularly suited to this kind of comparison? Again, speed is the key here. One of the reasons I put this on SO was to get perspective from people outside the

How to convert a set of DNA sequences into protein sequences using python programming?

被刻印的时光 ゝ submitted on 2019-12-21 20:21:55
Question: I am using Python to create a program that converts a set of DNA sequences into amino acid (protein) sequences. I then need to find a specific subsequence and count the number of sequences in which it is present. This is the code I have so far:
    # Open cDNA_sequences file and read in line by line
    with open('cDNA_sequences.csv', 'r') as results:
        for line in results:
            columns = line.rstrip("\n").split(",")  # remove end-of-line characters and split on commas to produce a list
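
The translate-then-count task described above can be sketched as follows. This is an illustration, not the asker's code: `translate` and `count_with_motif` are hypothetical names, the table is the standard genetic code built from the conventional TCAG ordering, and unknown or incomplete codons are assumed to map to `X`.

```python
from itertools import product

# Standard genetic code, listed in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def count_with_motif(dna_sequences, motif):
    """Count how many sequences contain motif after translation."""
    return sum(motif in translate(seq) for seq in dna_sequences)
```

For example, `count_with_motif(["ATGGCC", "GCCGCC"], "MA")` counts only the first sequence, because only it translates to a protein containing `MA`.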

Search for string allowing for one mismatch in any location of the string

送分小仙女□ submitted on 2019-12-17 10:37:41
Question: I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (the Toxoplasma gondii parasite). I am not sure how large the genome is, but it is much longer than 230,000 sequences. I need to look for each of my sequences of 25 characters, for example, (AGCCTCCCATGATTGAACAGATCAT). The genome is formatted as a continuous string, i.e.
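
One possible approach to the one-mismatch search described above relies on a pigeonhole argument: if a pattern matches with at most one mismatch, at least one of its two halves must match exactly. The sketch below uses that idea with hypothetical function names, assumes the pattern has at least two characters, and is not tuned for 230,000 queries against a full genome.

```python
def within_one_mismatch(a, b):
    """Return True if equal-length strings a and b differ in at most one position."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > 1:
                return False
    return True


def find_one_mismatch(genome, pattern):
    """Return sorted start positions where pattern occurs with <= 1 mismatch."""
    n = len(pattern)
    half = n // 2
    # Each seed is (offset of the seed within the pattern, seed text).
    seeds = [(0, pattern[:half]), (half, pattern[half:])]
    hits = set()
    for offset, seed in seeds:
        pos = genome.find(seed)
        while pos != -1:
            start = pos - offset
            if 0 <= start <= len(genome) - n:
                if within_one_mismatch(genome[start:start + n], pattern):
                    hits.add(start)
            pos = genome.find(seed, pos + 1)
    return sorted(hits)
```

Exact seed lookups with `str.find` keep the expensive character-by-character check limited to the few candidate windows that already share half of the pattern.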

Overlapping matches in R

无人久伴 submitted on 2019-12-17 09:50:04
Question: I have searched and was able to find this forum discussion on achieving the effect of overlapping matches. I also found the following SO question about finding indexes to perform this task, but was not able to find anything concise about grabbing overlapping matches in the R language. I can perform this task in almost any language that supports PCRE by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapped matches.
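
The lookahead technique mentioned above can be illustrated as follows. The sketch uses Python's `re` module rather than R, the helper name is hypothetical, and the pattern passed in is assumed to contain no capturing groups of its own.

```python
import re


def overlapping_matches(pattern, text):
    """Return all matches of pattern in text, including overlapping ones.

    The pattern is wrapped in a capturing group inside a zero-width positive
    lookahead, so the regex engine advances one character at a time instead
    of skipping past each match.
    """
    return re.findall(f"(?=({pattern}))", text)


print(overlapping_matches("ATA", "ATATATA"))
```

A plain `re.findall("ATA", "ATATATA")` would report only the non-overlapping occurrences, because each match consumes the characters it covers.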

Creating a hash of arrays for DNA sequences, Perl

≯℡__Kan透↙ submitted on 2019-12-13 03:14:02
Question: I have a hash called %id2seq that contains strings of DNA sequences that are referenced by the key $id. I want to be able to manipulate the DNA sequences by using a position within the string as a reference. For example, if my DNA sequence was ACGTG, my $id would be Sequence 1, my $id2seq{'Sequence 1'} would be ACGTG, and my "theoretical" $id2seq{'Sequence 1'}[3] would be G. I am attempting to create a hash of arrays to do this, but I'm getting a weird output (see below output). I'm

Improving code design of DNA alignment degapping

ぐ巨炮叔叔 submitted on 2019-12-12 09:38:02
Question: This is a question regarding a more efficient code design: Assume three aligned DNA sequences (seq1, seq2 and seq3; each is a string) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.
    # Input
    align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
             "seq2":"AT----GC",
             "seq3":"A--CA--C"}
    annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
             "seq2":{"gene1":[0,3], "gene2":[4,7]},
             "seq3":{"gene1":

Generating Synthetic DNA Sequence with Substitution Rate

纵饮孤独 submitted on 2019-12-12 08:07:40
Question: Given these inputs:
    my $init_seq = "AAAAAAAAAA"; # length 10 bp
    my $sub_rate = 0.003;
    my $nof_tags = 1000;
    my @dna = qw( A C G T );
I want to generate:
    - One thousand length-10 tags
    - Substitution rate for each position in a tag is 0.003
Yielding output like:
    AAAAAAAAAA
    AATAACAAAA
    .....
    AAGGAAAAGA # 1000th tag
Is there a compact way to do it in Perl? I am stuck with the logic of this script as core:
    #!/usr/bin/perl
    my $init_seq = "AAAAAAAAAA"; # length 10 bp
    my $sub_rate = 0.003;
    my $nof_tags = 1000
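
The generation logic described above can be sketched as follows. This is a Python illustration rather than the Perl the asker wants, the function names are hypothetical, and it assumes that a substitution always replaces a base with a different one (the question does not say whether a base may be replaced by itself).

```python
import random


def mutate(seq, sub_rate, rng, alphabet="ACGT"):
    """Return a copy of seq with each position substituted with probability sub_rate."""
    out = []
    for base in seq:
        if rng.random() < sub_rate:
            # Assumption: a substitution never picks the original base.
            base = rng.choice([b for b in alphabet if b != base])
        out.append(base)
    return "".join(out)


def generate_tags(init_seq, sub_rate, nof_tags, seed=None):
    """Generate nof_tags independently mutated copies of init_seq."""
    rng = random.Random(seed)
    return [mutate(init_seq, sub_rate, rng) for _ in range(nof_tags)]


tags = generate_tags("AAAAAAAAAA", 0.003, 1000, seed=1)
print(len(tags), tags[0])
```

Passing a `seed` makes the output reproducible, which helps when checking that the observed substitution frequency is close to the requested rate.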

Reading at three different frames

若如初见. submitted on 2019-12-11 19:27:27
Question: So I'm trying to create a class that reads a DNA string in three different frames: one that starts at position 0 (the first base), another that starts at position 1 (the second base), and a third that starts at position 2 (the third base). So far, this is what I've been playing around with:
    def codons(self, frame_one, frame_two, frame_three):
        start = frame_one
        while start + 3 <=len(self.seq):
            yield (self.seq[start:start+3], start)
            start += 3
        start+1 = frame_two
        while start + 3 <
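
One way to read the three frames described above is to take the frame offset as a single parameter and call the same generator three times. The sketch below uses plain functions instead of the asker's class, and the names are hypothetical.

```python
def codons(seq, frame):
    """Yield (codon, start) pairs for seq read from the given frame offset (0, 1 or 2)."""
    # Stop at len(seq) - 2 so that only complete three-base codons are yielded.
    for start in range(frame, len(seq) - 2, 3):
        yield seq[start:start + 3], start


def three_frames(seq):
    """Return a dict mapping each frame offset to its list of (codon, start) pairs."""
    return {frame: list(codons(seq, frame)) for frame in range(3)}


for frame, triplets in three_frames("ATGCCGTA").items():
    print(frame, triplets)
```

Passing the offset as an argument avoids repeating the loop once per frame, and trailing bases that do not fill a complete codon are simply dropped.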

Java program malfunction

扶醉桌前 submitted on 2019-12-11 11:55:29
Question: First half of my question: When I try to run my program, it loads forever and never shows the results. Could someone check my code and spot the error? This program is meant to find a start DNA codon ATG, keep looking until it finds a stop codon (TAA, TAG or TGA), and then print out the gene from start to stop. I'm using BlueJ. Second half of my question: I'm supposed to write a program in which the following steps need to be taken: To find the first gene, find
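
The start-to-stop search described above can be sketched as follows. This is a Python illustration rather than the asker's Java program, the function name is hypothetical, and it assumes that the stop codon must be in the same reading frame as the ATG, which the excerpt does not state explicitly.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_first_gene(dna, start=0):
    """Return the first ATG...stop substring found at or after start, or '' if none."""
    begin = dna.find("ATG", start)
    while begin != -1:
        # Step through the codons that follow the ATG, three bases at a time.
        for i in range(begin + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[begin:i + 3]
        # No in-frame stop codon for this ATG: try the next ATG, if any.
        begin = dna.find("ATG", begin + 1)
    return ""


print(find_first_gene("CCATGAAATAGCC"))
```

Both loops move their index forward on every iteration, so the search ends even when the input contains no complete gene.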