dna-sequence

Improving code design of DNA alignment degapping

别等时光非礼了梦想. Submitted on 2019-12-05 05:05:41
This is a question regarding a more efficient code design: Assume three aligned DNA sequences (seq1, seq2 and seq3; each is a string) that represent two genes (gene1 and gene2). The start and stop positions of these genes are known relative to the aligned DNA sequences.

    # Input
    align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
             "seq2":"AT----GC",
             "seq3":"A--CA--C"}
    annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
             "seq2":{"gene1":[0,3], "gene2":[4,7]},
             "seq3":{"gene1":[0,3], "gene2":[4,7]}}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the
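The excerpt is cut off, so the following is only a minimal sketch under an assumption: that the goal is to strip the dashes and keep each gene's inclusive [start, stop] pair pointing at the same residues in the degapped sequence. The function name `degap` and the choice to return `None` for a gene that consists only of gaps are hypothetical, not taken from the question.

```python
def degap(align, annos, gap="-"):
    """Remove gap characters and shift inclusive [start, stop] annotations.

    Assumption: coordinates are 0-based and inclusive, as in the example input.
    """
    new_align, new_annos = {}, {}
    for name, seq in align.items():
        # prefix[i] = number of non-gap characters in seq[:i]
        prefix = [0]
        for ch in seq:
            prefix.append(prefix[-1] + (ch != gap))
        new_align[name] = seq.replace(gap, "")
        new_annos[name] = {}
        for gene, (start, stop) in annos[name].items():
            new_start = prefix[start]          # residues before the gene
            new_stop = prefix[stop + 1] - 1    # last residue inside the gene
            # A gene made up only of gaps has no residues left (hypothetical choice).
            if new_stop >= new_start:
                new_annos[name][gene] = [new_start, new_stop]
            else:
                new_annos[name][gene] = None
    return new_align, new_annos


align = {"seq1": "ATGCATGC", "seq2": "AT----GC", "seq3": "A--CA--C"}
annos = {name: {"gene1": [0, 3], "gene2": [4, 7]} for name in align}
new_align, new_annos = degap(align, annos)
```

A single prefix-count pass per sequence avoids rescanning the string for every gene, which is one way to keep the design simple when there are many annotations.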

Fast algorithms for finding unique sets in two very long sequences of text

吃可爱长大的小学妹. Submitted on 2019-12-04 23:18:43
I need to compare the DNA sequences of the X and Y chromosomes and find patterns (around 50-75 base pairs long) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days; I need a few hours or less). Are there any algorithms or programs particularly suited to this kind of comparison? Again, speed is the key here. One of the reasons I put this on SO was to get perspective from people outside the specific application domain, who can put forth algorithms they use in string comparison in their daily use,
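One common general-purpose idea for this kind of problem is to index every fixed-length substring (k-mer) of one sequence in a hash set and then scan the other sequence against it. The sketch below illustrates only that idea on toy strings. The function names and the choice of k are hypothetical, reverse complements are ignored, and a plain Python set would likely need far too much memory for whole chromosomes, so this is not presented as a solution at that scale.

```python
def kmers(seq, k):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_to(target, background, k):
    """Map each k-mer that occurs in target but never in background
    to the list of positions where it starts in target."""
    seen = kmers(background, k)
    hits = {}
    for i in range(len(target) - k + 1):
        kmer = target[i:i + k]
        if kmer not in seen:
            hits.setdefault(kmer, []).append(i)
    return hits


# Toy stand-ins for the two chromosomes (made-up data).
x_like = "ACGTAGCA"
y_like = "ACGTTGCA"
y_only = unique_to(y_like, x_like, k=4)
```

Both passes are linear in the sequence lengths (for fixed k), which is the property that makes set-based k-mer comparison attractive compared with pairwise alignment.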

Optimal way to cluster set of strings with hamming distance [duplicate]

无人久伴. Submitted on 2019-12-04 22:25:52
This question already has an answer here: Fast computation of pairs with least hamming distance (1 answer); Finding Minimum hamming distance of a set of strings in python (4 answers).

I have a database with n strings (n > 1 million). Each string has 100 chars, and each char is either a, b, c or d. I would like to find the closest strings for each one, where closest is defined as having the smallest Hamming distance. I would like to find the k nearest strings for each one (k < 5). Example:

    N = 5
    i1 = aacbdbbb
    i2 = abcbdbbb
    i3 = bbcadabd
    i4 = bbcadabb
    HammingDistance(i1,i2) = 1
    HammingDistance(i1,i3) = 5
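To make the definitions concrete, here is a brute-force sketch of the Hamming distance and of the k nearest neighbours of each string. It compares every pair, so it is quadratic in n and is meant only to pin down what is being asked, not to handle a million strings. The function names are hypothetical.

```python
import heapq


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def k_nearest(strings, k):
    """For each index i, return the k closest other strings as
    (distance, index) pairs, nearest first. Brute force: O(n^2) comparisons."""
    result = {}
    for i, s in enumerate(strings):
        others = ((hamming(s, t), j) for j, t in enumerate(strings) if j != i)
        result[i] = heapq.nsmallest(k, others)
    return result


# The four strings listed in the example (i1..i4).
strings = ["aacbdbbb", "abcbdbbb", "bbcadabd", "bbcadabb"]
nearest = k_nearest(strings, k=1)
```

Ties are broken by the lower index because the pairs are compared as tuples.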

chaos game for DNA sequences

一曲冷凌霜. Submitted on 2019-12-04 09:34:59
Question: I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this:

    genome = Import["c:\data\sequence.fasta", "Sequence"];
    genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
    chars = StringCases[genome, "G" | "C" | "T" | "A"];
    f[x_, "A"] := x/2;
    f[x_, "T"] := x/2 + {1/2, 0};
    f[x_, "G"] := x/2 + {1/2, 1/2};
    f[x_, "C"] := x/2 + {0, 1/2};
    pts = FoldList[f, {0.5, 0.5}, chars];

matching and counting strings (k-mer of DNA) in R

◇◆丶佛笑我妖孽. Submitted on 2019-12-03 13:31:06
Question: I have a list of strings (DNA sequences) made up of A, T, C and G. I want to find all matches and insert them into a table whose columns are all possible combinations of the DNA alphabet (4^k, where "k" is the length of each match, i.e. the k-mer, and must be specified by the user) and whose rows represent the number of matches in each sequence in the list. Let's say my list includes 5 members:

    DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")

I want to set k=2 (2-mer), so 4^2=16 combinations are available, including AA

Translation DNA to Protein

别来无恙. Submitted on 2019-12-03 09:22:02
I am a biology graduate student and I taught myself a very limited amount of Python in the past few months to deal with some data I have. I am not asking for homework help; this is for a research project. With this code I intend to take the portion of a string called sequence between the start site of "protein translation," i.e. the first occurrence of ATG (the biological term is start codon), and the first occurrence of TAA (a stop codon). Then the function translate_dna() should, for every three letters in the string, swap in the dictionary value. The variable CDS exists properly, but for, or
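The asker's own code is not shown in the excerpt, so the following is a minimal sketch of the idea rather than a fix for it. It makes two assumptions that differ from the description: the stop codon is searched for in frame with the ATG (a plain search for TAA could land out of frame), and any of the three standard stop codons (TAA, TAG, TGA) ends translation. The `CODON_TABLE` name is hypothetical. The table itself is the standard genetic code, built from the usual TCAG ordering.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_dna(sequence):
    """Translate from the first ATG up to the first in-frame stop codon.

    Returns an empty string when there is no ATG. Codons containing
    characters other than A, C, G, T are rendered as 'X'.
    """
    seq = sequence.upper()
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":  # stop codon reached
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate_dna("CCATGGCCTAAGG")
```

Walking the string three letters at a time from the start codon is what keeps the reading frame intact.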

matching and counting strings (k-mer of DNA) in R

你说的曾经没有我的故事. Submitted on 2019-12-03 03:28:09
I have a list of strings (DNA sequences) made up of A, T, C and G. I want to find all matches and insert them into a table whose columns are all possible combinations of the DNA alphabet (4^k, where "k" is the length of each match, i.e. the k-mer, and must be specified by the user) and whose rows represent the number of matches in each sequence in the list. Let's say my list includes 5 members:

    DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")

I want to set k=2 (2-mer), so 4^2=16 combinations are available, including AA,AT,AC,AG,TA,TT,... So my table will have 5 rows and 16 columns. I want to count number of matches
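The question is about R, but the counting logic is language-independent, so here is a minimal sketch in Python. It assumes that overlapping matches are counted (the excerpt is cut off before it says), and it orders the columns as AA, AT, AC, AG, TA, ... to match the listing above. The function name `kmer_table` is hypothetical.

```python
from itertools import product


def kmer_table(seqs, k, alphabet="ATCG"):
    """Return (columns, rows): the 4^k possible k-mers and, for each
    sequence, the count of every k-mer (overlapping occurrences)."""
    columns = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(columns)}
    rows = []
    for seq in seqs:
        row = [0] * len(columns)
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])
            if j is not None:  # skip windows with characters outside the alphabet
                row[j] += 1
        rows.append(row)
    return columns, rows


dna_list = ["CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA"]
columns, rows = kmer_table(dna_list, k=2)
```

For the five example sequences and k=2 this yields 5 rows and 16 columns, as described in the question.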

chaos game for DNA sequences

只愿长相守. Submitted on 2019-12-03 03:25:01
I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this:

    genome = Import["c:\data\sequence.fasta", "Sequence"];
    genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
    chars = StringCases[genome, "G" | "C" | "T" | "A"];
    f[x_, "A"] := x/2;
    f[x_, "T"] := x/2 + {1/2, 0};
    f[x_, "G"] := x/2 + {1/2, 1/2};
    f[x_, "C"] := x/2 + {0, 1/2};
    pts = FoldList[f, {0.5, 0.5}, chars];
    Graphics[{PointSize[Tiny], Point[pts]}]

The FASTA sequence that I have is just a sequence of letters
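For readers who want to check the iteration independently of Mathematica, the same update rule can be sketched in Python. This mirrors the quoted definitions of `f` (halve the current point, then add a per-base offset) and, like `FoldList`, keeps the starting point in the output. The names `OFFSETS` and `cgr_points` are hypothetical, and unlike the quoted `StringCases` call this sketch also accepts lowercase letters.

```python
# Offsets taken from the quoted definitions of f for A, T, G and C.
OFFSETS = {
    "A": (0.0, 0.0),
    "T": (0.5, 0.0),
    "G": (0.5, 0.5),
    "C": (0.0, 0.5),
}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the chaos game points for a DNA string, starting point included.

    Characters other than A, T, G, C (for example newlines) are skipped.
    """
    points = [start]
    x, y = start
    for base in sequence.upper():
        if base not in OFFSETS:
            continue
        dx, dy = OFFSETS[base]
        # Move halfway toward the origin, then shift by the base's offset.
        x, y = x / 2 + dx, y / 2 + dy
        points.append((x, y))
    return points


pts = cgr_points("AT")
```

The resulting list of (x, y) pairs can be passed to any scatter-plot routine to reproduce the picture.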

R- How to plot correct pie charts in haploNet haplotype Networks {pegas} {ape} {adegenet}

不想你离开。 Submitted on 2019-12-01 11:28:22
When using the haploNet package to make some plots of a haplotype network, I used a script available on the internet. However, I think there is something wrong. The script is available in the form of the woodmouse example. The code I used is:

    x <- read.dna(file="Masto.fasta",format="fasta")
    h <- haplotype(x)
    net <- haploNet(h)
    plot(net)
    plot(net, size = attr(net, "freq"), fast = TRUE)
    plot(net, size = attr(net, "freq"))
    plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8)
    table(rownames(x))
    ind.hap<-with( stack(setNames(attr(h, "index"), rownames(h))), table(hap=ind, pop=rownames(x

R- How to plot correct pie charts in haploNet haplotype Networks {pegas} {ape} {adegenet}

做~自己de王妃. Submitted on 2019-12-01 07:36:28
Question: When using the haploNet package to make some plots of a haplotype network, I used a script available on the internet. However, I think there is something wrong. The script is available in the form of the woodmouse example. The code I used is:

    x <- read.dna(file="Masto.fasta",format="fasta")
    h <- haplotype(x)
    net <- haploNet(h)
    plot(net)
    plot(net, size = attr(net, "freq"), fast = TRUE)
    plot(net, size = attr(net, "freq"))
    plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8)
    table