dna-sequence | 易学教程

Improving code design of DNA alignment degapping

This is a question regarding a more efficient code design: Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.

    # Input
    align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
             "seq2":"AT----GC",
             "seq3":"A--CA--C"}
    annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
             "seq2":{"gene1":[0,3], "gene2":[4,7]},
             "seq3":{"gene1":[0,3], "gene2":[4,7]}}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the
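The excerpt is cut off before the goal is fully stated, so the following is only a sketch under two assumptions: that the aim is to keep each gene's coordinates valid after the dashes are removed, and that the `[start, stop]` pairs are 0-based and inclusive (which fits `[0,3]` and `[4,7]` on 8-column sequences). The function name `degap` is hypothetical. It uses a prefix count of non-gap characters so that each coordinate is remapped in constant time.

```python
def degap(align, annos, gap="-"):
    """Remove gap characters and remap inclusive [start, stop] annotations.

    Assumes 0-based, inclusive coordinates relative to the aligned sequence.
    A gene that consists only of gaps is mapped to None.
    """
    new_align, new_annos = {}, {}
    for name, seq in align.items():
        # prefix[i] = number of non-gap characters in seq[:i]
        prefix = [0]
        for ch in seq:
            prefix.append(prefix[-1] + (ch != gap))
        new_align[name] = seq.replace(gap, "")
        new_annos[name] = {}
        for gene, (start, stop) in annos[name].items():
            new_start = prefix[start]
            new_stop = prefix[stop + 1] - 1  # back to an inclusive end
            if new_stop >= new_start:
                new_annos[name][gene] = [new_start, new_stop]
            else:
                new_annos[name][gene] = None  # nothing left of this gene
    return new_align, new_annos


align = {"seq1": "ATGCATGC", "seq2": "AT----GC", "seq3": "A--CA--C"}
annos = {name: {"gene1": [0, 3], "gene2": [4, 7]} for name in align}
degapped, remapped = degap(align, annos)
print(degapped)
print(remapped)
```

Building the prefix array once per sequence avoids rescanning the string for every gene, which matters when there are many annotations per sequence.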

Fast algorithms for finding unique sets in two very long sequences of text

I need to compare the DNA sequences of the X and Y chromosomes and find patterns (around 50-75 base pairs long) that are unique to the Y chromosome. Note that these sequence parts can repeat within the chromosome. This needs to be done quickly (BLAST takes 47 days; I need a few hours or less). Are there any algorithms or programs particularly suited to this kind of comparison? Again, speed is the key here. One of the reasons I put this on SO was to get perspective from people outside the specific application domain, who can put forth algorithms they use in string comparison in their daily use,
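One way to read the problem is as a set difference over fixed-length substrings (k-mers): collect every k-mer of X, then keep the k-mers of Y that never occur in X. The sketch below illustrates that idea on toy strings. The function names are hypothetical, it assumes exact matching on a single strand, and holding every k-mer of a real chromosome in a Python set would likely need far more memory than this naive version suggests.

```python
def kmers(seq, k):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_to_y(y_seq, x_seq, k):
    """Return k-mers that occur in y_seq but nowhere in x_seq."""
    return kmers(y_seq, k) - kmers(x_seq, k)


# Toy example with k=4; real data would use k in the 50-75 range.
x = "ACGTACGT"
y = "ACGTTTGT"
print(sorted(unique_to_y(y, x, 4)))
```

Because set membership is a hash lookup, the work grows roughly linearly with the total sequence length rather than with the product of the two lengths.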

Optimal way to cluster set of strings with hamming distance [duplicate]

I have a database with n strings (n > 1 million); each string has 100 characters, and each character is a, b, c or d. I would like to find the closest strings for each one, where "closest" is defined as having the smallest Hamming distance. Specifically, I would like to find the k nearest strings for each one (k < 5). Example:

    N = 5
    i1 = aacbdbbb
    i2 = abcbdbbb
    i3 = bbcadabd
    i4 = bbcadabb

    HammingDistance(i1,i2) = 1
    HammingDistance(i1,i3) = 5
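For reference, a brute-force baseline for the k-nearest query could look like the sketch below. It compares every pair, so it is quadratic in n and not a realistic answer for a million strings; it is only meant to pin down what is being computed. The function names are hypothetical.

```python
import heapq


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def k_nearest(strings, k):
    """Map each index to its k closest other strings as (distance, index)."""
    result = {}
    for i, s in enumerate(strings):
        candidates = ((hamming(s, t), j)
                      for j, t in enumerate(strings) if j != i)
        result[i] = heapq.nsmallest(k, candidates)
    return result


strings = ["aacbdbbb", "abcbdbbb", "bbcadabd", "bbcadabb"]
print(k_nearest(strings, 2))
```

`heapq.nsmallest` keeps only k candidates at a time, so memory per query stays small even though the total running time does not.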

Translation DNA to Protein

I am a biology graduate student and I taught myself a very limited amount of Python in the past few months to deal with some data I have. I am not asking for homework help; this is for a research project. With this code I intend to take a portion of a string called sequence: find the start site of "protein translation," i.e. the first occurrence of ATG (the biological term is start codon), then the first occurrence of TAA (a stop codon). The function translate_dna() should then, for every three letters in the string, swap in the dictionary value. The variable CDS exists properly, but for, or
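The asker's own code is not shown in the excerpt, so the following is an independent sketch rather than a repair of it. It departs from the description in one labeled way: instead of taking the first TAA anywhere in the string, it walks codon by codon from the first ATG and stops at the first in-frame stop codon (TAA, TAG or TGA), which is an assumption about what is actually wanted. The helper `find_cds` is hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_cds(sequence):
    """Return the region from the first ATG through the first in-frame stop.

    Returns an empty string if there is no ATG or no in-frame stop codon.
    """
    start = sequence.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(sequence) - 2, 3):
        if CODON_TABLE.get(sequence[i:i + 3]) == "*":
            return sequence[start:i + 3]
    return ""


def translate_dna(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks unknown codons
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


sequence = "CCATGGCCTAAGG"
cds = find_cds(sequence)
print(cds, translate_dna(cds))
```

Stepping in threes from the start codon keeps the reading frame, so a TAA that merely straddles two codons is not mistaken for a stop.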

matching and counting strings (k-mer of DNA) in R

I have a list of strings (DNA sequences) made up of A, T, C and G. I want to find all matches and insert them into a table whose columns are all possible combinations of the DNA alphabet (4^k, where "k" is the length of each match, the k-mer, and must be specified by the user) and whose rows hold the number of matches in each sequence in the list. Let's say my list includes 5 members:

    DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")

I want to set k=2 (2-mer), so 4^2=16 combinations are available, including AA,AT,AC,AG,TA,TT,... So my table will have 5 rows and 16 columns. I want to count number of matches
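The question asks for R, but the counting logic is language-independent, so here is a small Python sketch of it. It assumes overlapping matches are counted (a window of width k slides one base at a time) and that substrings containing characters outside A, C, G, T are ignored. The function name `kmer_table` is hypothetical.

```python
from itertools import product


def kmer_table(sequences, k, alphabet="ACGT"):
    """Return (columns, rows): all 4**k k-mers and per-sequence counts."""
    columns = ["".join(p) for p in product(alphabet, repeat=k)]
    rows = []
    for seq in sequences:
        counts = dict.fromkeys(columns, 0)
        # Slide a window of width k across the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
        rows.append([counts[c] for c in columns])
    return columns, rows


dna_list = ["CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA"]
columns, rows = kmer_table(dna_list, 2)
print(columns)
for seq, row in zip(dna_list, rows):
    print(seq, row)
```

With k=2 this yields 16 columns and one row per input sequence, matching the 5-by-16 shape described above.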

chaos game for DNA sequences

I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this:

    genome = Import["c:\data\sequence.fasta", "Sequence"];
    genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
    chars = StringCases[genome, "G" | "C" | "T" | "A"];
    f[x_, "A"] := x/2;
    f[x_, "T"] := x/2 + {1/2, 0};
    f[x_, "G"] := x/2 + {1/2, 1/2};
    f[x_, "C"] := x/2 + {0, 1/2};
    pts = FoldList[f, {0.5, 0.5}, chars];
    Graphics[{PointSize[Tiny], Point[pts]}]

The FASTA sequence that I have is just a sequence of letters
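To make the iteration in the Mathematica snippet easier to follow, here is a plain Python sketch of the same rule: each base moves the current point halfway toward that base's corner of the unit square (A at (0,0), T at (1,0), G at (1,1), C at (0,1)), starting from (0.5, 0.5). The function name `chaos_game` is hypothetical, characters other than A, C, G, T are skipped as `StringCases` does, and plotting is left out so the block has no dependencies.

```python
# Corner of the unit square assigned to each base, as in the rules above.
CORNERS = {"A": (0.0, 0.0), "T": (1.0, 0.0), "G": (1.0, 1.0), "C": (0.0, 1.0)}


def chaos_game(sequence, start=(0.5, 0.5)):
    """Return the list of visited points, including the starting point."""
    points = [start]
    x, y = start
    for base in sequence:
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip anything that is not A, C, G or T
        # Move halfway toward the corner: equivalent to x/2 + corner/2.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


print(chaos_game("ATGC"))
```

The returned list plays the role of `pts` and could be passed to any scatter-plot routine.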

R: How to plot correct pie charts in haploNet haplotype networks {pegas} {ape} {adegenet}

When using haploNet to make some plots of a haplotype network, I used a script available on the internet. However, I think there is something wrong. The script is available in the form of the woodmouse example. The code I used is:

    x <- read.dna(file="Masto.fasta",format="fasta")
    h <- haplotype(x)
    net <- haploNet(h)
    plot(net)
    plot(net, size = attr(net, "freq"), fast = TRUE)
    plot(net, size = attr(net, "freq"))
    plot(net, size=attr(net, "freq"), scale.ratio = 2, cex = 0.8)
    table(rownames(x))
    ind.hap<-with( stack(setNames(attr(h, "index"), rownames(h))), table(hap=ind, pop=rownames(x
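The last R statement is truncated, but it appears to build a table of haplotypes by population, which is what a pie chart per network node would need. As a language-neutral illustration of that bookkeeping only, the Python sketch below groups identical sequences into haplotypes and counts how many samples from each population carry each one. It assumes each sample's label is its population name (as `table(rownames(x))` suggests), and the function name `haplotype_by_population` is hypothetical; it does not reproduce pegas.

```python
from collections import Counter, defaultdict


def haplotype_by_population(samples):
    """Count samples per population for each distinct sequence.

    `samples` is an iterable of (population_label, sequence) pairs.
    Returns {sequence: Counter({population_label: count})}.
    """
    table = defaultdict(Counter)
    for population, sequence in samples:
        table[sequence][population] += 1
    return dict(table)


samples = [
    ("popA", "ACGT"),
    ("popA", "ACGT"),
    ("popB", "ACGT"),
    ("popB", "ACGA"),
]
for haplotype, counts in haplotype_by_population(samples).items():
    print(haplotype, dict(counts))
```

Each inner counter gives the slice sizes for one pie; if the rows of such a table are in a different order from the nodes of the network, the pies end up attached to the wrong haplotypes.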
