bioinformatics | 易学教程

Can I use K-means algorithm on a string?

Question: I am working on a Python project where I study RNA structure evolution (each structure is represented as a string, for example "(((...)))", where the parentheses represent base pairs). I have an ideal structure and a population that evolves towards it. I have implemented everything; however, I would like to add a feature that reports the "number of buckets", i.e. the k most representative structures in the population at each generation. I was thinking of using the k-means
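K-means needs a mean of the data points, which strings do not have. A related technique, k-medoids, only needs a pairwise distance and always returns members of the population as cluster centres. The block below is a minimal sketch of that swapped-in technique, not the asker's code: it assumes all structures have the same length and uses Hamming distance, and the names `hamming` and `k_medoids` are illustrative.

```python
import random


def hamming(a, b):
    """Number of mismatching positions (assumes equal-length strings)."""
    return sum(x != y for x, y in zip(a, b))


def k_medoids(items, k, dist=hamming, iters=20, seed=0):
    """Return up to k representative strings chosen from `items`."""
    rng = random.Random(seed)
    unique = sorted(set(items))
    medoids = rng.sample(unique, min(k, len(unique)))
    for _ in range(iters):
        # Assign every structure to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for s in items:
            nearest = min(medoids, key=lambda m: dist(s, m))
            clusters[nearest].append(s)
        # Re-pick each medoid as the member with the smallest total
        # distance to the rest of its cluster.
        updated = [
            min(members, key=lambda c: sum(dist(c, s) for s in members))
            for members in clusters.values()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return medoids


population = ["(((...)))"] * 3 + ["((.....))"]
representative = k_medoids(population, 1)
```

Duplicates in the population are kept during assignment, so a structure that occurs often pulls the medoid towards itself. For structures of different lengths, an edit distance could be passed as `dist` instead.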

How do I change this to “idiomatic” Perl?

I am beginning to delve deeper into Perl, but am having trouble writing "Perl-ly" code instead of writing C in Perl. How can I change the following code to use more Perl idioms, and how should I go about learning the idioms? Just an explanation of what it is doing: this routine is part of a module that aligns DNA or amino acid sequences (using Needleman-Wunsch, if you care about such things). It creates two 2D arrays, one to store a score for each position in the two sequences, and one to keep track of the path so the highest-scoring alignment can be recreated later. It works fine, but I know I

Faster way to split a string and count characters using R?

Question: I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider. I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this: ## ## count the number of GCs in the characters between start and stop ## gcCount <- function(line, st, sp){ chars = strsplit(as
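The counting step described above can be done without splitting the string into single characters. The following is a sketch in Python rather than R, and it is not the asker's function: the name `gc_count` is illustrative, and the 1-based inclusive `start`/`stop` convention is an assumption borrowed from how R indexes strings.

```python
def gc_count(seq, start, stop):
    """Count G and C in the 1-based, inclusive range [start, stop].

    The 1-based inclusive range is an assumption for illustration.
    """
    window = seq[start - 1:stop].upper()
    # str.count scans the slice directly; no per-character list is built.
    return window.count("G") + window.count("C")


whole = gc_count("ATGCGC", 1, 6)
middle = gc_count("ATGCGC", 3, 4)
```

Dividing the count by the window length would give GC content as a fraction.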

“average length of the sequences in a fasta file”: Can you improve this Erlang code?

I'm trying to get the mean length of FASTA sequences using Erlang. A FASTA file looks like this:

    >title1
    ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
    ATCGATCGCATCGATGCTACGATCGATCATATA
    ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
    ATCGATCGCATCGATGCTACGATCTCGTACGC
    >title2
    ATCGATCGCATCGATGCTACGATCTCGTACGC
    ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
    ATCGATCGCATCGATGCTACGATCGATCATATA
    ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
    >title3
    ATCGATCGCATCGAT(...)

I tried to answer this question using the following Erlang code:

    -module(golf).
    -export([test/0]).
    line([],{Sequences,Total}) -> {Sequences,Total};
    line(">" ++ Rest,{Sequences,Total}) ->
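For comparison, the same computation can be sketched in Python (not Erlang, and not a completion of the truncated code above). The function name `mean_fasta_length` is illustrative; the sketch assumes header lines start with ">" and that every other non-blank line is sequence data.

```python
import io


def mean_fasta_length(handle):
    """Mean sequence length over all records in a FASTA stream."""
    records = 0
    total = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records += 1  # a header opens a new record
        else:
            total += len(line)  # sequence lines add to the running total
    return total / records if records else 0.0


example = io.StringIO(">title1\nATGC\nAT\n>title2\nGG\n")
mean_length = mean_fasta_length(example)
```

Because only two counters are kept, the file is read in a single pass and never held in memory.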

Clojure or Scala for bioinformatics/biostatistics/medical research [closed]

Question: Closed 3 years ago as opinion-based; not currently accepting answers. I am not a professional programmer (my area is medical research), but I am quite capable in C/C++ and various scripting languages. A while back I got intrigued by Lisp, but I never got the time to seriously learn it. After a brief exposure to R I decided to invest more time

Traceback in Smith-Waterman algorithm with affine gap penalty

I'm trying to implement the Smith-Waterman algorithm for local sequence alignment using the affine gap penalty function. I think I understand how to initialise and compute the matrices required for calculating alignment scores, but am clueless as to how to then trace back to find the alignment. To generate the 3 matrices required I have the following code:

    for j in range(1, len2):
        for i in range(1, len1):
            fxOpen = F[i][j-1] + gap
            xExtend = Ix[i][j-1] + extend
            Ix[i][j] = max(fxOpen, xExtend)
            fyOpen = F[i-1][j] + gap
            yExtend = Iy[i-1][j] + extend
            Iy[i][j] = max(fyOpen, yExtend)
            matchScore = (F[i-1]
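One common way to trace back through three matrices is to remember which matrix the path is currently in and to re-check the recurrence at each step. The block below is a self-contained sketch under stated assumptions, not a completion of the truncated code above: it reuses the names `F`, `Ix`, `Iy`, `gap` and `extend`, but the scoring values, the function name, and the convention that `F` holds only alignments ending in a match or mismatch are all assumptions. A gap of length L costs `gap + (L - 1) * extend`.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap=-3, extend=-1):
    """Local alignment with affine gaps; returns (score, top, bottom)."""
    neg = float("-inf")
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]     # ends in match/mismatch
    Ix = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with b[j-1] against a gap
    Iy = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a[i-1] against a gap
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            Ix[i][j] = max(F[i][j - 1] + gap, Ix[i][j - 1] + extend)
            Iy[i][j] = max(F[i - 1][j] + gap, Iy[i - 1][j] + extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            diag = max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            F[i][j] = max(0, diag + s)
            if F[i][j] > best:
                best, end = F[i][j], (i, j)

    # Traceback: start at the best cell in F and stop at the first zero.
    i, j = end
    state = "F"
    top, bottom = [], []
    while i > 0 and j > 0:
        if state == "F":
            if F[i][j] == 0:
                break
            s = match if a[i - 1] == b[j - 1] else mismatch
            prev = F[i][j] - s  # score of the cell this one was built from
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
            if prev == F[i][j]:
                state = "F"
            elif prev == Ix[i][j]:
                state = "Ix"
            else:
                state = "Iy"
        elif state == "Ix":
            top.append("-")
            bottom.append(b[j - 1])
            if Ix[i][j] == F[i][j - 1] + gap:
                state = "F"  # the gap was opened here
            j -= 1
        else:
            top.append(a[i - 1])
            bottom.append("-")
            if Iy[i][j] == F[i - 1][j] + gap:
                state = "F"  # the gap was opened here
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman_affine("ACGTACGT", "ACGTTACGT")
```

When several predecessors tie, this sketch prefers staying in `F`, so it reports one optimal alignment rather than all of them.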

How to plot positions along a chromosome graphic

I would like to generate a plot depicting 14 linear chromosomes for the organism I work on, to scale, with coloured bars at specified locations along each chromosome. Ideally I'd like to use R, as this is the only programming language I have experience with. I have explored various ways of doing this, e.g. with GenomeGraphs, but I have found that these are all more complicated than what I want, display a lot more data than I have (e.g. cytogenetic bands), and are often specific to human chromosomes. All I essentially want is 14 grey bars of the following sizes: chromosome size 1 640851 2

Which functional programming languages have bioinformatics libraries? [closed]

Closed. This question needs to be more focused and is not currently accepting answers. Which functional programming languages have bioinformatics libraries easily available? (Don't include multi-paradigm languages such as Ruby.) Update: Listing which major functional programming languages don't currently have easy access to bioinformatics libraries is also welcome. Do you consider R a functional rather than a multi-paradigm language? If so, R has the biggest set of libraries

How to call module written with argparse in IPython notebook

I am trying to pass BioPython sequences to Ilya Stepanov's implementation of Ukkonen's suffix tree algorithm in IPython's notebook environment. I am stumbling on the argparse component. I have never had to deal directly with argparse before. How can I use this without rewriting main()? By the by, this writeup of Ukkonen's algorithm is fantastic.

I've had a similar problem before, but using optparse instead of argparse. You don't need to change anything in the original script, just assign a new list to sys.argv like so:

    if __name__ == "__main__":
        from Bio import SeqIO
        path = '/path/to
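The `sys.argv` approach can be sketched in isolation as follows. The `main()` below is a hypothetical stand-in for a script's argparse-based entry point, not the suffix tree script itself; the argument names and values are made up for illustration.

```python
import argparse
import sys


def main():
    """Hypothetical stand-in for a script's argparse-based main()."""
    parser = argparse.ArgumentParser()
    parser.add_argument("path")
    parser.add_argument("--min-length", type=int, default=0)
    args = parser.parse_args()  # with no arguments, reads sys.argv[1:]
    return args.path, args.min_length


# Temporarily replace sys.argv so that main() sees the desired arguments,
# then restore it so the rest of the session is unaffected.
saved_argv = sys.argv
try:
    sys.argv = ["script.py", "reads.fasta", "--min-length", "50"]
    result = main()
finally:
    sys.argv = saved_argv
```

If the script's `main()` can be edited after all, passing a list straight to `parser.parse_args([...])` avoids touching `sys.argv`.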

Algorithm to decide cut-off for collapsing this tree?

I have a Newick tree that is built by comparing the similarity (Euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of putative DNA regulatory motifs, which are DNA sequences 4-9 bp long. An interactive version of the tree is up on iTOL (here), which you can freely play with; just press "update tree" after setting your parameters. My specific goal: to collapse the motifs (tips/terminal nodes/leaves) together if their average distance to the nearest parent clade is < X (ETE2 Python package). This is biologically interesting since some of the gene regulatory DNA motifs may be