bioinformatics

How to use matchPattern() to find certain amino acids in a file with many sequences (.fasta) in R

可紊 posted on 2019-12-03 21:40:32
I have a file (mydata.txt) that contains many exon sequences in FASTA format. I want to find the start ('atg') and stop ('taa', 'tga', 'tag') codons in each DNA sequence, taking the reading frame into account. I tried using matchPattern (a function from the Biostrings R package) to find these codons. As an example, mydata.txt could be:

>a
atgaatgctaaccccaccgagtaa
>b
atgctaaccactgtcatcaatgcctaa
>c
atggcatgatgccgagaggccagaataggctaa
>d
atggtgatagctaacgtatgctag
>e
atgccatgcgaggagccggctgccattgactag

file=read.fasta(file="mydata.txt")
matchPattern( "atg" , file)

Note: read.fasta is a function in the seqinr package.
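For illustration only, here is a minimal Python sketch of the frame-aware scan the question describes. It is not the Biostrings approach and does not use matchPattern. The function names `parse_fasta` and `codons_in_frame` are hypothetical, and the sketch assumes lowercase DNA read in frame 0 unless another frame is passed.

```python
STOP_CODONS = {"taa", "tga", "tag"}

def parse_fasta(text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.lower()
    return records

def codons_in_frame(seq, frame=0):
    """Return (start_positions, stop_positions), 0-based, reading codons in the given frame."""
    starts, stops = [], []
    # Step three bases at a time so only codons in this frame are considered.
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "atg":
            starts.append(i)
        elif codon in STOP_CODONS:
            stops.append(i)
    return starts, stops

records = parse_fasta(">a\natgaatgctaaccccaccgagtaa\n>c\natggcatgatgccgagaggccagaataggctaa\n")
results = {name: codons_in_frame(seq) for name, seq in records.items()}
```

Stepping by three is what distinguishes this from a plain substring search: an 'atg' that straddles two codons of the chosen frame is ignored.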

“average length of the sequences in a fasta file”: Can you improve this Erlang code?

旧时模样 posted on 2019-12-03 13:27:50
Question: I'm trying to get the mean length of FASTA sequences using Erlang. A FASTA file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)

I tried to answer this question using the following Erlang code: -module(golf).
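The Erlang code is cut off in this excerpt, so the following is only a Python sketch of the computation being asked about, not a reconstruction of the asker's module. The function name `mean_fasta_length` is hypothetical. It assumes every line starting with ">" opens a new record and that sequence lines may wrap.

```python
def mean_fasta_length(lines):
    """Return the mean sequence length over all records in an iterable of FASTA lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)          # a header line starts a new record
        elif lengths:
            lengths[-1] += len(line)   # wrapped sequence lines add to the current record
    if not lengths:
        raise ValueError("no FASTA records found")
    return sum(lengths) / len(lengths)

example = [">title1", "ACGT", "AC", ">title2", "AC"]
mean_length = mean_fasta_length(example)
```

Because the function takes any iterable of lines, an open file handle can be passed directly, so the whole file never has to be held in memory.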

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

落爺英雄遲暮 posted on 2019-12-03 12:47:36
Question: I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods in order to help filter out the false positives. (The problem is in bioinformatics: classifying protein sequences as being neuropeptide precursor sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor.) Now, the classifiers have roughly similar performance metrics

Traceback in Smith-Waterman algorithm with affine gap penalty

百般思念 posted on 2019-12-03 12:37:07
Question: I'm trying to implement the Smith-Waterman algorithm for local sequence alignment using the affine gap penalty function. I think I understand how to initialize and compute the matrices required for calculating alignment scores, but I am clueless as to how to then trace back to find the alignment. To generate the 3 matrices required I have the following code:

for j in range(1, len2):
    for i in range(1, len1):
        fxOpen = F[i][j-1] + gap
        xExtend = Ix[i][j-1] + extend
        Ix[i][j] = max(fxOpen, xExtend)
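One common way to make the traceback tractable is to store, for every cell of each of the three matrices, which matrix the best score came from, and then follow those pointers back from the best-scoring cell. The sketch below assumes the usual three-state formulation (a match matrix, called M here rather than F, plus Ix and Iy). Its scoring defaults are hypothetical, and it is not a completion of the asker's truncated code.

```python
NEG_INF = float("-inf")

def smith_waterman_affine(a, b, match=2, mismatch=-1, gap=-3, extend=-1):
    """Local alignment with affine gaps; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]          # alignment ends in a (mis)match
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with a gap in a
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with a gap in b
    # One pointer table per matrix: which matrix ("M", "X", "Y") the score came from.
    pM = [[None] * (m + 1) for _ in range(n + 1)]
    pX = [[None] * (m + 1) for _ in range(n + 1)]
    pY = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            opened, extended = M[i][j - 1] + gap, Ix[i][j - 1] + extend
            Ix[i][j], pX[i][j] = (opened, "M") if opened >= extended else (extended, "X")
            opened, extended = M[i - 1][j] + gap, Iy[i - 1][j] + extend
            Iy[i][j], pY[i][j] = (opened, "M") if opened >= extended else (extended, "Y")
            s = match if a[i - 1] == b[j - 1] else mismatch
            prev, state = max((M[i - 1][j - 1], "M"),
                              (Ix[i - 1][j - 1], "X"),
                              (Iy[i - 1][j - 1], "Y"))
            if prev + s > 0:                # otherwise the cell stays 0 with no pointer
                M[i][j], pM[i][j] = prev + s, state
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    # Traceback: start in M at the best cell, stop at an M cell with no pointer.
    (i, j), state = best_pos, "M"
    out_a, out_b = [], []
    while True:
        if state == "M":
            if pM[i][j] is None:
                break
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            state = pM[i][j]; i -= 1; j -= 1
        elif state == "X":
            out_a.append("-"); out_b.append(b[j - 1])
            state = pX[i][j]; j -= 1
        else:
            out_a.append(a[i - 1]); out_b.append("-")
            state = pY[i][j]; i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

result = smith_waterman_affine("ACGTACGT", "ACGACGT")
```

The traceback has to track the current matrix as well as the current cell: a step taken while in Ix or Iy moves along only one sequence, and the stored pointer says whether the gap is still being extended or was opened from M.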

Which functional programming languages have bioinformatics libraries? [closed]

北城以北 posted on 2019-12-03 11:47:09
Question: Which functional programming languages have bioinformatics libraries easily available? (Don't include multi-paradigm languages such as Ruby.) Update: Listing which major functional programming languages don't currently have easy access to bioinformatics libraries is also welcome.

How to call module written with argparse in iPython notebook

China☆狼群 posted on 2019-12-03 10:49:50
Question: I am trying to pass BioPython sequences to Ilya Stepanov's implementation of Ukkonen's suffix tree algorithm in the IPython notebook environment. I am stumbling on the argparse component. I have never had to deal directly with argparse before. How can I use this without rewriting main()? By the by, this writeup of Ukkonen's algorithm is fantastic. Answer 1: I've had a similar problem before, but using optparse instead of argparse. You don't need to change anything in the original script, just
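The answer is cut off in this excerpt. One commonly used approach, sketched here as an assumption rather than as the answer's actual text, is to replace sys.argv temporarily before calling the script's main(). The `main` function below is a hypothetical stand-in for an argparse-based entry point, not the suffix-tree script, and `call_with_args` is likewise a hypothetical helper.

```python
import argparse
import sys

def main():
    """Hypothetical stand-in for a script's main() that parses its own arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("text")
    parser.add_argument("--verbose", action="store_true")
    return parser.parse_args()   # with no argument list, argparse reads sys.argv[1:]

def call_with_args(func, argv):
    """Call func() with sys.argv temporarily replaced, then restore the original."""
    saved = sys.argv
    sys.argv = ["script"] + list(argv)   # element 0 plays the role of the program name
    try:
        return func()
    finally:
        sys.argv = saved

args = call_with_args(main, ["ACGT", "--verbose"])
```

Restoring sys.argv in a finally block matters in a notebook, where the kernel's own command-line arguments would otherwise stay overwritten for later cells.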

Trace patterns such that each node is visited only once (Eulerian path) using OpenCV

ぃ、小莉子 posted on 2019-12-03 09:10:20
Here is a problem I have been trying to solve for a full year, without success, so I am asking the Stack Overflow experts for help and a concrete solution. My problem statement: I have been working with some design patterns, and I want to determine programmatically whether an Eulerian path exists in each of them (as shown in the GIFs below). Below are the patterns and the way I want to draw them (GIFs). What I want to achieve: Given the design pattern images as input, I want to trace each design pattern image in a single stroke, as shown in the GIFs (the GIF animations are just examples of how the patterns is
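Once a pattern has been reduced to a graph of junctions and strokes, whether a single-stroke trace exists can be checked with the standard Eulerian path condition: the graph must be connected and have either zero or two vertices of odd degree. The sketch below assumes an undirected edge list as input. Extracting that graph from an image is not covered, and the function name `has_eulerian_path` is hypothetical.

```python
from collections import defaultdict

def has_eulerian_path(edges):
    """Return True if the undirected multigraph given by (u, v) edges has an Eulerian path."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    if not adjacency:
        return True   # no edges to trace
    # Condition 1: zero or two vertices of odd degree.
    odd = sum(1 for neighbours in adjacency.values() if len(neighbours) % 2 == 1)
    if odd not in (0, 2):
        return False
    # Condition 2: every vertex that has an edge is reachable from any other.
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(adjacency)

square_with_diagonal = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
traceable = has_eulerian_path(square_with_diagonal)
```

Note that an Eulerian path uses every edge exactly once and may pass through a vertex several times, which is what drawing a figure in one stroke requires.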

Algorithm to decide cut-off for collapsing this tree?

家住魔仙堡 posted on 2019-12-03 09:06:52
Question: I have a Newick tree that is built by comparing the similarity (Euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of putative DNA regulatory motifs, which are DNA sequences 4-9 bp long. An interactive version of the tree is up on iTOL (here), which you can freely play with; just press "update tree" after setting your parameters. My specific goal: to collapse the motifs (tips/terminal nodes/leaves) together if their average distance to the nearest parent clade is < X (ETE2 Python

Organizing the output of my shell script into tables within the text file

混江龙づ霸主 posted on 2019-12-03 08:46:53
I am working with a Unix shell script that does genome construction and then creates a phylogeny. Depending on the genome assembler you use, the final output (the phylogeny) may change. I wish to compare the effects of using various genome assemblers. I have developed some metrics to compare them on, but I need help organizing them so I can run useful analyses. I would like to import my data into Excel in columns. This is the script I am using to output data:

echo "Enter the size (Mb or Gb) of your data set:"
read SIZEOFDATASET
echo "The size of your data set is $SIZEOFDATASET"
echo "Size of Data

Algorithm help! Fast algorithm in searching for a string with its partner

≯℡__Kan透↙ posted on 2019-12-03 07:55:49
I am looking for a fast algorithm for searching a huge string (an organism's genome sequence, composed of hundreds of millions to billions of chars). Only 4 chars {A,C,G,T} are present in this string, and "A" can only pair with "T" while "C" pairs with "G". I am searching for two substrings that can pair with one another in antiparallel orientation, with the length of each substring constrained to {minLen, maxLen} and the length of the interval between them constrained to {intervalMinLen, intervalMaxLen}. For example, the string is: ATCAG GACCA TACGC CTGAT Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9,
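To pin down what pairing in antiparallel orientation means, here is a brute-force Python baseline. It is not the fast algorithm the question asks for, only a reference that a faster method could be checked against. It assumes the two substrings have equal length and pair perfectly. The function names are hypothetical, and the interval maximum of 10 in the usage line is an assumed value, because the example's constraints are cut off.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Return the strand that pairs antiparallel with s (A-T, C-G, read in reverse)."""
    return s.translate(COMPLEMENT)[::-1]

def find_pairs(genome, min_len, max_len, interval_min, interval_max):
    """Return (left_start, right_start, length) for every perfectly pairing substring pair."""
    hits = []
    n = len(genome)
    for length in range(min_len, max_len + 1):
        for i in range(n - length + 1):
            target = reverse_complement(genome[i:i + length])
            # The right substring must start interval_min..interval_max bases
            # after the left one ends, and must fit inside the genome.
            first = i + length + interval_min
            last = min(i + length + interval_max, n - length)
            for j in range(first, last + 1):
                if genome[j:j + length] == target:
                    hits.append((i, j, length))
    return hits

# The example string with its spaces removed; interval_max=10 is an assumed value.
hits = find_pairs("ATCAGGACCATACGCCTGAT", 4, 5, 9, 10)
```

The running time grows with the genome length times the number of allowed lengths times the interval window, so on genome-scale input this is useful only as a correctness check for an indexed approach, for example one that hashes fixed-length seeds.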