bioinformatics

Why is collections.Counter so slow?

家住魔仙堡 submitted on 2019-11-27 22:56:02
I'm trying to solve a basic Rosalind problem: counting the nucleotides in a given sequence and returning the results in a list. For those not familiar with bioinformatics, it's just counting the number of occurrences of 4 different characters ('A','C','G','T') inside a string. I expected collections.Counter to be the fastest method (first because it is claimed to be high-performance, and second because I saw a lot of people using it for this specific problem). But to my surprise this method is the slowest! I compared three different methods, using timeit and running two types of experiments:
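The excerpt does not show the three methods that were compared, so the following is only a minimal sketch of two plausible candidates. The function names and the sample sequence are illustrative assumptions. One likely reason for the gap is that `str.count` does one fast scan per base, whereas `Counter` has to hash and update a dictionary entry for every character.

```python
from collections import Counter

BASES = "ACGT"

def count_with_counter(seq):
    # Builds a full character histogram, then picks out the four bases.
    counts = Counter(seq)
    return [counts[base] for base in BASES]

def count_with_str_count(seq):
    # Four separate scans of the string, one per base.
    return [seq.count(base) for base in BASES]

# Hypothetical sample sequence, used only for illustration.
sequence = "AGCTTTTCATTCTGACTGCA"
print(count_with_counter(sequence))
print(count_with_str_count(sequence))
```

Both functions return the counts in A, C, G, T order. To compare speed on your own data, each one can be wrapped in `timeit.timeit(lambda: count_with_counter(sequence), number=...)`; the result will depend on sequence length and Python version.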

Find points over and under the confidence interval when using geom_stat / geom_smooth in ggplot2

此生再无相见时 submitted on 2019-11-27 18:56:00
Question: I have a scatter plot, and I want to know how I can find the genes above and below the confidence interval lines.
EDIT: Reproducible example:
    library(ggplot2)
    # dummy data
    df <- mtcars[,c("mpg","cyl")]
    # plot
    ggplot(df,aes(mpg,cyl)) + geom_point() + geom_smooth()
Answer 1: I had to take a deep dive into the GitHub repo, but I finally got it. In order to do this you need to know how stat_smooth works. In this specific case the loess function is called to do the smoothing (the different smoothing functions

Collapse intersecting regions

ぐ巨炮叔叔 submitted on 2019-11-27 12:55:27
I am trying to find a way to collapse rows with intersecting ranges, denoted by "start" and "stop" columns, and record the collapsed values into new columns. For example, I have this data frame:
    my.df <- data.frame(chrom=c(1,1,1,1,14,16,16),
                        name=c("a","b","c","d","e","f","g"),
                        start=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)),
                        stop=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))
    chrom name    start     stop
        1    a        0    71200
        1    b    70001    71200
        1    c    70203    80001
        1    d    70060    71051
       14    e    40004    42004
       16    f 50000872 50000890
       16    g 50000872 51000952
And I am trying to find the
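The desired output is cut off in the excerpt, so the following Python sketch only illustrates the general technique, under the assumption that rows on the same chromosome whose ranges intersect should be merged into one row spanning all of them, with the merged names recorded. The function name `collapse_ranges` and the output layout are hypothetical.

```python
def collapse_ranges(rows):
    """Merge intersecting (chrom, name, start, stop) rows per chromosome."""
    collapsed = []
    # Sorting by chromosome and start lets one pass detect every overlap.
    for chrom, name, start, stop in sorted(rows, key=lambda r: (r[0], r[2], r[3])):
        last = collapsed[-1] if collapsed else None
        if last and last["chrom"] == chrom and start <= last["stop"]:
            # Overlaps the current merged range: extend it and record the name.
            last["stop"] = max(last["stop"], stop)
            last["names"].append(name)
        else:
            collapsed.append({"chrom": chrom, "start": start, "stop": stop, "names": [name]})
    return collapsed

# The same rows as in the data frame above.
rows = [
    (1, "a", 0, 71200), (1, "b", 70001, 71200), (1, "c", 70203, 80001),
    (1, "d", 70060, 71051), (14, "e", 40004, 42004),
    (16, "f", 50000872, 50000890), (16, "g", 50000872, 51000952),
]
for merged in collapse_ranges(rows):
    print(merged)
```

In R, the same idea is usually expressed with Bioconductor range objects rather than a hand-written loop; the loop is shown here only to make the merge rule explicit.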

Finding overlap in ranges with R

孤街醉人 submitted on 2019-11-27 11:50:59
I have two data.frames, each with three columns: chrom, start and stop. Let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row, by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop. Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal; it just ends up making things more readable given this particular solution. rangesA:
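The containment rule stated above (same chromosome, start not earlier, stop not later) can be sketched in Python as follows. The function name, the tuple layout `(chrom, start, stop)`, and the sample data are assumptions made for illustration; the original R code is not visible in the excerpt.

```python
def find_containing(ranges_a, ranges_b):
    """For each range in ranges_a, return the index of the first range in
    ranges_b that fully contains it, or None if there is no such range."""
    # Group B ranges by chromosome so each A range is only compared
    # against candidates on the same chromosome.
    by_chrom = {}
    for idx, (chrom, start, stop) in enumerate(ranges_b):
        by_chrom.setdefault(chrom, []).append((start, stop, idx))

    result = []
    for chrom, start, stop in ranges_a:
        match = None
        for b_start, b_stop, idx in by_chrom.get(chrom, []):
            if start >= b_start and stop <= b_stop:
                match = idx
                break
        result.append(match)
    return result

# Hypothetical example data.
ranges_a = [("chr1", 100, 200), ("chr1", 150, 900), ("chr2", 10, 20)]
ranges_b = [("chr1", 50, 300), ("chr2", 15, 40), ("chr1", 100, 1000)]
print(find_containing(ranges_a, ranges_b))
```

This is a straightforward scan per chromosome; for large inputs an interval tree or a sorted sweep would scale better.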

Dataframe processing

倖福魔咒の submitted on 2019-11-27 08:36:53
Question: I have a dataframe, which I read with
    Match <- read.table("Match.txt", sep="", fill =T, stringsAsFactors = FALSE, quote = "", header = F)
and which looks like this:
    > ab
    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
    1 Inspecting sequence ID chr1:173244300-173244500 NA NA
    2 V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg
    3 V$ATF3_Q6 | 34 (-) | 0.788 | 0.655 | agggaaCGACAcag
    4 V$ATF3_Q6 | 102 (+) | 0.738 | 0.685 | cccTGAGCttagga
    5 V$CEBPB_01 | 24 (+) | 0.950 | 0.882 | ccatcagGGAAGgg
    72 V$YY1_01 | 117 (+) | 0.996

Itertools to generate scrambled combinations

心已入冬 submitted on 2019-11-27 08:26:26
Question: What I want to do is obtain all combinations and all unique permutations of each combination. The combinations-with-replacement function only gets me so far:
    from itertools import combinations_with_replacement as cwr
    foo = list(cwr('ACGT', n)) ## n is an integer
My intuition on how to move forward is to do something like this:
    import numpy as np
    from itertools import permutations as perm
    bar = []
    for x in foo:
        carp = list(perm(x))
        for i in range(len(carp)):
            for j in range(i+1,len(carp)):
                if
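One way to avoid the pairwise comparison loop is to deduplicate each combination's permutations with a set. This is a minimal sketch, not the answer from the original thread; the function name `scrambled_combinations` is an assumption.

```python
from itertools import combinations_with_replacement, permutations, product

def scrambled_combinations(alphabet, n):
    """Map each combination (with replacement) to its unique permutations."""
    result = {}
    for combo in combinations_with_replacement(alphabet, n):
        # set() drops the duplicate orderings produced by repeated letters.
        result[combo] = sorted(set(permutations(combo)))
    return result

groups = scrambled_combinations("ACGT", 2)
print(groups[("A", "C")])  # both orderings of A and C
print(groups[("A", "A")])  # a single ordering

# Taken together, the unique permutations of all combinations are exactly
# the n-length strings over the alphabet, which product() yields directly.
all_orderings = {p for perms in groups.values() for p in perms}
print(all_orderings == set(product("ACGT", repeat=2)))
```

If the grouping by combination is not needed, `itertools.product(alphabet, repeat=n)` alone gives the same set of sequences without generating any duplicates.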

Pyparsing: extract variable length, variable content, variable whitespace substring

筅森魡賤 submitted on 2019-11-27 07:20:03
Question: I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to another number. Humans typed these in over two decades. Various conventions of whitespace and modifiers are included. Below is my Backus-Naur form so far, and two example records. Just for prostatectomies, we're looking at upwards of a thousand cases. I am using pyparsing because I'm learning Python, and have no fond memories
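The grammar and the example records are not shown in the excerpt, so the sketch below swaps in a plain regular expression (Python's `re` module, not pyparsing) purely to illustrate the rule described: the word Gleason, then two numbers joined by `+` that add up to a third. The pattern, the 40-character limit on intervening text, and the sample strings are all assumptions for illustration.

```python
import re

# "gleason", up to 40 non-digit characters of modifiers, then "a + b = c"
# with flexible whitespace. The 40-character window is an arbitrary choice.
GLEASON = re.compile(
    r"gleason\D{0,40}?(\d{1,2})\s*\+\s*(\d{1,2})\s*=\s*(\d{1,2})",
    re.IGNORECASE,
)

def extract_gleason(text):
    """Return (primary, secondary, total) tuples whose arithmetic checks out."""
    scores = []
    for match in GLEASON.finditer(text):
        primary, secondary, total = (int(group) for group in match.groups())
        if primary + secondary == total:
            scores.append((primary, secondary, total))
    return scores

# Hypothetical records, not taken from the original question.
print(extract_gleason("Adenocarcinoma, Gleason score 3 + 4 = 7, bilateral."))
print(extract_gleason("GLEASON  GRADE 4+3=7 (tertiary pattern 5)"))
print(extract_gleason("No carcinoma identified."))
```

Checking that the two numbers sum to the third filters out some typing errors, but real records may use conventions this pattern does not cover.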

Why can't python find some modules when I'm running CGI scripts from the web?

坚强是说给别人听的谎言 submitted on 2019-11-27 05:20:53
I have no idea what could be the problem here: I have some modules from Biopython which I can import easily when using the interactive prompt or executing Python scripts via the command line. The problem is, when I try to import the same Biopython modules in a web-executable CGI script, I get an "ImportError: No module named Bio". Any ideas here?
Here are a couple of possibilities: Apache (on Unix) generally runs as a different user, and with a different environment, to python from the command line. Try making a small script that just prints out sys.version and sys.prefix, and compare the
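The small diagnostic script suggested above might look like the sketch below. Printing `sys.version` and `sys.prefix` comes from the answer; also printing `sys.executable` and `sys.path` is an added assumption about what is useful to compare. Run it once from the command line and once through the web server, then compare the two outputs.

```python
#!/usr/bin/env python
import sys

def environment_report():
    """Collect interpreter details that often differ between shell and CGI."""
    lines = [
        "sys.version: " + sys.version.replace("\n", " "),
        "sys.prefix: " + sys.prefix,
        "sys.executable: " + str(sys.executable),
        "sys.path:",
    ]
    lines.extend("  " + entry for entry in sys.path)
    return "\n".join(lines)

if __name__ == "__main__":
    # A CGI response needs a header line followed by a blank line.
    print("Content-Type: text/plain")
    print()
    print(environment_report())
```

If the two reports show different prefixes or search paths, the module is probably installed for one interpreter or user but not visible to the one the web server runs.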

R extract part of string

寵の児 submitted on 2019-11-27 01:34:58
Question: I have a question about extracting part of a string. For example, I have a string like this:
    a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
I need to extract everything between GN= and the following semicolon, so here it will be NOC2L. Is that possible? Note: This is the INFO column from the VCF file format. GN is Gene Name, so we want to extract the gene name from the INFO column.
Answer 1: Try this:
    sub(".*?GN=(.*?);.*", "\\1", a)
    # [1] "NOC2L"
Answer 2: Assuming semicolons separate
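For comparison, the same extraction can be sketched in Python. The helper name `get_info_field` is hypothetical; it looks for a `KEY=value` field delimited by semicolons, so it also works when the key is the first or last field in the string.

```python
import re

info = "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

def get_info_field(info_string, key):
    """Return the value of KEY=value in a semicolon-separated string, or None."""
    # (?:^|;) anchors the key at a field boundary so that, for example,
    # "N" does not match inside "GN"; [^;]* stops at the next semicolon.
    match = re.search(r"(?:^|;)" + re.escape(key) + r"=([^;]*)", info_string)
    return match.group(1) if match else None

print(get_info_field(info, "GN"))  # NOC2L
```

Returning `None` for a missing key makes absent fields easy to distinguish from empty ones.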

How to plot a gene graph for a DNA sequence say ATGCCGCTGCGC?

扶醉桌前 submitted on 2019-11-26 22:35:21
Question: I need to generate a random walk based on the DNA sequence of a virus, given its base pair sequence of 2k base pairs. The sequence looks like "ATGCGTCGTAACGT". The path should turn right for an A, left for a T, go upwards for a G and downwards for a C. How can I use either MATLAB, Mathematica or SPSS for this purpose?
Answer 1: I did not previously know of Mark McClure's blog about Chaos Game representation of gene sequences, but it reminded me of an article by Jose Manuel Gutiérrez (The
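The question asks about MATLAB, Mathematica or SPSS; the sketch below uses Python instead, only to make the walk rule concrete (A steps right, T left, G up, C down). The function name `dna_walk` and the choice to leave the position unchanged for any other symbol are assumptions.

```python
# Unit step for each base, following the rule stated in the question.
STEPS = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}

def dna_walk(sequence):
    """Return the list of (x, y) points visited, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in sequence.upper():
        # Assumption: symbols other than A, T, G, C (for example N) do not move.
        dx, dy = STEPS.get(base, (0, 0))
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

path = dna_walk("ATGCGTCGTAACGT")
print(path[-1])   # final position of the walk
print(len(path))  # one point per base, plus the starting point
```

The resulting list of points can be passed to any plotting tool as x and y coordinates to draw the walk.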