bioinformatics

How to randomly extract FASTA sequences using Python?

こ雲淡風輕ζ submitted on 2019-11-26 22:10:14
Question: I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I randomly extract sequences? For example, I would like to randomly select 2 sequences out of the total. The existing tools extract by percentage, not by number of sequences. Can anyone help me?
A.fasta:
>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
>chr1
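One way to pick a fixed number of records, rather than a percentage, is to read every record into memory and call `random.sample`. The sketch below assumes the file fits in memory; the helper names `read_fasta` and `sample_fasta` and the `seed` parameter are illustrative and do not come from the original thread.

```python
import random


def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_fasta(lines, n, seed=None):
    """Return n records drawn uniformly at random, without replacement."""
    records = read_fasta(lines)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(records, n)  # raises ValueError if n > len(records)


# The three complete records shown in the question.
example = """>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
""".splitlines()

for header, seq in sample_fasta(example, 2, seed=0):
    print(f">{header}\n{seq}")
```

For a real file, passing an open file handle in place of `example` should work the same way, since the parser only iterates over lines.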

Why is collections.Counter so slow?

纵饮孤独 submitted on 2019-11-26 21:14:47
Question: I'm trying to solve a basic Rosalind problem: counting the nucleotides in a given sequence and returning the results in a list. For those not familiar with bioinformatics, it just means counting the number of occurrences of 4 different characters ('A', 'C', 'G', 'T') in a string. I expected collections.Counter to be the fastest method (first because it is claimed to be high-performance, and second because I saw a lot of people using it for this specific problem). But to my surprise this method
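The comparison described in the question can be reproduced with a small timing sketch. It contrasts `collections.Counter` with four calls to `str.count`; the function names and the test sequence are illustrative assumptions, and absolute timings will differ by machine and Python version.

```python
from collections import Counter
import timeit


def count_with_counter(seq):
    """Tally every character once, then read off the four bases."""
    counts = Counter(seq)
    return [counts[base] for base in "ACGT"]


def count_with_str_count(seq):
    """Scan the string four times, once per base."""
    return [seq.count(base) for base in "ACGT"]


seq = "AACGTT"
print(count_with_counter(seq))
print(count_with_str_count(seq))

# Time both approaches on a longer, repeated sequence.
long_seq = seq * 10000
for fn in (count_with_counter, count_with_str_count):
    elapsed = timeit.timeit(lambda: fn(long_seq), number=20)
    print(f"{fn.__name__}: {elapsed:.4f} s")
```

With only four symbols to look up, the repeated `str.count` scans are a reasonable thing to benchmark against `Counter`, which has to tally every character individually.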

Why can't python find some modules when I'm running CGI scripts from the web?

拥有回忆 submitted on 2019-11-26 11:35:11
Question: I have no idea what the problem could be here. I have some Biopython modules that I can import easily from the interactive prompt or when executing Python scripts from the command line. The problem is that when I try to import the same Biopython modules in a web-executable CGI script, I get "ImportError: No module named Bio". Any ideas?
Answer 1: Here are a couple of possibilities: Apache (on Unix) generally runs as a different user, and with a different environment, to python from
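A quick way to check whether the web server runs a different interpreter or search path is to have a CGI script report them. The following is a minimal diagnostic sketch, not part of the original answer; the function name `environment_report` is hypothetical.

```python
#!/usr/bin/env python3
import sys


def environment_report():
    """Describe which interpreter is running and where it looks for modules."""
    lines = [
        f"executable: {sys.executable}",
        f"version: {sys.version.split()[0]}",
        "sys.path:",
    ]
    lines.extend(f"  {entry}" for entry in sys.path)
    try:
        import Bio  # Biopython's top-level package
        lines.append(f"Bio found at: {Bio.__file__}")
    except ImportError as exc:
        lines.append(f"Bio import failed: {exc}")
    return "\n".join(lines)


# A CGI response needs a header block followed by a blank line.
print("Content-Type: text/plain")
print()
print(environment_report())
```

Running the same script from the command line and through the web server, then comparing the two outputs, should show whether the interpreter or `sys.path` differs between the two environments.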

Find the intersection of overlapping ranges in two tables using data.table function foverlaps

只愿长相守 submitted on 2019-11-26 07:39:40
Question: I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are "bed" files, which have zero-based start coordinates and one-based end positions for features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would
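To make the intended result concrete, here is a small Python sketch of the same idea: intersect every pair of ranges on the same chromosome, then collapse overlapping results. It is not `foverlaps` and not the thread's solution; the function name and the sample rows are illustrative, and the pairwise loop is quadratic, so it only suits small inputs rather than millions of rows.

```python
from collections import defaultdict


def intersect_and_collapse(bed_a, bed_b):
    """Intersect two lists of (chrom, start, end) rows and merge overlaps.

    Coordinates follow the BED convention: zero-based start, end excluded.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in bed_b:
        by_chrom[chrom].append((start, end))

    # Pairwise intersection of ranges that share a chromosome.
    hits = []
    for chrom, a_start, a_end in bed_a:
        for b_start, b_end in by_chrom.get(chrom, ()):
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # keep only non-empty overlaps
                hits.append((chrom, start, end))

    # Collapse intersections that overlap each other into a single row.
    merged = []
    for chrom, start, end in sorted(hits):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up rows for illustration only.
bed_a = [("chr1", 5, 15), ("chr1", 20, 30), ("chr2", 0, 10)]
bed_b = [("chr1", 10, 25), ("chr1", 12, 22), ("chr2", 50, 60)]
print(intersect_and_collapse(bed_a, bed_b))
# [('chr1', 10, 15), ('chr1', 20, 25)]
```

The merge step uses a strict `<`, so ranges that merely touch (one ends where the next begins) stay separate; changing it to `<=` would merge those as well.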

Dictionary style replace multiple items

偶尔善良 submitted on 2019-11-26 01:33:59
Question: I have a large data.frame of character data that I want to convert based on what is commonly called a dictionary in other languages. Currently I am going about it like so:
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), snp2 = c("AA", "AT", "AG", "AA"), snp3 = c(NA, "GG", "GG", "GC"), stringsAsFactors=FALSE)
foo <- replace(foo, foo == "AA", "0101")
foo <- replace(foo, foo == "AC", "0102")
foo <- replace(foo, foo == "AG", "0103")
This works fine, but it is
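For comparison, this is roughly what the "dictionary" approach looks like in a language that has one built in. The Python sketch below mirrors the question's data with plain lists; it is an analogy, not an R solution, and the names `codes` and `recoded` are illustrative.

```python
# Lookup table holding the same three substitutions as the replace() calls.
codes = {"AA": "0101", "AC": "0102", "AG": "0103"}

# The question's data.frame, as a dict of columns (None stands in for NA).
foo = {
    "snp1": ["AA", "AG", "AA", "AA"],
    "snp2": ["AA", "AT", "AG", "AA"],
    "snp3": [None, "GG", "GG", "GC"],
}

# Values missing from the lookup table are left unchanged.
recoded = {
    column: [codes.get(value, value) for value in values]
    for column, values in foo.items()
}

for column, values in recoded.items():
    print(column, values)
```

One lookup table replaces the chain of per-value calls, so adding a new code means adding one entry to `codes`.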

Remove part of string after “.”

谁说我不能喝 submitted on 2019-11-25 19:42:41
Question: I am working with NCBI Reference Sequence accession numbers like the variable a:
a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")
To get information from the biomart package I need to remove the .1, .2 etc. after the accession numbers. I normally do this with this code:
b <- sub("..*", "", a)
# [1] "" "" "" "" "" ""
But as you can see, this isn't the correct way for this variable. Can anyone help me with this?
Answer: You just need to escape the period:
a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")
gsub("\\.
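The same pitfall exists in other regex engines: an unescaped `.` matches any character, so `..*` consumes the whole string. A minimal Python sketch of the difference, using the accession numbers from the question (the variable names are illustrative):

```python
import re

accessions = [
    "NM_020506.1", "NM_020519.1", "NM_001030297.2",
    "NM_010281.2", "NM_011419.3", "NM_053155.2",
]

# Unescaped: "." matches any character, so the whole string is removed.
unescaped = [re.sub(r"..*", "", acc) for acc in accessions]

# Escaped: "\." matches a literal period, so only the version suffix goes.
stripped = [re.sub(r"\..*", "", acc) for acc in accessions]

print(unescaped)
print(stripped)
```

When no pattern matching is needed, splitting on the first period with `acc.split(".", 1)[0]` gives the same result without a regex.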