bioinformatics

Recursive Generators in Python

我们两清 posted on 2019-11-30 09:09:35
I wrote a function that returns a generator containing every unique combination of substrings of a given length that contains more than n elements from a primary string. As an illustration: if I have 'abcdefghi', a probe length of two, and a threshold of 4 elements per list, I'd like to get:

    ['ab', 'cd', 'ef', 'gh']
    ['ab', 'de', 'fg', 'hi']
    ['bc', 'de', 'fg', 'hi']

My first attempt at this problem involved returning a list of lists. This ended up exhausting the computer's memory. As a crude secondary solution, I created a generator that does something similar. The problem is that I
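The excerpt above is cut off, so the exact rules the asker wants are not known. As a minimal sketch, assuming the goal is every left-to-right list of exactly n non-overlapping substrings of length k, a recursive generator could look like the following. The name `probe_sets` and its parameters are hypothetical, and the code is written for Python 3.

```python
def probe_sets(s, k, n, start=0):
    """Lazily yield lists of n non-overlapping length-k substrings of s,
    taken in left-to-right order and starting at or after index `start`."""
    if n == 0:
        yield []
        return
    # Last start index that still leaves room for n substrings of length k
    last = len(s) - n * k
    for i in range(start, last + 1):
        head = s[i:i + k]
        # Recurse on the remainder of the string, past the chosen substring
        for rest in probe_sets(s, k, n - 1, i + k):
            yield [head] + rest


for combo in probe_sets('abcdefghi', 2, 4):
    print(combo)
```

Because each list is produced on demand, nothing larger than one result is held in memory at a time. Under these assumptions the three lists quoted in the question appear among the results, together with others the question does not list.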

Finding matching keys in two large dictionaries and doing it fast

筅森魡賤 posted on 2019-11-30 07:25:30
Question: I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries. For example:

    myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
    myNames = { 'Actinobacter': '8924342' }

I want to print out the value for Actinobacter (8924342) since it matches a key in myRDP. The following code works, but is very slow:

    for key in myRDP:
        for jey in myNames:
            if key == jey:
                print key, myNames[key]

I've tried the following but it always results in a
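The nested loop compares every key with every other key, which is roughly 600k × 600k comparisons. One common alternative, sketched here in Python 3 with the toy data from the question, is to intersect the two key sets, since dictionary lookups are hash-based. The names `shared` and `matches` are illustrative only.

```python
# Toy stand-ins for the two 600k-entry dictionaries in the question
myRDP = {'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT'}
myNames = {'Actinobacter': '8924342'}

# In Python 3, dictionary key views support set operations directly
shared = myRDP.keys() & myNames.keys()

# Collect the value from myNames for every key present in both dictionaries
matches = {key: myNames[key] for key in shared}
for key, value in sorted(matches.items()):
    print(key, value)
```

In Python 2, which the question's `print` statement suggests, `set(myRDP) & set(myNames)` gives the same intersection.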

Read FASTA into a dataframe and extract subsequences of FASTA file

社会主义新天地 posted on 2019-11-30 06:56:47
Question: I have a small fasta file of DNA sequences which looks like this:

    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

Questions:

1) How can I read this fasta file into R as a dataframe where each row is a sequence record, the 1st column is the refseqID and the 2nd column is the sequence?
2) How to extract subsequence at (start,

Pandas: .groupby().size() and percentages

会有一股神秘感。 posted on 2019-11-29 18:45:49
Question: I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:

    Localization                           RNA level
    cytoplasm                              1 Non-expressed   7
                                           2 Very low       13
                                           3 Low             8
                                           4 Medium          6
                                           5 Moderate        8
                                           6 High            2
                                           7 Very high       6
    cytoplasm & nucleus                    1 Non-expressed   5
                                           2 Very low        8
                                           3 Low             2
                                           4 Medium         10
                                           5 Moderate       16
                                           6 High            6
                                           7 Very high       5
    cytoplasm & nucleus & plasma membrane  1 Non-expressed   6
                                           2 Very low        3
                                           3 Low             3
                                           4 Medium          7
                                           5 Moderate        8
                                           6 High            4
                                           7 Very high       1

What I want to do is to calculate the separate
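The question is truncated, but its title suggests the goal is a percentage for each RNA level within its Localization group. A minimal sketch of that idea, assuming the result of `.size()` is a Series with a two-level index, rebuilds the first two groups from the counts shown above and divides each count by its group total. The variable names are illustrative.

```python
import pandas as pd

levels = ["1 Non-expressed", "2 Very low", "3 Low", "4 Medium",
          "5 Moderate", "6 High", "7 Very high"]
index = pd.MultiIndex.from_product(
    [["cytoplasm", "cytoplasm & nucleus"], levels],
    names=["Localization", "RNA level"],
)
# Counts copied from the first two groups in the question
sizes = pd.Series([7, 13, 8, 6, 8, 2, 6, 5, 8, 2, 10, 16, 6, 5], index=index)

# Broadcast each group's total back onto its rows, then divide
group_totals = sizes.groupby(level="Localization").transform("sum")
percent = 100 * sizes / group_totals
print(percent.round(1))
```

Using `transform("sum")` keeps the original index, so the division lines up row by row without any manual reindexing.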

BioMart: Is there a way to easily change the species for all of my code?

落爺英雄遲暮 posted on 2019-11-29 16:56:56
Below is a small fraction of my code:

    library(biomaRt)
    ensembl_hsapiens <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
                              filters = "biotype",
                              values = "protein_coding",
                              mart = ensembl_hsapiens)
    paralogues[["hsapiens"]] <- getBM(attributes = c("external_gene_name", "hsapiens_paralog_associated_gene_name"),
                                      filters = "ensembl_gene_id",
                                      values = c(ensembl_gene_ID),
                                      mart = ensembl_hsapiens)

This bit of code only allows me to extract the paralogues for hsapiens. Is there a way for me to easily get the

Using regex to transform data into a dictionary in Python

给你一囗甜甜゛ posted on 2019-11-29 15:25:21
I have a dataset of FASTA-formatted sequences, basically like this:

    >pc284
    ATCGCGACTCGAC
    >pc293
    ACCCGACCTCAGC

I want to use each tag as a key in a dictionary and store the gene as the value. This is the code I have, but it really isn't doing anything:

    import re
    fileData = open('d.fasta', 'r')
    myDict = dict()
    for line in fileData:
        match = re.search('(\>)(\w+)(\r)(\w+)', line)
        if match:
            gene = match.group(3)
            myDict[gene[0]] = gene[1]
    print myDict

\r is not a valid character class; I think you meant to use \s instead. You can also reduce the groups if you don't use them. But most of
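The answer above is cut off, so what follows is not its code. It is a minimal Python 3 sketch, assuming each sequence sits on a single line directly under its header. The key point is that a header and its sequence are on different lines, so a pattern applied line by line can never see both. Matching against the whole text avoids that. The names `fasta_text` and `records` are illustrative.

```python
import re

# Inline stand-in for the contents of the question's d.fasta
fasta_text = ">pc284\nATCGCGACTCGAC\n>pc293\nACCCGACCTCAGC\n"

# Group 1 is the tag after '>', group 2 is the sequence on the next line;
# \s+ absorbs the line break (either \n or \r\n)
pairs = re.findall(r">(\w+)\s+(\w+)", fasta_text)
records = dict(pairs)
print(records)
```

Sequences wrapped over several lines would need a different approach, such as accumulating lines until the next header.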

Convert csv to Newick tree

江枫思渺然 posted on 2019-11-29 08:10:27
So I have a csv file where each line represents hierarchical data in the form:

    'Phylum','Class','Order','Family','Genus','Species','Subspecies','unique_gi'

I would like to convert this to the classic Newick tree format sans distances. Either a novel method or a Python package would be amazing. Thank you!

You could use some simple Python to build out a tree from the CSV, and then write it out to a Newick tree. Not sure if this is what you're trying to do or not.

    import csv
    from collections import defaultdict
    from pprint import pprint

    def tree():
        return defaultdict(tree)

    def tree_add(t, path):
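The answer's code is truncated above, so the following is an independent sketch of the same general idea and not a reconstruction of it. It assumes plain nested dicts, a three-level toy hierarchy in place of the eight columns, and names that need no Newick quoting (no spaces, commas or parentheses). The helpers `build_tree` and `to_newick` are hypothetical names.

```python
import csv
import io


def build_tree(rows):
    """Nest each row (a path from root to leaf) into a dict of dicts."""
    root = {}
    for row in rows:
        node = root
        for name in row:
            node = node.setdefault(name, {})
    return root


def to_newick(node):
    """Render the children of `node` as a comma-separated Newick fragment."""
    parts = []
    for name, children in node.items():
        if children:
            # Internal node: its subtree in parentheses, followed by its label
            parts.append("(" + to_newick(children) + ")" + name)
        else:
            parts.append(name)
    return ",".join(parts)


# Toy data in the question's single-quoted CSV style
sample = "'P1','C1','O1'\n'P1','C1','O2'\n'P1','C2','O3'\n"
rows = csv.reader(io.StringIO(sample), quotechar="'")
newick = "(" + to_newick(build_tree(rows)) + ");"
print(newick)
```

No branch lengths are written, matching the "sans distances" requirement. Internal labels such as the phylum and class names are kept as Newick node labels.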

How can I convert Ensembl ID to gene symbol in R?

假如想象 posted on 2019-11-29 02:52:19
Question: I have a data.frame containing Ensembl IDs in one column; I would like to find the corresponding gene symbols for the values of that column and add them to a new column in my data frame. I used biomaRt, but it couldn't find any of the Ensembl IDs! Here is my sample data ( df[1:2,] ):

    row.names  organism      gene
    41         Homo-Sapiens  ENSP00000335357
    115        Homo-Sapiens  ENSP00000227378

and I want to get something like this:

    row.names  organism      gene             id
    41         Homo-Sapiens  ENSP00000335357  CDKN3
    115        Homo-Sapiens

Read FASTA into a dataframe and extract subsequences of FASTA file

时间秒杀一切 posted on 2019-11-28 23:28:36
I have a small fasta file of DNA sequences which looks like this:

    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

Questions:

1) How can I read this fasta file into R as a dataframe where each row is a sequence record, the 1st column is the refseqID and the 2nd column is the sequence?
2) How to extract subsequence at (start, end) location?

    NM_000016 1 3 #"ACA"
    NM_000775 2 6 #"TAACC"
    NM_003820 3 5 #"TTC"

You should have a look
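The question asks for R and the answer is cut off, so the following is only a language-neutral illustration of the two steps, written in Python like the other sketches on this page. It assumes the ID is the first word of each header line and that (start, end) coordinates are 1-based and inclusive, which matches the three expected outputs in the question. The helper names `read_fasta` and `subsequence` are hypothetical.

```python
def read_fasta(lines):
    """Return (id, sequence) pairs; the id is the first word of each header."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].split()[0], ""])
        elif records:
            # Sequence lines are appended to the most recent record
            records[-1][1] += line
    return [tuple(record) for record in records]


def subsequence(seq, start, end):
    """Slice using 1-based, inclusive coordinates."""
    return seq[start - 1:end]


fasta_text = """>NM_000016 700 200 234
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
>NM_000775 700 124 236
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
>NM_003820 700 111 222
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA
"""
sequences = dict(read_fasta(fasta_text.splitlines()))
print(subsequence(sequences["NM_000016"], 1, 3))
```

The list of (id, sequence) pairs maps directly onto a two-column table, one row per record.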

Itertools to generate scrambled combinations

≯℡__Kan透↙ posted on 2019-11-28 14:33:37
What I want to do is obtain all combinations and all unique permutations of each combination. The combinations-with-replacement function only gets me so far:

    from itertools import combinations_with_replacement as cwr
    foo = list(cwr('ACGT', n))  ## n is an integer

My intuition on how to move forward is to do something like this:

    import numpy as np
    from itertools import permutations as perm
    bar = []
    for x in foo:
        carp = list(perm(x))
        for i in range(len(carp)):
            for j in range(i+1,len(carp)):
                if carp[i] == carp[j]:
                    carp[j] = ''
        carp = carp[list(np.where(np.array(carp) != '')[0])]
        for y in carp:
            bar
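The snippet above is truncated, so this is a sketch of one simpler route and not the asker's final code. A `set` removes duplicate permutations without the pairwise comparison loop. Taken over all combinations, the unique permutations are exactly the length-n sequences that `itertools.product` generates. The value `n = 3` and the names `unique_perms` and `flat` are chosen only for illustration.

```python
from itertools import combinations_with_replacement as cwr
from itertools import permutations, product

n = 3  # illustrative value; the question leaves n unspecified

# Map each combination to its distinct orderings; set() drops the repeats
# that permutations() emits when a combination contains equal letters
unique_perms = {
    combo: sorted(set(permutations(combo)))
    for combo in cwr('ACGT', n)
}

# Flattened, the groups cover every length-n sequence exactly once
flat = sorted(p for perms in unique_perms.values() for p in perms)
print(len(unique_perms), len(flat))
print(flat == sorted(product('ACGT', repeat=n)))
```

If the grouping by combination is not needed, `product('ACGT', repeat=n)` alone yields the same sequences lazily.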