bioinformatics | 易学教程

Recursive Generators in Python

I wrote a function that returns a generator over every unique combination of substrings of a given length that contains more than n elements from a primary string. As an illustration: if I have 'abcdefghi', a probe length of two, and a threshold of 4 elements per list, I'd like to get:

    ['ab', 'cd', 'ef', 'gh']
    ['ab', 'de', 'fg', 'hi']
    ['bc', 'de', 'fg', 'hi']

My first attempt at this problem returned a list of lists, which exhausted the computer's memory. As a crude second solution, I created a generator that does something similar. The problem is that I
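The excerpt is cut off before the actual problem is stated, so the following is only a minimal sketch of one way a recursive generator could produce such lists lazily. It assumes the substrings must be non-overlapping, taken left to right, and that a list qualifies once it holds at least `threshold` elements; the name `probe_sets` and its parameters are hypothetical, not taken from the question.

```python
def probe_sets(s, k, threshold, start=0, chosen=()):
    """Lazily yield lists of non-overlapping length-k substrings of s.

    Assumption: a list qualifies once it has at least `threshold` elements.
    """
    if len(chosen) >= threshold:
        yield list(chosen)
    # Prune branches that can no longer reach the threshold.
    needed = threshold - len(chosen)
    if needed > 0 and len(s) - start < needed * k:
        return
    for i in range(start, len(s) - k + 1):
        # Recurse on the remainder of the string after the chosen probe.
        yield from probe_sets(s, k, threshold, i + k, chosen + (s[i:i + k],))


results = list(probe_sets("abcdefghi", 2, 4))
```

Because each list is yielded as soon as it is built, nothing forces the whole result set into memory at once; the `list(...)` call above is only there to inspect a small example.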

Finding matching keys in two large dictionaries and doing it fast

Question: I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries. For example:

    myRDP = {'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT'}
    myNames = {'Actinobacter': '8924342'}

I want to print the value for Actinobacter (8924342), since its key also appears in myRDP. The following code works, but is very slow:

    for key in myRDP:
        for jey in myNames:
            if key == jey:
                print(key, myNames[key])

I've tried the following but it always results in a
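The nested loop compares every key against every other key, which is quadratic in the number of entries. A minimal sketch of a faster approach, assuming Python 3 where dictionary key views support set operations, is to intersect the two key views directly. The sample sequences below are shortened placeholders, not real data.

```python
# Placeholder data shaped like the dictionaries in the question.
myRDP = {"Actinobacter": "GATCGATCA", "subtilus sp.": "ATCGATTACT"}
myNames = {"Actinobacter": "8924342"}

# Key views behave like sets, so the intersection uses hashing
# instead of comparing every pair of keys.
shared = myRDP.keys() & myNames.keys()

matches = {key: myNames[key] for key in shared}
for key, value in matches.items():
    print(key, value)
```

An equivalent alternative is to loop over one dictionary and test `key in other_dict`, which is also a hash lookup per key.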

Read FASTA into a dataframe and extract subsequences of FASTA file

Question: I have a small FASTA file of DNA sequences which looks like this:

    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

Questions:

1) How can I read this FASTA file into R as a data frame where each row is a sequence record, the 1st column is the refseqID, and the 2nd column is the sequence?
2) How to extract subsequence at (start,

Pandas: .groupby().size() and percentages

Question: I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:

    Localization                           RNA level
    cytoplasm                              1 Non-expressed     7
                                           2 Very low         13
                                           3 Low               8
                                           4 Medium            6
                                           5 Moderate          8
                                           6 High              2
                                           7 Very high         6
    cytoplasm & nucleus                    1 Non-expressed     5
                                           2 Very low          8
                                           3 Low               2
                                           4 Medium           10
                                           5 Moderate         16
                                           6 High              6
                                           7 Very high         5
    cytoplasm & nucleus & plasma membrane  1 Non-expressed     6
                                           2 Very low          3
                                           3 Low               3
                                           4 Medium            7
                                           5 Moderate          8
                                           6 High              4
                                           7 Very high         1

What I want to do is to calculate the separate
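The excerpt ends before the goal is fully stated. Going by the title, one plausible reading is that each count should become a percentage of its Localization group. A minimal sketch under that assumption, using the first two groups from the table and treating labels such as "1 Non-expressed" as single index values:

```python
import pandas as pd

levels = ["1 Non-expressed", "2 Very low", "3 Low", "4 Medium",
          "5 Moderate", "6 High", "7 Very high"]
index = pd.MultiIndex.from_product(
    [["cytoplasm", "cytoplasm & nucleus"], levels],
    names=["Localization", "RNA level"],
)
# Counts copied from the first two groups of the table above.
counts = pd.Series([7, 13, 8, 6, 8, 2, 6, 5, 8, 2, 10, 16, 6, 5], index=index)

# transform("sum") repeats each group's total on every row of that group,
# so the division stays aligned with the original MultiIndex.
group_totals = counts.groupby(level="Localization").transform("sum")
percent = 100 * counts / group_totals
```

Each Localization group in `percent` then sums to 100.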

BioMart: Is there a way to easily change the species for all of my code?

Below is a small fraction of my code:

    library(biomaRt)
    ensembl_hsapiens <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
                              filters = "biotype",
                              values = "protein_coding",
                              mart = ensembl_hsapiens)
    paralogues[["hsapiens"]] <- getBM(attributes = c("external_gene_name", "hsapiens_paralog_associated_gene_name"),
                                      filters = "ensembl_gene_id",
                                      values = c(ensembl_gene_ID),
                                      mart = ensembl_hsapiens)

This bit of code only lets me extract the paralogues for hsapiens; is there a way for me to easily get the

Using regex to transform data into a dictionary in Python

I have a dataset of FASTA-formatted sequences, basically like this:

    >pc284
    ATCGCGACTCGAC
    >pc293
    ACCCGACCTCAGC

I want to use each tag as a key in the dictionary, and store the gene as a value. This is the code I have, but it really isn't doing anything:

    import re
    fileData = open('d.fasta', 'r')
    myDict = dict()
    for line in fileData:
        match = re.search('(\>)(\w+)(\r)(\w+)', line)
        if match:
            gene = match.group(3)
            myDict[gene[0]] = gene[1]
    print(myDict)

\r is not a valid character class, I think you meant to use \s instead. You can reduce the groups if you don't use them either. But most of
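The answer is truncated here, so the following is only a sketch of how the `\s` suggestion could be applied, not the answerer's own code. Note that the question's loop reads one line at a time, so a pattern spanning the header and the sequence can never match; the sketch therefore searches the whole text at once. It assumes each sequence fits on a single line, and it uses an in-memory string in place of `d.fasta`.

```python
import re

# Stand-in for the contents of d.fasta.
text = """>pc284
ATCGCGACTCGAC
>pc293
ACCCGACCTCAGC
"""

# Two groups only: the tag after '>' and the sequence on the next line.
# \s+ matches the newline (and any \r) between them.
pairs = re.findall(r">(\w+)\s+(\w+)", text)
my_dict = dict(pairs)
print(my_dict)
```

For multi-line sequences a line-by-line parser that accumulates sequence lines under the most recent header would be a safer choice.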

Convert csv to Newick tree

So I have a CSV file where each line represents hierarchical data in the form:

    'Phylum','Class','Order','Family','Genus','Species','Subspecies','unique_gi'

I would like to convert this to the classic Newick tree format sans distances. Either a novel method or a Python package would be amazing. Thank you!

You could use some simple Python to build out a tree from the CSV, and then write it out to a Newick tree. Not sure if this is what you're trying to do or not.

    import csv
    from collections import defaultdict
    from pprint import pprint

    def tree():
        return defaultdict(tree)

    def tree_add(t, path):
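The answer's code is cut off above, so here is a separate, self-contained sketch of the same general idea: build a nested dictionary from the rows, then serialise it recursively. The helper names `build_tree` and `to_newick` and the three-column sample data are hypothetical. It assumes plain unquoted labels without characters that Newick treats specially (spaces, commas, parentheses, colons, semicolons).

```python
import csv
import io

def build_tree(rows):
    """Nest each row's values, so every column becomes one tree level."""
    root = {}
    for row in rows:
        node = root
        for name in row:
            node = node.setdefault(name, {})
    return root

def to_newick(node):
    """Serialise sibling nodes as a comma-separated Newick fragment."""
    parts = []
    for name, children in node.items():
        if children:
            parts.append("(" + to_newick(children) + ")" + name)
        else:
            parts.append(name)
    return ",".join(parts)

# Hypothetical three-level sample standing in for the real CSV file.
sample = "P1,C1,O1\nP1,C1,O2\nP1,C2,O3\n"
rows = csv.reader(io.StringIO(sample))
# The outer parentheses form an unnamed root, which keeps the output
# valid when the first column contains more than one value.
newick = "(" + to_newick(build_tree(rows)) + ");"
print(newick)
```

For the quoted fields shown in the question, passing `quotechar="'"` to `csv.reader` would strip the single quotes before the tree is built.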

How can I convert Ensembl ID to gene symbol in R?

Question: I have a data.frame containing Ensembl IDs in one column; I would like to find the corresponding gene symbols for the values of that column and add them to a new column in my data frame. I used biomaRt, but it couldn't find any of the Ensembl IDs! Here is my sample data ( df[1:2,] ):

    row.names  organism      gene
    41         Homo-Sapiens  ENSP00000335357
    115        Homo-Sapiens  ENSP00000227378

and I want to get something like this:

    row.names  organism      gene             id
    41         Homo-Sapiens  ENSP00000335357  CDKN3
    115        Homo-Sapiens

Read FASTA into a dataframe and extract subsequences of FASTA file

I have a small FASTA file of DNA sequences which looks like this:

    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

Questions:

1) How can I read this FASTA file into R as a data frame where each row is a sequence record, the 1st column is the refseqID, and the 2nd column is the sequence?
2) How can I extract the subsequence at a (start, end) location?

    NM_000016 1 3 #"ACA"
    NM_000775 2 6 #"TAACC"
    NM_003820 3 5 #"TTC"

You should have a look
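The question asks for R and the answer is truncated, so the following Python sketch is only a language-neutral illustration of the two steps: parse records into (ID, sequence) rows, then slice with 1-based inclusive coordinates, as the expected outputs above imply. The function names `read_fasta` and `subsequence` are hypothetical, and the ID is assumed to be the first whitespace-separated token of the header.

```python
import io

def read_fasta(handle):
    """Return a list of (record_id, sequence) tuples, one per record."""
    records = []
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first token of the header as the ID.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def subsequence(seq, start, end):
    """Slice with 1-based, inclusive start and end positions."""
    return seq[start - 1:end]

fasta_text = """>NM_000016 700 200 234
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
>NM_000775 700 124 236
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
>NM_003820 700 111 222
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA
"""
records = read_fasta(io.StringIO(fasta_text))
seqs = dict(records)
print(subsequence(seqs["NM_000016"], 1, 3))
```

The list of tuples maps directly onto a two-column table, which is the shape the question wants for its data frame.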

Itertools to generate scrambled combinations

What I want to do is obtain all combinations and all unique permutations of each combination. The combinations with replacement function only gets me so far:

    from itertools import combinations_with_replacement as cwr
    foo = list(cwr('ACGT', n))  ## n is an integer

My intuition on how to move forward is to do something like this:

    import numpy as np
    from itertools import permutations as perm
    bar = []
    for x in foo:
        carp = list(perm(x))
        for i in range(len(carp)):
            for j in range(i+1,len(carp)):
                if carp[i] == carp[j]:
                    carp[j] = ''
        carp = carp[list(np.where(np.array(carp) != '')[0])]
        for y in carp:
            bar
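The pairwise comparison in that loop can be replaced by a set, which removes duplicate permutations directly. A minimal sketch, assuming the goal is every distinct ordering of every combination; the function name `scrambled` is hypothetical, and `n = 2` is chosen only to keep the example small.

```python
from itertools import combinations_with_replacement, permutations, product

def scrambled(alphabet, n):
    """Yield each unique permutation of each combination as a string."""
    for combo in combinations_with_replacement(alphabet, n):
        # set() drops the repeated orderings caused by repeated letters.
        for p in sorted(set(permutations(combo))):
            yield "".join(p)

result = list(scrambled("ACGT", 2))

# Taken together, these are exactly the strings of length n over the
# alphabet, which itertools.product generates without any deduplication.
direct = ["".join(p) for p in product("ACGT", repeat=2)]
```

If the grouping by combination does not matter, `product` is the simpler and cheaper route, since it never builds the duplicate permutations in the first place.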