bioinformatics | 易学教程

Dataframe processing

I have a dataframe, which I read with
    Match <- read.table("Match.txt", sep="", fill =T, stringsAsFactors = FALSE, quote = "", header = F)
and which looks like this:
    > ab
    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
    1 Inspecting sequence ID chr1:173244300-173244500 NA NA
    2 V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg
    3 V$ATF3_Q6 | 34 (-) | 0.788 | 0.655 | agggaaCGACAcag
    4 V$ATF3_Q6 | 102 (+) | 0.738 | 0.685 | cccTGAGCttagga
    5 V$CEBPB_01 | 24 (+) | 0.950 | 0.882 | ccatcagGGAAGgg
    72 V$YY1_01 | 117 (+) | 0.996 | 0.984 | acttCCCATcttttaag
    73 Inspecting sequence ID chr1:173244350-173244550 NA NA
    74 V$ATF3_Q6 | 52

Split a column to multiple columns

I have a table whose first column is:
    chr10:100002872-100002872
    chr10:100003981-100003981
    chr10:100004774-100004774
    chr10:100005285-100005285
    chr10:100007123-100007123
I want to convert it to three separate columns, but I couldn't work out how to specify both ":" and "-" as delimiters for the strsplit command. What should I do?
Here's one way:
    library(data.table)
    DF[, paste0("V1.",1:3) ] <- tstrsplit(DF$V1, ":|-")
    #                          V1  V1.1      V1.2      V1.3
    # 1 chr10:100002872-100002872 chr10 100002872 100002872
    # 2 chr10:100003981-100003981 chr10 100003981 100003981
    # 3 chr10:100004774-100004774 chr10 100004774 100004774
    # 4 chr10:100005285-100005285
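For comparison, the same split can be sketched in Python with the standard library. This is only an illustration, not part of the original answer: the helper name `split_region` is hypothetical, and it assumes chromosome names contain neither ":" nor "-".

```python
import re


def split_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    # The character class [:-] matches either delimiter, much like ":|-" in R.
    chrom, start, end = re.split(r"[:-]", region)
    return chrom, int(start), int(end)


regions = [
    "chr10:100002872-100002872",
    "chr10:100003981-100003981",
    "chr10:100004774-100004774",
]
rows = [split_region(r) for r in regions]
for row in rows:
    print(row)
```

If a chromosome name could itself contain a hyphen, splitting first on the last ":" and then on the first "-" of the remainder would be a safer variant.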

Pyparsing: extract variable length, variable content, variable whitespace substring

I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always contain the word Gleason and two numbers that add up to a third number. Humans typed these in over two decades, so various conventions of whitespace and modifiers are included. Below is my Backus-Naur form so far, and two example records. Just for prostatectomies, we're looking at upwards of a thousand cases. I am using pyparsing because I'm learning Python, and have no fond memories of my very limited exposure to regex writing. My question: how can I pluck out these Gleason grades

Processing the input file based on range overlap

I have a huge input file (a representative sample of which is shown below as input):
    > input
      CT1          CT2           CT3
    1 chr1:200-400 chr1:250-450  chr1:400-800
    2 chr1:800-970 chr2:200-500  chr1:700-870
    3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following some rules (described below) so that I get an output like:
    > output
                  CT1 CT2 CT3
    chr1:200-400  1   1   0
    chr1:800-970  1   0   0
    chr2:300-700  1   1   0
    chr1:250-450  1   1   0
    chr2:200-500  1   1   0
    chr2:600-1000 0   1   1
    chr1:400-800  0   0   1
    chr1:700-870  0   1   1
    chr2:700-1400 0   1   1
Rules: Take every index (the first in this case is chr1:200-400), see if it

BioMart: Is there a way to easily change the species for all of my code?

Question: Below is a small fraction of my code:
    library(biomaRt)
    ensembl_hsapiens <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"), filters = "biotype", values = "protein_coding", mart = ensembl_hsapiens)
    paralogues[["hsapiens"]] <- getBM(attributes = c("external_gene_name", "hsapiens_paralog_associated_gene_name"), filters = "ensembl_gene_id", values = c(ensembl_gene_ID) , mart = ensembl_hsapiens)
This bit of

Using regex to transform data into a dictionary in Python

Question: I have a dataset of FASTA-formatted sequences, basically like this:
    >pc284
    ATCGCGACTCGAC
    >pc293
    ACCCGACCTCAGC
I want to use each tag as a key in a dictionary, and store the gene as the value. This is the code I have, but it really isn't doing anything:
    import re
    fileData = open('d.fasta', 'r')
    myDict = dict()
    for line in fileData:
        match = re.search('(\>)(\w+)(\r)(\w+)', line)
        if match:
            gene = match.group(3)
            myDict[gene[0]] = gene[1]
    print myDict
Answer 1: \r is not a valid character class,
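One reason the loop above finds nothing is that it sees a single line at a time, so a header and its sequence never appear in the same string for the pattern to match. A minimal line-by-line sketch that avoids regex altogether is given here. It is not the (truncated) answer's code; the function name `read_fasta` is hypothetical, and it assumes every sequence follows a `>` header line.

```python
def read_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]       # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # join sequences wrapped over several lines
    return records


example = """>pc284
ATCGCGACTCGAC
>pc293
ACCCGACCTCAGC
""".splitlines()

print(read_fasta(example))
# → {'pc284': 'ATCGCGACTCGAC', 'pc293': 'ACCCGACCTCAGC'}
```

To read from disk, the same function can be passed an open file object, for example `with open('d.fasta') as handle: read_fasta(handle)`.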

WinError 2 The system cannot find the file specified (Python)

Question: I have a Fortran program and want to execute it from Python for multiple files. I have 2000 input files, but my Fortran code can process only one file at a time. How should I call the Fortran program from Python? My script:
    import subprocess
    import glob
    input = glob.glob('C:/Users/Vishnu/Desktop/Fortran_Program_Rum/*.txt')
    output = glob.glob('C:/Users/Vishnu/Desktop/Fortran_Program_Rum/Output/')
    f = open("output", "w")
    for i in input:
        subprocess.Popen(["FORTRAN ~/C:/Users/Vishnu/Desktop
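A general pattern for running one external program over many files is sketched here. It rests on assumptions that the question does not confirm: that the compiled program accepts the input file path as a command-line argument and writes its results to standard output. The helper name `run_on_each` and the `.out` suffix are hypothetical.

```python
import subprocess
from pathlib import Path


def run_on_each(command_prefix, input_files, output_dir):
    """Run command_prefix + [input_file] once per file; save stdout in output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for input_file in input_files:
        input_file = Path(input_file)
        out_path = output_dir / (input_file.stem + ".out")
        with open(out_path, "w") as out:
            # Program and arguments are separate list elements. A single string
            # such as "program file.txt" inside the list is taken as the program
            # name itself, which typically ends in "file not found".
            subprocess.run(command_prefix + [str(input_file)], stdout=out, check=True)
        written.append(out_path)
    return written
```

Here `command_prefix` would be a list holding the full path to the compiled executable, and `input_files` could come from `glob.glob` as in the script above. `subprocess.run` waits for each process to finish, so the 2000 files are processed one after another rather than launched all at once.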

How to randomly extract FASTA sequences using Python?

I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I randomly extract sequences? For example, I would like to randomly select 2 sequences out of the total. The tools I have found extract by percentage, not by number of sequences. Can anyone help me?
A.fasta
    >chr1:1310706-1310726
    GACGGTTTCCGGTTAGTGGAA
    >chr1:901959-901979
    GAGGGCTTTCTGGAGAAGGAG
    >chr1:983001-983021
    GTCCGCTTGCGGGACCTGGGG
    >chr1:984333-984353
    CTGGAATTCCGGGCGCTGGAG
    >chr1:1154147-1154167
    GAGATCGTCCGGGACCTGGGT
Expected Output
    >chr1
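One way to pick a fixed number of records is to read them all into a list and draw with `random.sample`, which samples without replacement. This is a minimal sketch using only the standard library; the function names `parse_fasta` and `sample_fasta` are hypothetical, and it assumes the file fits in memory.

```python
import random


def parse_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:], ""))
        elif records:
            header, seq = records[-1]
            records[-1] = (header, seq + line)
    return records


def sample_fasta(records, n, seed=None):
    """Pick n distinct records at random; pass a seed for a repeatable draw."""
    rng = random.Random(seed)
    return rng.sample(records, n)


fasta_text = """>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
"""
records = parse_fasta(fasta_text.splitlines())
for header, seq in sample_fasta(records, 2):
    print(">" + header)
    print(seq)
```

`random.sample` raises `ValueError` if `n` exceeds the number of records, which is a useful guard against asking for more sequences than the file contains.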

Convert csv to Newick tree

Question: I have a CSV file where each line represents hierarchical data in the form:
    'Phylum','Class','Order','Family','Genus','Species','Subspecies','unique_gi'
I would like to convert this to the classic Newick tree format without distances. Either a novel method or a Python package would be amazing. Thank you!
Answer 1: You could use some simple Python to build out a tree from the CSV, and then write it out to a Newick tree. Not sure if this is what you're trying to do or not.
    import csv from
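The idea of building a tree from the rows and then writing it out can be sketched as follows. This is not the truncated answer's code. It assumes unquoted fields, no header row, and names free of Newick special characters such as commas, parentheses, and spaces; the function names are hypothetical.

```python
import csv
import io


def build_tree(rows):
    """Nest each row (root rank first) into a dict of dicts."""
    tree = {}
    for row in rows:
        node = tree
        for name in row:
            node = node.setdefault(name, {})
    return tree


def to_newick(tree):
    """Serialize a nested dict as a comma-separated list of Newick subtrees."""
    parts = []
    for name, children in tree.items():
        if children:
            parts.append("(" + to_newick(children) + ")" + name)
        else:
            parts.append(name)  # leaf node
    return ",".join(parts)


def csv_to_newick(text):
    """Convert CSV text to a Newick string with internal labels and no distances."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    return "(" + to_newick(build_tree(rows)) + ");"


print(csv_to_newick("A,B,C\nA,B,D\nA,E,F\n"))
# → (((C,D)B,(F)E)A);
```

Rows that share a prefix of ranks end up under the same internal node, so the hierarchy in the CSV maps directly onto the nesting of parentheses.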

Using the reserved word “class” as field name in Django and Django REST Framework

Description of the problem
Taxonomy is the science of defining and naming groups of biological organisms on the basis of shared characteristics. Organisms are grouped together into taxa (singular: taxon) and these groups are given a taxonomic rank. The principal ranks in modern use are domain, kingdom, phylum, class, order, family, genus and species. More information on Taxonomy and Taxonomic ranks in Wikipedia.
Following the example for the red fox in the article Taxonomic rank in Wikipedia, I need to create a JSON output like this:
    {
      "species": "vulpes",
      "genus": "Vulpes",
      "family": "Canidae"