bioinformatics

Dataframe processing

独自空忆成欢 posted on 2019-11-28 14:20:29
I have a dataframe, which I read with
    Match <- read.table("Match.txt", sep="", fill =T, stringsAsFactors = FALSE, quote = "", header = F)
and which looks like this:
    > ab
       V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
    1  Inspecting sequence ID chr1:173244300-173244500 NA NA
    2  V$ATF3_Q6 | 19 (-) | 0.877 | 0.622 | aagtccCATCAggg
    3  V$ATF3_Q6 | 34 (-) | 0.788 | 0.655 | agggaaCGACAcag
    4  V$ATF3_Q6 | 102 (+) | 0.738 | 0.685 | cccTGAGCttagga
    5  V$CEBPB_01 | 24 (+) | 0.950 | 0.882 | ccatcagGGAAGgg
    72 V$YY1_01 | 117 (+) | 0.996 | 0.984 | acttCCCATcttttaag
    73 Inspecting sequence ID chr1:173244350-173244550 NA NA
    74 V$ATF3_Q6 | 52

Split a column to multiple columns

一笑奈何 posted on 2019-11-28 14:16:20
I have a table whose first column is:
    chr10:100002872-100002872
    chr10:100003981-100003981
    chr10:100004774-100004774
    chr10:100005285-100005285
    chr10:100007123-100007123
I want to convert it to 3 separate columns, but I couldn't work out how to specify both ":" and "-" as delimiters for the strsplit command. What should I do?
Here's one way:
    library(data.table)
    DF[, paste0("V1.",1:3) ] <- tstrsplit(DF$V1, ":|-")
    #                          V1  V1.1      V1.2      V1.3
    # 1 chr10:100002872-100002872 chr10 100002872 100002872
    # 2 chr10:100003981-100003981 chr10 100003981 100003981
    # 3 chr10:100004774-100004774 chr10 100004774 100004774
    # 4 chr10:100005285-100005285
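For readers working in Python rather than R, the same idea (splitting on either ":" or "-") can be sketched with the standard `re` module. The helper name `split_region` is made up for illustration, and the sketch assumes chromosome names contain neither ":" nor "-".

```python
import re

def split_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end).

    Assumes the chromosome name itself contains no ':' or '-'.
    """
    chrom, start, end = re.split(r"[:-]", region)
    return chrom, int(start), int(end)

regions = [
    "chr10:100002872-100002872",
    "chr10:100003981-100003981",
]
rows = [split_region(r) for r in regions]
for chrom, start, end in rows:
    print(chrom, start, end)
```

The character class `[:-]` plays the same role as the `":|-"` pattern passed to `tstrsplit` above.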

Pyparsing: extract variable length, variable content, variable whitespace substring

為{幸葍}努か posted on 2019-11-28 12:58:43
I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to another number. Humans typed these in over two decades. Various conventions of whitespace and modifiers are included. Below is my Backus-Naur form so far, and two example records. Just for prostatectomies, we're looking at upwards of a thousand cases. I am using pyparsing because I'm learning python, and have no fond memories of my very limited exposure to regex writing. My question: how can I pluck out these Gleason grades

Processing the input file based on range overlap

天大地大妈咪最大 posted on 2019-11-28 11:51:11
I have a huge input file (a representative sample of which is shown below as input):
    > input
      CT1          CT2           CT3
    1 chr1:200-400 chr1:250-450  chr1:400-800
    2 chr1:800-970 chr2:200-500  chr1:700-870
    3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following some rules (described below) so that I get an output like:
    > output
                  CT1 CT2 CT3
    chr1:200-400    1   1   0
    chr1:800-970    1   0   0
    chr2:300-700    1   1   0
    chr1:250-450    1   1   0
    chr2:200-500    1   1   0
    chr2:600-1000   0   1   1
    chr1:400-800    0   0   1
    chr1:700-870    0   1   1
    chr2:700-1400   0   1   1
Rules: Take every index (the first in this case is chr1:200-400), see if it

BioMart: Is there a way to easily change the species for all of my code?

时间秒杀一切 posted on 2019-11-28 11:18:48
Question: Below is a small fraction of my code:
    library(biomaRt)
    ensembl_hsapiens <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    hsapien_PC_genes <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"), filters = "biotype", values = "protein_coding", mart = ensembl_hsapiens)
    paralogues[["hsapiens"]] <- getBM(attributes = c("external_gene_name", "hsapiens_paralog_associated_gene_name"), filters = "ensembl_gene_id", values = c(ensembl_gene_ID) , mart = ensembl_hsapiens)
This bit of

Using regex to transform data into a dictionary in Python

ε祈祈猫儿з posted on 2019-11-28 09:16:20
Question: I have a dataset of FASTA-formatted sequences, basically like this:
    >pc284
    ATCGCGACTCGAC
    >pc293
    ACCCGACCTCAGC
I want to use each tag as a key in a dictionary and store the gene as the value. This is the code I have, but it isn't really doing anything:
    import re
    fileData = open('d.fasta', 'r')
    myDict = dict()
    for line in fileData:
        match = re.search('(\>)(\w+)(\r)(\w+)', line)
        if match:
            gene = match.group(3)
            myDict[gene[0]] = gene[1]
    print myDict
Answer 1: \r is not a valid character class,
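One thing worth noting about the question's loop: it reads the file one line at a time, so a pattern that expects the header and the sequence in the same string can never match. A minimal sketch of one alternative is to apply a regex to the whole file content at once. This is not the truncated answer's code; the function name `fasta_to_dict` is made up, and the sketch assumes each record has a one-word tag and a single-line sequence, as in the example data.

```python
import re

# ">tag", then any whitespace (including "\r\n"), then the sequence.
FASTA_RECORD = re.compile(r">(\w+)\s+(\w+)")

def fasta_to_dict(text):
    """Map each FASTA tag to its (single-line) sequence."""
    return dict(FASTA_RECORD.findall(text))

sample = ">pc284\nATCGCGACTCGAC\n>pc293\nACCCGACCTCAGC\n"
print(fasta_to_dict(sample))
```

With a real file, the text would come from something like `open('d.fasta').read()`. Sequences wrapped over several lines would need a line-by-line parser or a dedicated FASTA library instead.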

WinError 2 The system cannot find the file specified (Python)

烂漫一生 posted on 2019-11-28 03:29:17
Question: I have a Fortran program and want to run it from Python for multiple files. I have 2000 input files, but my Fortran code can only process one file at a time. How should I call the Fortran program from Python?
My script:
    import subprocess
    import glob
    input = glob.glob('C:/Users/Vishnu/Desktop/Fortran_Program_Rum/*.txt')
    output = glob.glob('C:/Users/Vishnu/Desktop/Fortran_Program_Rum/Output/')
    f = open("output", "w")
    for i in input:
        subprocess.Popen(["FORTRAN ~/C:/Users/Vishnu/Desktop
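A common cause of WinError 2 with `subprocess` is passing the program name and its arguments as a single string inside the list, so Windows looks for an executable with that whole string as its name. The sketch below passes the executable and each input file as separate list items. It rests on assumptions not stated in the question: that the compiled Fortran program accepts the input file name as a command-line argument and writes its result to standard output. The names `build_command` and `run_all` and the `.out` suffix are made up for illustration.

```python
import subprocess
from pathlib import Path

def build_command(executable, input_file):
    """Return the argument list: program first, then one argument per item."""
    return [str(executable), str(input_file)]

def run_all(executable, input_dir, output_dir):
    """Run the program once per *.txt file, saving stdout per input file.

    Assumes (hypothetically) that the program takes the input path as its
    only argument and prints its result to standard output.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for input_file in sorted(Path(input_dir).glob("*.txt")):
        out_path = output_dir / (input_file.stem + ".out")
        with open(out_path, "w") as out:
            # check=True raises CalledProcessError if the program fails.
            subprocess.run(build_command(executable, input_file),
                           stdout=out, check=True)
```

If the program instead reads its input from standard input, the open input file could be passed through the `stdin=` parameter of `subprocess.run` rather than as an argument.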

How to randomly extract FASTA sequences using Python?

 ̄綄美尐妖づ posted on 2019-11-28 02:11:16
I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I randomly extract sequences? For example, I would like to randomly select 2 sequences out of the total. The tools I have found extract by percentage, not by number of sequences. Can anyone help me?
A.fasta
    >chr1:1310706-1310726
    GACGGTTTCCGGTTAGTGGAA
    >chr1:901959-901979
    GAGGGCTTTCTGGAGAAGGAG
    >chr1:983001-983021
    GTCCGCTTGCGGGACCTGGGG
    >chr1:984333-984353
    CTGGAATTCCGGGCGCTGGAG
    >chr1:1154147-1154167
    GAGATCGTCCGGGACCTGGGT
Expected Output >chr1
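One way to pick a fixed number of records, sketched with only the standard library: parse the file into (header, sequence) pairs and draw from them with `random.sample`, which samples without replacement. The function names and the optional `seed` parameter are illustrative choices, not part of the original post.

```python
import random

def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def sample_fasta(text, n, seed=None):
    """Pick n distinct records at random; pass a seed for repeatable picks."""
    rng = random.Random(seed)
    return rng.sample(parse_fasta(text), n)

fasta_text = """>chr1:1310706-1310726
GACGGTTTCCGGTTAGTGGAA
>chr1:901959-901979
GAGGGCTTTCTGGAGAAGGAG
>chr1:983001-983021
GTCCGCTTGCGGGACCTGGGG
"""
for header, seq in sample_fasta(fasta_text, 2):
    print(">" + header)
    print(seq)
```

`random.sample` raises `ValueError` if `n` is larger than the number of records, which is usually the behaviour you want here.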

Convert csv to Newick tree

99封情书 posted on 2019-11-28 01:59:04
Question: I have a csv file where each line represents hierarchical data in the form:
    'Phylum','Class','Order','Family','Genus','Species','Subspecies','unique_gi'
I would like to convert this to the classic Newick tree format, without distances. Either a novel method or a Python package would be amazing. Thank you!
Answer 1: You could use some simple Python to build out a tree from the CSV, and then write it out to a Newick tree. Not sure if this is what you're trying to do or not.
    import csv from
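Since the answer's code is cut off above, here is a minimal sketch of the approach it describes, not the answer's actual code: build a nested dictionary from the rows, then serialize it recursively into Newick without branch lengths. The function names are made up, and the sketch assumes single-quoted fields (as in the header line shown) and names that need no Newick quoting (no spaces, commas, parentheses, colons or semicolons).

```python
import csv
import io

def build_tree(rows):
    """Nest each row (root-to-leaf path) into a dict of dicts."""
    tree = {}
    for row in rows:
        node = tree
        for name in row:
            node = node.setdefault(name, {})
    return tree

def to_newick(tree):
    """Serialize sibling nodes; internal nodes are written as (children)name."""
    parts = []
    for name, children in tree.items():
        if children:
            parts.append("(" + to_newick(children) + ")" + name)
        else:
            parts.append(name)
    return ",".join(parts)

def csv_to_newick(csv_text):
    """Convert CSV text with single-quoted fields into a Newick string."""
    reader = csv.reader(io.StringIO(csv_text), quotechar="'")
    rows = [row for row in reader if row]
    return "(" + to_newick(build_tree(rows)) + ");"

example = "'A','B','x'\n'A','B','y'\n'A','C','z'\n"
print(csv_to_newick(example))
```

The outer parentheses add an unnamed root, so files containing several phyla still produce a single tree. For real data, a tree library may be a better fit than hand-rolled serialization.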

Using the reserved word “class” as field name in Django and Django REST Framework

巧了我就是萌 posted on 2019-11-27 23:53:38
Description of the problem Taxonomy is the science of defining and naming groups of biological organisms on the basis of shared characteristics. Organisms are grouped together into taxa (singular: taxon) and these groups are given a taxonomic rank. The principal ranks in modern use are domain, kingdom, phylum, class, order, family, genus and species. More information on Taxonomy and Taxonomic ranks in Wikipedia. Following the example for the red fox in the article Taxonomic rank in Wikipedia I need to create a JSON output like this: { "species": "vulpes", "genus": "Vulpes", "family": "Canidae"