bioinformatics

Querying the DNS service records to find the hostname and TCP/IP

余生长醉 submitted on 2019-12-30 04:52:16
Question: In a paper about Life Science Identifiers (see LSID Tester, a tool for testing Life Science Identifier resolution services), Dr Roderic DM Page wrote: Given the LSID urn:lsid:ubio.org:namebank:11815, querying the DNS for the SRV record for _lsid._tcp.ubio.org returns animalia.ubio.org:80 as the location of the ubio.org LSID service. I learned that I can map _lsid._tcp.ubio.org to animalia.ubio.org:80 using the host command on Unix: host -t srv _lsid._tcp.ubio.org _lsid._tcp.ubio
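
As a rough illustration of the lookup described above, the following Python sketch derives the SRV query name from an LSID and extracts the host and port from one line of `host -t srv` output. It assumes each answer line has the shape `<name> has SRV record <priority> <weight> <port> <target>.`; the sample line uses a made-up example.org service and made-up priority and weight values, and the function names are hypothetical.

```python
import re

# Assumed shape of one answer line printed by `host -t srv`:
#   <name> has SRV record <priority> <weight> <port> <target>.
SRV_LINE = re.compile(
    r"^(?P<name>\S+) has SRV record "
    r"(?P<priority>\d+) (?P<weight>\d+) (?P<port>\d+) (?P<target>\S+)$"
)


def lsid_to_srv_name(lsid):
    """Build the SRV query name from an LSID (urn:lsid:<authority>:<namespace>:<object>)."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    # The authority (third field) is the domain whose SRV record is queried.
    return f"_lsid._tcp.{parts[2]}"


def parse_srv_line(line):
    """Return (hostname, port) from one `host -t srv` answer line."""
    match = SRV_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised SRV line: {line!r}")
    # Drop the trailing dot of the fully qualified target name.
    return match.group("target").rstrip("."), int(match.group("port"))


# Hypothetical output line, not real data for any service.
sample = "_lsid._tcp.example.org has SRV record 0 0 80 lsid.example.org."

srv_name = lsid_to_srv_name("urn:lsid:ubio.org:namebank:11815")
host_port = parse_srv_line(sample)
```

In practice the sample line would be replaced by the text captured from the `host` command, for example via `subprocess.run`.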

How to remove rows with 0 values using R

徘徊边缘 submitted on 2019-12-30 04:33:11
Question: Hi, I am using a matrix of gene expression fragment counts to identify differentially expressed genes. I would like to know how to remove the rows that have values of 0. My data set would then be more compact, and the downstream analysis I run on this matrix would give fewer spurious results. Input:

    gene         ZPT.1  ZPT.0  ZPT.2  ZPT.3  PDGT.1  PDGT.0
    XLOC_000001   3516    626   1277    770    4309    9030
    XLOC_000002    342     82    185     72     835    1095
    XLOC_000003   2000    361    867    438     454     687
    XLOC_000004    143     30     67     37      90     236
    XLOC_000005      0
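
The question is asked about R, but the filtering idea is language-independent. Here is a minimal Python sketch of it, assuming the matrix has already been read into (gene, counts) pairs. The excerpt does not say whether a row should be dropped when any value is 0 or only when every value is 0, so the hypothetical `mode` parameter covers both. The two XLOC rows come from the input above; the HYPOTHETICAL rows are invented for illustration.

```python
def drop_zero_rows(rows, mode="any"):
    """Filter (gene, counts) pairs.

    mode="any": drop a row if at least one count is 0.
    mode="all": drop a row only if every count is 0.
    """
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")
    has_zeros = any if mode == "any" else all
    return [
        (gene, counts)
        for gene, counts in rows
        if not has_zeros(count == 0 for count in counts)
    ]


rows = [
    ("XLOC_000001", [3516, 626, 1277, 770, 4309, 9030]),
    ("XLOC_000002", [342, 82, 185, 72, 835, 1095]),
    ("HYPOTHETICAL_A", [0, 0, 0, 0, 0, 0]),  # invented all-zero row
    ("HYPOTHETICAL_B", [5, 0, 3, 0, 1, 2]),  # invented row with some zeros
]

no_zero_at_all = [gene for gene, _ in drop_zero_rows(rows, mode="any")]
not_all_zero = [gene for gene, _ in drop_zero_rows(rows, mode="all")]
```

With `mode="any"` only the two XLOC rows survive; with `mode="all"` the partly zero row is kept as well.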

Is there a Boost (or other common lib) type for matrices with string keys?

試著忘記壹切 submitted on 2019-12-25 17:45:09
Question: I have a dense matrix where the indices correspond to genes. While gene identifiers are often integers, they are not contiguous integers. They could be strings instead, too. I suppose I could use a Boost sparse matrix of some sort with integer keys, and it wouldn't matter if they're contiguous. Or would this still occupy a great deal of space, particularly if some genes have identifiers that are nine digits? Further, I am concerned that sparse storage is not appropriate, since this is an all

How to translate a FASTA sequence from dict/ how to make function output a string?

瘦欲@ submitted on 2019-12-25 06:44:52
Question: Firstly, I can't use BioPython :( I need to translate a bunch of DNA sequences from a FASTA file into protein sequences. The FASTA file looks like this: >some info ACCGGGCTAAA >other info ACCGCCAATTT So I can create a function that outputs only the DNA sequence, but when I try to translate it I get the following error: "TypeError: object of type '_io.TextIOWrapper' has no len()" I have no idea how to resolve this. Any help is immensely appreciated!!!!! Also I am taking my first Python
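
The error message indicates that `len()` was called on an open file object rather than on a string. One way to avoid that, sketched below without BioPython, is to read the file into a dict of header to sequence string first and translate each string afterwards. The function names are hypothetical, the standard genetic code is assumed, unknown codons are rendered as `X`, and an incomplete trailing codon is ignored; the in-memory `io.StringIO` stands in for the real file.

```python
import io
from itertools import product

# Standard genetic code, listed in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def read_fasta(handle):
    """Return {header: sequence string} from an open FASTA text handle."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    # Join multi-line sequences so each value is a plain str, not a file object.
    return {name: "".join(chunks) for name, chunks in records.items()}


def translate(dna):
    """Translate a DNA string codon by codon; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks an unrecognised codon
        for i in range(0, usable, 3)
    )


# Stand-in for open("sequences.fasta"); the contents mirror the example above.
demo = io.StringIO(">some info\nACCGGGCTAAA\n>other info\nACCGCCAATTT\n")
proteins = {name: translate(seq) for name, seq in read_fasta(demo).items()}
```

Because `translate` returns a string, its result can be printed, written to a file, or stored in another dict directly.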

biojava Exception in thread “main” java.lang.ArrayIndexOutOfBoundsException:

早过忘川 submitted on 2019-12-25 02:55:09
Question: I have a problem with multiple sequence alignment. I have two sequences as follows, and when I try to align them using BioJava methods I get an error like this. I have no idea what is wrong. I know that the sequences are not the same length, but that should not matter. GSKTGTKITFYEDKNFQGRRYDCDCDCADFHTYLSRCNSIKVEGGTWAVYERPNFAGYMYILPQGEYPEYQRWMGLNDRLSSCRAVHLPSGGQYKIQIFEKGDFSGQMYETTEDCPSIMEQFHMREIHSCKVLEGVWIFYELPNYRGRQYLLDKKEYRKPIDWGAASPAVQSFRRIVE

Use awk with two different delimiters to split and select columns

江枫思渺然 submitted on 2019-12-25 02:44:13
Question: How can I tell gawk to use two different delimiters so that I can separate some columns, but select others using the tab-delimited format of my file? > cat broad_snps.tab chrsnpID rsID freq_bin snp_maf gene_count dist_nearest_gene_snpsnap dist_nearest_gene_snpsnap_protein_coding dist_nearest_gene dist_nearest_gene_located_within loci_upstream loci_downstream ID_nearest_gene_snpsnap ID_nearest_gene_snpsnap_protein_coding ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap

Finding common ID's (intersection) in two dictionaries

落爺英雄遲暮 submitted on 2019-12-25 02:19:31
Question: I wrote a piece of code that is supposed to find the common (intersecting) IDs in line[1] of two different files. On my small sample files it works OK, but on my bigger files it does not. I cannot figure out why; can you tell me what is wrong? The exact problem is that when my input is, e.g., 200 it gives me 90 intersections, but if I reduce it to 150 it gives me 110 intersections; logically the number cannot be higher.

    fileA = open("file1.txt",'r')
    fileB = open("file2.txt",'r')
    output = open("result.txt",'w')
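
Since the rest of the original code is cut off, the cause of the inconsistent counts cannot be determined from the excerpt. As a point of comparison, here is a minimal set-based sketch, assuming the ID is the second whitespace-separated field of each line (the `line[1]` mentioned above). The helper name and the sample data are hypothetical. A set intersection can never contain more elements than the smaller input, which makes it a handy sanity check.

```python
import io


def read_ids(handle, column=1, sep=None):
    """Collect the values of one column (0-based) from an open text handle."""
    ids = set()
    for line in handle:
        fields = line.split(sep)  # sep=None splits on any run of whitespace
        if len(fields) > column:  # skip blank or short lines
            ids.add(fields[column])
    return ids


# Invented stand-ins for open("file1.txt") and open("file2.txt").
file_a = io.StringIO("rowA1 id1\nrowA2 id2\nrowA3 id3\nrowA4 id3\n")
file_b = io.StringIO("rowB1 id3\nrowB2 id4\nrowB3 id1\n\n")

ids_a = read_ids(file_a)
ids_b = read_ids(file_b)
common = ids_a & ids_b
```

Duplicate IDs collapse inside each set, so `len(common)` counts distinct shared IDs rather than matching lines.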

Simplifying elements of a list/array and then adding incremental identifiers a,b,c,d… etc to them

a 夏天 submitted on 2019-12-25 01:44:47
Question: I'm processing the headers of a .fasta file (a file format universally used in genetics/bioinformatics to store DNA/RNA sequence data). FASTA files have headers starting with a > symbol (which give specific info), followed on the next line by the actual sequence data that the header describes. The sequence data extends indefinitely until the next \n, after which comes the next header and its respective sequence. For example: >scaffold1.1_size947603

Genomic coordinates of HGNC gene names

为君一笑 submitted on 2019-12-24 10:14:23
Question: I want to get the coordinates of the human genes in my list (consisting of HGNC gene IDs) using the GenomicFeatures and TxDb.Hsapiens.UCSC.hg19.knownGene R packages from Bioconductor.

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb=(TxDb.Hsapiens.UCSC.hg19.knownGene)
    my_genes = c("INO80","NASP","INO80D","SMARCA1")
    select(txdb, keys = my_genes, columns=c("TXCHROM","TXSTART","TXEND","TXSTRAND"), keytype="GENEID")

However, it doesn't work because txdb doesn't take HGNC identifiers; how can this be solved?

Odds ratio for ordinal variables from PROC GENMOD

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-24 06:03:35
Question: I have a data set for which I am building a logistic regression model, looking at the odds of a binary outcome variable (Therapy), with Stage as an ordinal explanatory variable (0,1,2,3,4). A1c is a continuous variable. Because each patient has two eyes, I must use the repeated subject = patientID(EyeID) statement. The following is my code:

    PROC GENMOD data=new descend;
    class patientID EyeID Stage (param = ordinal) Therapy (ref ="0") Gender(ref="M") Ethnic agegroup/ PARAM=ref;
    model Therapy =