bioinformatics

How to extract chains from a PDB file?

≯℡__Kan透↙ posted on 2019-12-01 00:49:41
I would like to extract chains from PDB files. I have a file named pdb.txt that contains PDB IDs as shown below. The first four characters are the PDB ID and the last character is the chain ID.

    1B68A
    1BZ4B
    4FUTA

I would like to:
1) read the file line by line,
2) download the atomic coordinates of each chain from the corresponding PDB file,
3) save the output to a folder.

I used the following script to extract chains, but it prints only the A chains from the PDB files:

    for i in 1B68 1BZ4 4FUT
    do
        wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="
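The three steps in this question can be sketched in Python. This is a minimal sketch and not the asker's script: the download URL template `https://files.rcsb.org/download/{pdb_id}.pdb` is an assumption (the URL in the post is truncated), and the function names `split_entry`, `extract_chain`, `fetch_pdb` and `extract_from_list` are made up for illustration. The chain filter relies on the fixed-column PDB format, where the chain ID sits in column 22 of ATOM, HETATM and TER records.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumed download endpoint; the URL in the original post is truncated.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def split_entry(entry):
    """Split an entry such as '1B68A' into ('1B68', 'A')."""
    entry = entry.strip()
    return entry[:4], entry[4:]


def extract_chain(pdb_text, chain_id):
    """Keep ATOM/HETATM/TER records whose chain ID (column 22) matches."""
    kept = []
    for line in pdb_text.splitlines():
        is_coordinate_record = line.startswith(("ATOM", "HETATM", "TER"))
        # Column 22 of the PDB format is index 21 in a Python string.
        if is_coordinate_record and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"


def fetch_pdb(pdb_id):
    """Download one PDB entry as text (needs network access)."""
    with urlopen(RCSB_URL.format(pdb_id=pdb_id)) as response:
        return response.read().decode("ascii", errors="replace")


def extract_from_list(list_path, out_dir):
    """Read entries line by line and write one file per requested chain."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for entry in Path(list_path).read_text().split():
        pdb_id, chain_id = split_entry(entry)
        chain_text = extract_chain(fetch_pdb(pdb_id), chain_id)
        (out_dir / f"{pdb_id}_{chain_id}.pdb").write_text(chain_text)
```

Calling `extract_from_list("pdb.txt", "chains")` would then write files such as `chains/1B68_A.pdb`. Filtering on the chain column after download, rather than hard-coding a chain, is what lets the B chain of 1BZ4 come out correctly.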

R running average for non-time data

≯℡__Kan透↙ posted on 2019-12-01 00:37:30
This is the plot I have now. It's generated from this code:

    ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
      geom_point() +
      facet_grid(~ CHROM,scales="free_x",space="free_x") +
      theme(strip.text.x = element_text(size=40),
            strip.background = element_rect(color='lightblue',fill='lightblue'),
            legend.position="top",
            legend.title = element_text(size=40,colour="lightblue"),
            legend.text = element_text(size=40),
            legend.key.size = unit(2.5, "cm")) +
      guides(fill = guide_legend(title.position="top", title = "Legend:GT='REF'+'ALT'"),
             shape = guide_legend(override.aes=list(size=10))) +
      scale_y_log10(breaks

SMILES from graph

ε祈祈猫儿з posted on 2019-11-30 20:48:41
Is there a method or package that converts a graph (or adjacency matrix) into a SMILES string? For instance, I know the atoms are [6 6 7 6 6 6 6 8] ([C C N C C C C O]), and the adjacency matrix is

    [[ 0., 1., 0., 0., 0., 0., 0., 0.],
     [ 1., 0., 2., 0., 0., 0., 0., 1.],
     [ 0., 2., 0., 1., 0., 0., 0., 0.],
     [ 0., 0., 1., 0., 1., 0., 0., 0.],
     [ 0., 0., 0., 1., 0., 1., 0., 0.],
     [ 0., 0., 0., 0., 1., 0., 1., 1.],
     [ 0., 0., 0., 0., 0., 1., 0., 0.],
     [ 0., 1., 0., 0., 0., 1., 0., 0.]]

I need some function to output 'CC1=NCCC(C)O1'. It also works if some function can output the corresponding "mol" object

R running average for non-time data

荒凉一梦 posted on 2019-11-30 19:39:37
Question: This is the plot I have now. It's generated from this code:

    ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
      geom_point() +
      facet_grid(~ CHROM,scales="free_x",space="free_x") +
      theme(strip.text.x = element_text(size=40),
            strip.background = element_rect(color='lightblue',fill='lightblue'),
            legend.position="top",
            legend.title = element_text(size=40,colour="lightblue"),
            legend.text = element_text(size=40),
            legend.key.size = unit(2.5, "cm")) +
      guides(fill = guide_legend(title.position="top",

Reverse complement of DNA strand using Python

感情迁移 posted on 2019-11-30 14:20:25
I have a DNA sequence and would like to get its reverse complement using Python. The sequence is in one of the columns of a CSV file, and I'd like to write the reverse complement to another column in the same file. The tricky part is that a few cells contain something other than A, T, G and C. I was able to get the reverse complement with this piece of code:

    def complement(seq):
        complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
        bases = list(seq)
        bases = [complement[base] for base in bases]
        return ''.join(bases)

    def reverse_complement(s):
        return complement(s[::-1])

    print "Reverse Complement:"
    print
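One way to cope with cells that contain characters other than A, T, G and C is a translation table, which leaves any unmapped character untouched instead of raising a KeyError as the dictionary lookup above does. The sketch below is a suggestion and not the asker's code; passing unknown characters through unchanged (and handling lowercase bases) is an assumption about the desired behaviour, and the helper `add_reverse_complement_column` with its column names is hypothetical.

```python
# Translation table for the four bases, upper and lower case.
# Characters that are not in the table (for example 'N' or '-') are kept as-is.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement, leaving unknown characters unchanged."""
    return seq.translate(COMPLEMENT_TABLE)[::-1]


def add_reverse_complement_column(rows, source="sequence", target="revcomp"):
    """Add a reverse-complement field to each row (a dict, as from csv.DictReader)."""
    for row in rows:
        row[target] = reverse_complement(row[source])
    return rows
```

The rows could come from `csv.DictReader` and be written back with `csv.DictWriter`; whether non-ACGT cells should instead be skipped or flagged depends on the data.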

Querying the DNS service records to find the hostname and TCP/IP

淺唱寂寞╮ posted on 2019-11-30 14:15:22
In a paper about Life Science Identifiers (see "LSID Tester, a tool for testing Life Science Identifier resolution services"), Dr Roderic DM Page wrote:

    Given the LSID urn:lsid:ubio.org:namebank:11815, querying the DNS for the SRV record for _lsid._tcp.ubio.org returns animalia.ubio.org:80 as the location of the ubio.org LSID service.

I learned that I can link _lsid._tcp.ubio.org to animalia.ubio.org:80 using the host command on Unix:

    host -t srv _lsid._tcp.ubio.org
    _lsid._tcp.ubio.org has SRV record 1 0 80 ANIMALIA.ubio.org

How can I do this 'DNS' thing using the Java J2SE API

How to remove rows with 0 values using R

∥☆過路亽.° posted on 2019-11-30 13:44:13
Hi, I am using a matrix of gene expression fragment counts to calculate differentially expressed genes. I would like to know how to remove the rows whose values are 0. My data set would then be more compact, and the downstream analysis I do with this matrix would give fewer spurious results.

Input

    gene         ZPT.1  ZPT.0  ZPT.2  ZPT.3  PDGT.1  PDGT.0
    XLOC_000001   3516    626   1277    770    4309    9030
    XLOC_000002    342     82    185     72     835    1095
    XLOC_000003   2000    361    867    438     454     687
    XLOC_000004    143     30     67     37      90     236
    XLOC_000005      0      0      0      0       0       0
    XLOC_000006      0      0      0      0       0       0
    XLOC_000007      0      0      0      0       1       3
    XLOC_000008      0      0      0      0       0       0
    XLOC_000009      0      0      0      0
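The question asks for R, but the filtering logic itself is language-independent and can be sketched in plain Python. The sketch assumes that "rows which have values as 0" means rows in which every count is zero, so a gene such as XLOC_000007 with a few non-zero counts is kept; the function name `drop_all_zero_rows` and the (gene, counts) row layout are illustrative choices, not part of the original post.

```python
def drop_all_zero_rows(rows):
    """Keep only rows that have at least one non-zero count.

    Each row is a (gene, counts) pair, where counts is a list of integers.
    """
    return [(gene, counts) for gene, counts in rows if any(c != 0 for c in counts)]


# A few complete rows copied from the input table in the question.
counts_table = [
    ("XLOC_000004", [143, 30, 67, 37, 90, 236]),
    ("XLOC_000005", [0, 0, 0, 0, 0, 0]),
    ("XLOC_000006", [0, 0, 0, 0, 0, 0]),
    ("XLOC_000007", [0, 0, 0, 0, 1, 3]),
]

filtered = drop_all_zero_rows(counts_table)
for gene, counts in filtered:
    print(gene, *counts)
```

If the intent were instead to drop any row containing at least one zero, the condition would change from `any(c != 0 ...)` to `all(c != 0 ...)`.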

Complement a DNA sequence

空扰寡人 posted on 2019-11-30 13:39:45
Suppose I have a DNA sequence. I want to get the complement of it. I used the following code but I am not getting it. What am I doing wrong?

    s=readline()
    ATCTCGGCGCGCATCGCGTACGCTACTAGC
    p=unlist(strsplit(s,""))
    h=rep("N",nchar(s))
    unlist(lapply(p,function(d){
      for b in (1:nchar(s)) {
        if (p[b]=="A") h[b]="T"
        if (p[b]=="T") h[b]="A"
        if (p[b]=="G") h[b]="C"
        if (p[b]=="C") h[b]="G"
      }

Use chartr, which is built for this purpose:

    > s
    [1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
    > chartr("ATGC","TACG",s)
    [1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"

Just give it two equal-length character strings and your string. Also

Pandas: .groupby().size() and percentages

妖精的绣舞 posted on 2019-11-30 13:11:05
I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:

    Localization                            RNA level
    cytoplasm                               1 Non-expressed     7
                                            2 Very low         13
                                            3 Low               8
                                            4 Medium            6
                                            5 Moderate          8
                                            6 High              2
                                            7 Very high         6
    cytoplasm & nucleus                     1 Non-expressed     5
                                            2 Very low          8
                                            3 Low               2
                                            4 Medium           10
                                            5 Moderate         16
                                            6 High              6
                                            7 Very high         5
    cytoplasm & nucleus & plasma membrane   1 Non-expressed     6
                                            2 Very low          3
                                            3 Low               3
                                            4 Medium            7
                                            5 Moderate          8
                                            6 High              4
                                            7 Very high         1

What I want to do is to calculate the separate occurrences (i.e. the last column coming from .size()) as a percentage of the total number of occurrences

Finding matching keys in two large dictionaries and doing it fast

◇◆丶佛笑我妖孽 posted on 2019-11-30 11:08:18
I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries. Say for example:

    myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' }
    myNames = { 'Actinobacter': '8924342' }

I want to print out the value for Actinobacter (8924342) since it matches a value in myRDP. The following code works, but is very slow:

    for key in myRDP:
        for jey in myNames:
            if key == jey:
                print key, myNames[key]

I've tried the following but it always results in a KeyError:

    for key in myRDP:
        print myNames[key]

Is there perhaps a function implemented in C for doing
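A possible approach, sketched in Python 3 (the post uses Python 2 print statements): dictionary key views support set operations, so the shared keys can be found with a single intersection instead of a nested loop over 600k × 600k pairs. The KeyError in the second attempt arises because some keys of myRDP, such as 'subtilus sp.', do not exist in myNames. The name `common_keys` is illustrative.

```python
myRDP = {"Actinobacter": "GATCGA...TCA", "subtilus sp.": "ATCGATT...ACT"}
myNames = {"Actinobacter": "8924342"}

# Key views behave like sets; the intersection is computed with hash lookups,
# so the cost grows roughly with the size of the smaller dictionary.
common_keys = myRDP.keys() & myNames.keys()

for key in sorted(common_keys):
    print(key, myNames[key])

# Equivalent single pass that avoids the KeyError by testing membership first.
matches = {key: myNames[key] for key in myRDP if key in myNames}
```

Both variants rely on the constant-time membership test of dictionaries, which is what makes them much faster than comparing every pair of keys.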