bioinformatics

Finding overlap in ranges with R

那年仲夏 submitted on 2019-12-17 10:33:51
Question: I have two data.frames, each with three columns: chrom, start & stop; let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row, by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop. Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to
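The containment test described in this question can be sketched in a few lines. This is a minimal illustration in Python rather than the asker's R code, assuming each range is a plain `(chrom, start, stop)` tuple; the function name `find_containing` and the sample data are made up for the example.

```python
from collections import defaultdict


def find_containing(ranges_a, ranges_b):
    """For each range in ranges_a, return the index of the first range in
    ranges_b that fully contains it, or None if there is no such range."""
    # Group the B ranges by chromosome so only same-chromosome pairs are compared.
    by_chrom = defaultdict(list)
    for idx, (chrom, start, stop) in enumerate(ranges_b):
        by_chrom[chrom].append((start, stop, idx))

    result = []
    for chrom, start, stop in ranges_a:
        hit = None
        for b_start, b_stop, idx in by_chrom.get(chrom, []):
            # Same condition as in the question: A lies inside B.
            if start >= b_start and stop <= b_stop:
                hit = idx
                break
        result.append(hit)
    return result


# Hypothetical sample data, not taken from the question.
ranges_a = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 10, 20)]
ranges_b = [("chr1", 50, 250), ("chr1", 600, 800), ("chr2", 15, 30)]
matches = find_containing(ranges_a, ranges_b)
print(matches)
```

For large inputs, a sorted index or an interval tree would likely scale better than this per-chromosome linear scan.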

Remove part of string after “.”

…衆ロ難τιáo~ submitted on 2019-12-16 21:30:10
Question: I am working with NCBI Reference Sequence accession numbers like variable a: a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2") To get information from the biomart package I need to remove the .1, .2 etc. after the accession numbers. I normally do this with this code: b <- sub("..*", "", a) # [1] "" "" "" "" "" "" But as you can see, this isn't the correct way for this variable. Can anyone help me with this? Answer 1: You just need to escape the
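To see why the pattern in the question returns empty strings, here is a small sketch of the same regular expressions in Python's `re` module (the question itself uses R's `sub()`). An unescaped `.` matches any character, so `..*` consumes the whole string, while `\.` matches only a literal dot. The variable names are illustrative.

```python
import re

# First three accession numbers from the question.
accessions = ["NM_020506.1", "NM_020519.1", "NM_001030297.2"]

# Unescaped dot: "." matches any character, so "..*" matches the whole string.
unescaped = [re.sub(r"..*", "", acc) for acc in accessions]

# Escaped dot: "\." matches a literal ".", so only the version suffix is removed.
escaped = [re.sub(r"\..*", "", acc) for acc in accessions]

print(unescaped)
print(escaped)
```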

FSL doesn't install on Ubuntu 16.04 from NeuroDebian

匆匆过客 submitted on 2019-12-14 03:31:13
Question: I am trying to install FSL on Ubuntu 16.04 64-bit. I followed the procedure on the NeuroDebian website, selected the correct package, and specified ALL software. When I copy-paste the commands into my terminal, the pipe hangs without prompting for my sudo password: wget -O- http://neuro.debian.net/lists/xenial.de-m.full | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list I also tried to separate the pipe into the two commands. The first one runs; the second asks for my password and then hangs. I can't

How to concatenate (merge) AAStringSets by name?

你离开我真会死。 submitted on 2019-12-14 03:04:07
Question: In the bioinformatics/microbial ecology literature, a fairly common practice is to concatenate multiple sequence alignments of multiple genes prior to building phylogenetic trees. In R terminology it may be clearer to say 'merge' these sequences by the organism they came from, but I'm sure examples are better. Say these are two multiple sequence alignments. library(Biostrings) set1<-AAStringSet(c("IVR", "RDG", "LKS")) names(set1)<-paste("org", 1:3, sep="_") set2<-AAStringSet(c("VRT", "RKG", "AST")
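The idea of concatenating alignments by organism name can be sketched with plain dictionaries. This is a Python illustration, not Biostrings code. The names for the second set are cut off in the snippet above, so the `org_1`..`org_3` names and their order in `set2` below are assumptions made purely for the example, as is the helper `concat_by_name`.

```python
def concat_by_name(*alignments):
    """Concatenate sequences that share a name across several alignments.

    Each alignment is a dict of name -> aligned sequence. A name missing from
    an alignment is padded with gaps ('-') of that alignment's width.
    """
    # Collect names in order of first appearance.
    names = []
    for aln in alignments:
        names.extend(n for n in aln if n not in names)

    merged = {}
    for name in names:
        parts = []
        for aln in alignments:
            # Assumes every alignment is non-empty with equal-length sequences.
            width = len(next(iter(aln.values())))
            parts.append(aln.get(name, "-" * width))
        merged[name] = "".join(parts)
    return merged


set1 = {"org_1": "IVR", "org_2": "RDG", "org_3": "LKS"}
# Hypothetical names and order: the question's names for set2 are truncated.
set2 = {"org_3": "AST", "org_1": "VRT", "org_2": "RKG"}

merged = concat_by_name(set1, set2)
print(merged)
```

Keying on the name rather than on position means the result does not depend on the order of sequences within each set.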

How would you create and traverse a hash of hashes (of depth n), whereby the values at depth n are integers?

蓝咒 submitted on 2019-12-14 03:01:16
Question: I want to store DNA sequences of size n in the described data structure. Each hash can contain the keys C, G, A, T, whose values are themselves hashes. Those hashes are of exactly the same kind: they have the four keys C, G, A, T, whose values are again hashes. This structure is consistent for n levels of hashes. However, the last level of hashes will instead have integer values, which represent the count of the sequence from level 1 to level n. Given the data ('CG', 'CA', 'TT', 'CG'),
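One way to picture the structure described in this question is a nested dictionary with counts at the deepest level. The sketch below uses Python dicts in place of hashes and assumes every sequence has the same length n (at least 1); `add_sequence` and `walk` are illustrative names, not part of the question.

```python
def add_sequence(tree, seq):
    """Insert seq into the nested dict, incrementing the count at depth n."""
    node = tree
    # Descend through (or create) one dict level per base, except the last.
    for base in seq[:-1]:
        node = node.setdefault(base, {})
    # The final base maps to an integer count instead of another dict.
    node[seq[-1]] = node.get(seq[-1], 0) + 1


def walk(tree, prefix=""):
    """Yield (sequence, count) pairs by depth-first traversal."""
    for base, value in tree.items():
        if isinstance(value, dict):
            yield from walk(value, prefix + base)
        else:
            yield prefix + base, value


counts = {}
for seq in ("CG", "CA", "TT", "CG"):  # data from the question
    add_sequence(counts, seq)

print(counts)
print(dict(walk(counts)))
```

The same two functions work unchanged for any n, since the depth is determined by the length of the inserted sequences.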

TypeError: expected bytes, str found in custom python function

本秂侑毒 submitted on 2019-12-13 18:25:50
Question: I am using a new bioinformatics tool called Giggle and I have installed the Python wrapper on my system. Even though the scenario is quite specific, I think the problem is quite general. This function: index = Giggle.create("index", "HMEC_hg19_BroadHMM_ALL.bed") should create an index based on several (or in this case one) .bed files. The bed files look like this: chr1 10000 10600 15_Repetitive/CNV 0 . 10000 10600 245,245,245 chr1 10600 11137 13_Heterochrom/lo 0 . 10600 11137 245,245,245 chr1

Custom Merge Function in R

被刻印的时光 ゝ submitted on 2019-12-13 18:05:05
Question: I have a large data set and I want to write a custom merge function to use with apply, but I can't solve a certain issue. I can't use a loop as it will take too long. The data roughly looks like this: # [ Name, Strand, Start, End ] R1 = c( 'GeneA', '+', 1000, 1500 ) R2 = c( 'GeneA', '+', 1510, 2000 ) R3 = c( 'GeneA', '+', 2001, 2500 ) R4 = c( 'GeneB', '-', 3100, 4000 ) The data is a data.frame with rows R1:R4. So far I can get a function which compares Ri and Rj (j = i + 1) and merges them if

Subset columns of one data frame according to another data frame's rows

让人想犯罪 __ submitted on 2019-12-13 16:24:07
Question: I would like to subset some of its columns according to another data frame's rows. So the two data frames are as shown below: df1 <- structure(list(ID = structure(c(3L, 1L, 2L, 5L, 4L), .Label = c("cg08", "cg09", "cg29", "cg36", "cg65"), class = "factor"), chr = c(16L, 3L, 3L, 1L, 8L), gene = c(534L, 376L, 171L, 911L, 422L), GS12 = c(0.15, 0.87, 0.6, 0.1, 0.72), GS32 = c(0.44, 0.93, 0.92, 0.07, 0.91), GS56 = c(0.46, 0.92, 0.62, 0.06, 0.87), GS87 = c(0.79, 0.93, 0.86, 0.08, 0.88)), .Names = c(

Select sequences in a fasta file with more than 300 aa and “C” occurs at least 4 times

故事扮演 submitted on 2019-12-13 14:28:10
Question: I have a fasta file which contains protein sequences. I'd like to select sequences with more than 300 amino acids in which cysteine (C) appears more than 4 times. I've used this command to select sequences with more than 300 aa: cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }' An example sequence: >jgi|Triasp1|216614|CE216613_3477 MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
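The two filters described here (length and cysteine count) can be combined in a single pass. The following is a plain-Python sketch, not a bioawk solution; `read_fasta` and `select` are hypothetical helper names and the demo records are synthetic. Note that the title says "at least 4 times" while the question body says "more than 4 times"; the `min_cys` parameter below defaults to 5 (more than 4) and can be set to 4 for the other reading.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file-like object."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select(records, min_len=301, min_cys=5):
    """Keep records with at least min_len residues and min_cys cysteines."""
    return [(n, s) for n, s in records
            if len(s) >= min_len and s.count("C") >= min_cys]


# Synthetic demo input; a real run would open the FASTA file instead.
demo = io.StringIO(
    ">long_cys\n" + "C" * 5 + "A" * 300 + "\n"
    ">long_no_cys\n" + "A" * 305 + "\n"
    ">short\nCCCCCC\n"
)
kept = select(read_fasta(demo))
for name, seq in kept:
    print(">" + name)
    print(seq)
```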

how to convert PHYLIP format to FASTA

怎甘沉沦 submitted on 2019-12-13 13:17:04
Question: I have just started working with Perl and I have a question. I have a PHYLIP file and I need to convert it into FASTA. I started writing a script. First, I removed the spaces in the lines; now I need to wrap the lines so that every line has 60 amino acids, and the sequence identifier should be printed on a new line. Maybe someone could give me some advice? Answer 1: The BioPerl Bio::AlignIO module might help. It supports the PHYLIP sequence format: phylip2fasta.pl use strict; use warnings; use Bio::AlignIO; # http:/
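The wrapping step the asker describes (60 residues per line, identifier on its own line) can be sketched without any library. This is a Python illustration and not the BioPerl approach from the answer. It assumes a relaxed, sequential PHYLIP layout: a header line with the counts, then one line per sequence with a whitespace-free name followed by the sequence. Interleaved or strict fixed-width PHYLIP would need a real parser. The function name `phylip_to_fasta` is made up for the example.

```python
def phylip_to_fasta(text, width=60):
    """Convert relaxed sequential PHYLIP text to FASTA wrapped at `width`."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    out = []
    # Skip the first line, which holds the number of taxa and characters.
    for line in lines[1:]:
        name, *parts = line.split()
        # Remove any spaces inside the sequence by joining the pieces.
        seq = "".join(parts)
        out.append(">" + name)
        # Emit the sequence in chunks of `width` residues.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Synthetic input, not taken from the question.
phylip_text = (
    "2 70\n"
    "seqA " + "A" * 70 + "\n"
    "seqB " + "C" * 35 + " " + "C" * 35 + "\n"
)
fasta_text = phylip_to_fasta(phylip_text)
print(fasta_text, end="")
```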