bioinformatics

R, biocLite, error installing DESeq2

孤人 submitted on 2019-12-11 13:12:27
Question: I've been trying to install DESeq2 to do some analysis for a couple of days now. R and biocLite are up to date, but I run into permission errors when I try to run biocLite("DESeq2"). I receive mostly good messages, but at the end I get:
1: In install.packages(pkgs = pkgs, lib = lib, repos = repos, ...) : installation of package ‘XML’ had non-zero exit status
2: In install.packages(pkgs = pkgs, lib = lib, repos = repos, ...) : installation of package ‘annotate’ had non-zero exit status
3: In

Find multiple matches of this and that nucleotide sequence

六月ゝ 毕业季﹏ submitted on 2019-12-11 12:09:35
Question: I want to find every occurrence of ATG...TAG or ATG...TAA. I have tried the following:
    #!/usr/bin/perl
    use warnings;
    use strict;
    my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');
    while ($file =~ /(?=(ATG\w+?TAG|ATG\w+?TAA))/g) {
        print "$1\n";
    }
which gives:
    ATGCCCCCCCCCCCCCTAG
    ATGAAAAAAAAAATAAATGAAAAATAG
    ATGAAAAATAG
I want:
    ATGCCCCCCCCCCCCCTAG
    ATGAAAAAAAAAATAA
    ATGAAAAATAG
What am I doing wrong?
Answer 1: You are actually very close; it appears from your statement above that
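
A minimal sketch of one way to get the non-overlapping fragments the question asks for, written in Python rather than the question's Perl. It assumes the two stop codons should be tried together at every position, so the lazy `\w+?` stops at whichever of TAG or TAA comes first; the name `find_fragments` is illustrative only, and the truncated answer above may take a different route.

```python
import re

# One lazy run between the start codon and EITHER stop codon.
# Putting the alternation at the end means TAG and TAA are both
# checked at each step, so the nearest stop wins.
STOP_BOUNDED = re.compile(r"ATG\w+?(?:TAG|TAA)")


def find_fragments(seq):
    """Return non-overlapping ATG...TAG / ATG...TAA fragments, left to right."""
    # The pattern has no capturing group, so findall returns whole matches.
    return STOP_BOUNDED.findall(seq)


# The sequence from the question.
seq = "ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC"
fragments = find_fragments(seq)

for fragment in fragments:
    print(fragment)
```

In the original pattern the branch `ATG\w+?TAG` is tried to completion before `ATG\w+?TAA`, so a later TAG can beat an earlier TAA. The lookahead also makes the matches overlap, which this sketch avoids by consuming each fragment.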

Snakemake and Pandas syntax: Getting sample specific parameters from the sample table

本小妞迷上赌 submitted on 2019-12-11 11:53:48
Question: First of all, this could be a duplicate of "Snakemake and pandas syntax". However, I'm still confused, so I'd like to explain again. In Snakemake I have loaded a sample table with several columns. One of the columns is called 'Read1'; it contains sample-specific read lengths. I would like to get this value for every sample separately, as it may differ. What I would expect to work is this:
    rule mismatch_profile:
        input: rseqc_input_bam
        output: os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls
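
A hedged sketch of the per-sample lookup described above, in plain Python so it runs without Snakemake or pandas. A dict keyed by sample name stands in for the sample table. The column name `sample`, the helper names, and the demo values are assumptions, not taken from the question.

```python
import csv
import io
from types import SimpleNamespace


def load_sample_table(handle):
    """Index a tab-separated sample table by its (assumed) 'sample' column."""
    return {row["sample"]: row for row in csv.DictReader(handle, delimiter="\t")}


def read1_length(samples, wildcards):
    """Look up the 'Read1' value for the sample named by the wildcards object."""
    return int(samples[wildcards.sample]["Read1"])


# Hypothetical two-sample table, for illustration only.
demo_table = "sample\tRead1\nS1\t75\nS2\t101\n"
samples = load_sample_table(io.StringIO(demo_table))

# SimpleNamespace mimics the object Snakemake passes to a params function.
print(read1_length(samples, SimpleNamespace(sample="S2")))
```

Inside a rule, the value has to be resolved lazily, because `{sample}` is only known once Snakemake has matched the output file. The usual pattern is a function of `wildcards`, for example `params: read_len=lambda wildcards: read1_length(samples, wildcards)`. With a pandas DataFrame indexed by sample name, the equivalent lookup would be `samples.loc[wildcards.sample, "Read1"]`.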

Unable to run ComBat script from R's sva library

泪湿孤枕 submitted on 2019-12-11 11:52:57
Question: I am trying to run the ComBat script on a dataset with 2 batches, but I am getting errors and I do not know how to inspect the code, since I am an R newbie. I am running the ComBat method this way:
    # Load sva
    library(sva)
    # Read expression values
    dat = read.table('dataset.xls', header=TRUE, sep='\t')
    # Read sample information file about batches
    sif = read.delim('sif.tsv', header=TRUE, sep='\t')
    # Call ComBat
    ComBat(dat=dat, batch=sif$Batch, mod=NULL)
Anyway, my output is: Found 2 batches Found 0

Algorithm to Cluster Similar Strings in Python?

跟風遠走 submitted on 2019-12-11 09:48:29
Question: I'm working on a script that currently contains multiple lists of DNA sequences (each list has a varying number of DNA sequences), and I need to cluster the sequences in each list based on Hamming distance similarity. My current implementation of this (very crude at the moment) extracts the first sequence in the list and calculates the Hamming distance of each subsequent sequence. If it's within a certain Hamming distance, it appends it to a new list which later is used to remove sequences
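
The seed-based approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: all sequences in a list are taken to have equal length, `max_dist` is a hypothetical threshold parameter, and the function names are made up for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def greedy_clusters(seqs, max_dist):
    """Repeatedly take the first remaining sequence as a seed and pull in
    every remaining sequence within max_dist of that seed."""
    remaining = list(seqs)
    clusters = []
    while remaining:
        seed = remaining[0]
        # The seed is at distance 0 from itself, so it joins its own cluster.
        cluster = [s for s in remaining if hamming(seed, s) <= max_dist]
        clusters.append(cluster)
        remaining = [s for s in remaining if hamming(seed, s) > max_dist]
    return clusters


print(greedy_clusters(["AAAA", "AAAT", "TTTT", "TTTA", "AATT"], max_dist=1))
```

Because each cluster is defined relative to its seed only, the result depends on input order, and two members of the same cluster can be up to `2 * max_dist` apart from each other.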

Selecting entries that are numerically close to each other in a database

断了今生、忘了曾经 submitted on 2019-12-11 09:45:18
Question: Let's say I have a table called ABC in an MS Access database. There are several columns in this table, but only two are of interest for this question: "Hugo_Symbol" and "Start_position". "Hugo_Symbol" has gene names, and several rows can have the same Hugo_Symbol, meaning this column has duplicate entries. "Start_position" has numbers, anything from 1000 to 100000000. I want to build a query that returns rows from table ABC that 1) have the same Hugo_Symbol AND 2) Start_position is
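
A rough sketch of a self-join that pairs rows sharing a Hugo_Symbol whose Start_position values lie close together. Several things here are assumptions: the question's closeness criterion is cut off, so `max_gap` is a hypothetical parameter; SQLite (via Python's `sqlite3`) stands in for MS Access, whose SQL dialect may need adjustments for this kind of join; and rows with identical positions are deliberately not paired.

```python
import sqlite3


def close_pairs(conn, max_gap):
    """Return (gene, lower_start, higher_start) for rows of table ABC that
    share a Hugo_Symbol and whose starts differ by at most max_gap."""
    query = """
        SELECT a.Hugo_Symbol, a.Start_position, b.Start_position
        FROM ABC AS a
        JOIN ABC AS b
          ON a.Hugo_Symbol = b.Hugo_Symbol
         AND a.Start_position < b.Start_position
         AND b.Start_position - a.Start_position <= ?
        ORDER BY a.Hugo_Symbol, a.Start_position, b.Start_position
    """
    return conn.execute(query, (max_gap,)).fetchall()


# Tiny in-memory table with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ABC (Hugo_Symbol TEXT, Start_position INTEGER)")
conn.executemany(
    "INSERT INTO ABC VALUES (?, ?)",
    [("TP53", 1000), ("TP53", 1040), ("TP53", 90000), ("BRCA1", 1010)],
)
print(close_pairs(conn, max_gap=100))
```

The `a.Start_position < b.Start_position` condition reports each pair once instead of twice and keeps a row from matching itself.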

Issue with lapply using biomart

丶灬走出姿态 submitted on 2019-12-11 09:19:14
Question: I am trying to use lapply to change the species name when extracting all the human genes. I'm still learning how to use lapply, and I can't work out what I'm doing wrong. So far I have:
    library(biomaRt)
I create the marts:
    ensembl_hsapiens <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    ensembl_mmusculus <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
    ensembl_ggallus <- useMart("ensembl", dataset = "ggallus_gene_ensembl")
Set the species:
    species <- c("hsapiens", "mmusculus",

Optimization of an all-paths algorithm

会有一股神秘感。 submitted on 2019-12-11 09:06:24
Question: I've been successful using the following algorithm to complete all-path data up to a path length of 10 on graphs of ~900 nodes. However, I want to scale it up to larger graphs, and I'm wondering if there are further optimizations I can do. So far I have: After a node has completed its DFS, the paths are saved to a hash table. Should said node be encountered again, paths from the hash table are appended so work is not repeated. Nodes are sorted by their degree (highest first). This way nodes most
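
For orientation, a length-bounded DFS that enumerates simple paths with start nodes ordered by degree could look like the sketch below. It is a baseline under assumptions, not the asker's algorithm: the graph is taken to be a dict mapping every node to its neighbours, path length is counted in edges, and the hash-table caching described above is not reproduced, since the snippet does not show how cached paths are reconciled with already-visited nodes.

```python
def all_simple_paths(graph, max_len):
    """Enumerate simple paths with 1..max_len edges.

    graph: dict mapping each node to an iterable of its neighbours
           (every node must appear as a key, even with no neighbours).
    """
    # Start from the highest-degree nodes first, as in the description.
    order = sorted(graph, key=lambda node: len(graph[node]), reverse=True)
    paths = []

    def extend(path, on_path):
        if len(path) > 1:
            paths.append(tuple(path))
        if len(path) - 1 == max_len:
            return  # reached the edge budget, do not go deeper
        for nxt in graph[path[-1]]:
            if nxt not in on_path:  # keep the path simple (no repeated nodes)
                path.append(nxt)
                on_path.add(nxt)
                extend(path, on_path)
                on_path.remove(nxt)
                path.pop()

    for start in order:
        extend([start], {start})
    return paths


demo = {"a": ["b", "c"], "b": ["c"], "c": []}
print(all_simple_paths(demo, max_len=2))
```

The recursion depth never exceeds `max_len + 1`, so a bound of 10 is far from Python's recursion limit. The cost is dominated by the number of paths, which grows quickly with node degree.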

Unstable output values from ANN and improving accuracy

久未见 submitted on 2019-12-11 08:10:45
Question: I am trying to develop an Artificial Neural Network using PyBrain to model biological data. My ANN compiles and runs, but its accuracy is very low, never surpassing ~62%. From a coding perspective, how can I improve the ANN's accuracy? I also noticed that the outputs of the ANN are not the same each time, even though the test data set doesn't change. Is there a reason the ANN is acting so unstably, and how can I improve this? Thank you! :)
Answer 1: If you are creating new

Rosalind “Mendel's First Law” IPRB

谁都会走 submitted on 2019-12-11 07:08:36
Question: As preparation for an upcoming bioinformatics course, I am doing some assignments from rosalind.info. I am currently stuck on the assignment "Mendel's First Law". I think I could brute-force my way through this, but somehow my thinking must be too convoluted. My approach would be this: Build a tree of probabilities which has three levels. There are two creatures that mate, creature A and creature B. First level is, what is the probability for picking as creature A homozygous dominant (k)
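
The three-level probability tree described above can be sketched as a direct enumeration. The snippet only names `k` (homozygous dominant); treating `m` as the number of heterozygous and `n` as the number of homozygous recessive organisms follows the Rosalind problem statement and is an assumption here, as is the function name.

```python
from fractions import Fraction
from itertools import product


def dominant_probability(k, m, n):
    """Probability that the offspring of two randomly chosen, distinct
    organisms shows the dominant phenotype."""
    counts = {"AA": k, "Aa": m, "aa": n}
    total = k + m + n
    # Chance that a parent of each genotype passes on the dominant allele.
    p_dom_allele = {"AA": Fraction(1), "Aa": Fraction(1, 2), "aa": Fraction(0)}

    result = Fraction(0)
    for first, second in product(counts, repeat=2):
        # Level 1: pick creature A.
        p_first = Fraction(counts[first], total)
        # Level 2: pick creature B from the remaining organisms.
        left = counts[second] - (1 if first == second else 0)
        p_second = Fraction(left, total - 1)
        # Level 3: the offspring is recessive only if both pass on 'a'.
        p_recessive = (1 - p_dom_allele[first]) * (1 - p_dom_allele[second])
        result += p_first * p_second * (1 - p_recessive)
    return result


print(float(dominant_probability(2, 2, 2)))
```

Exact fractions keep rounding out of the intermediate steps. The nine genotype pairings are the leaves of the first two tree levels, and the third level collapses to a single "not both recessive" factor.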