bioinformatics | 易学教程

R, biocLite, error installing DESeq2

Question: I've been trying to install DESeq2 to do some analysis for a couple of days now. R and biocLite are up to date, but I'm running into permission errors. When I run biocLite("DESeq2"), most of the messages look fine, but at the end I get:
    1: In install.packages(pkgs = pkgs, lib = lib, repos = repos, ...) : installation of package ‘XML’ had non-zero exit status
    2: In install.packages(pkgs = pkgs, lib = lib, repos = repos, ...) : installation of package ‘annotate’ had non-zero exit status
    3: In

Find multiple matches of this and that nucleotide sequence

Question: I want to find every occurrence of ATG...TAG or ATG...TAA. I have tried the following:
    #!/usr/bin/perl
    use warnings;
    use strict;
    my $file = ('ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC');
    while ($file =~ /(?=(ATG\w+?TAG|ATG\w+?TAA))/g) {
        print "$1\n";
    }
which gives:
    ATGCCCCCCCCCCCCCTAG
    ATGAAAAAAAAAATAAATGAAAAATAG
    ATGAAAAATAG
I want:
    ATGCCCCCCCCCCCCCTAG
    ATGAAAAAAAAAATAA
    ATGAAAAATAG
What am I doing wrong?
Answer 1: You are actually very close, it appears from your statement above that
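A minimal sketch in Python (not the original answer, which is cut off above) of one way to get the three matches the question asks for. It rests on the observation that a backtracking regex engine tries the first alternative, ATG\w+?TAG, to completion before the second, so the lazy quantifier can run past a TAA until it reaches a TAG. Moving the alternation to the end of the pattern lets the match stop at whichever stop codon comes first. The name find_orfs is hypothetical. Unlike the lookahead in the question, this version returns non-overlapping matches, and it ignores reading frame.

```python
import re

# Lazy body, then whichever stop codon (TAG or TAA) is reached first.
ORF_PATTERN = re.compile(r"ATG\w+?(?:TAG|TAA)")


def find_orfs(sequence):
    """Return non-overlapping ATG...TAG / ATG...TAA stretches, left to right."""
    return ORF_PATTERN.findall(sequence)


if __name__ == "__main__":
    # Sequence copied from the question.
    seq = "ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC"
    for orf in find_orfs(seq):
        print(orf)
```

The trailing ATG followed only by C's is not reported, because no stop codon follows it.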

Snakemake and Pandas syntax: Getting sample specific parameters from the sample table

Question: First of all, this could be a duplicate of "Snakemake and pandas syntax". However, I'm still confused, so I'd like to explain again. In Snakemake I have loaded a sample table with several columns. One of the columns is called 'Read1'; it contains sample-specific read lengths. I would like to get this value for every sample separately, as it may differ. What I would expect to work is this:
    rule mismatch_profile:
        input: rseqc_input_bam
        output: os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls
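A hedged sketch of the per-sample lookup described above, written as plain Python so it can run outside Snakemake. The table contents, the index column name "sample", and the function name read1_length are assumptions for illustration; only the 'Read1' column comes from the question. Snakemake's params directive accepts a function that takes the rule's wildcards, which is how such a lookup is usually wired in.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical sample table; a real workflow would read this from a file.
samples = pd.DataFrame(
    {"sample": ["sampleA", "sampleB"], "Read1": [75, 100]}
).set_index("sample")


def read1_length(wildcards):
    """Look up the Read1 value for the sample named in the wildcards."""
    return int(samples.loc[wildcards.sample, "Read1"])


# In a Snakefile the function would be passed uncalled, for example
#   params: read_length=read1_length
# so that Snakemake calls it with the wildcards of each job.

if __name__ == "__main__":
    # SimpleNamespace stands in for Snakemake's wildcards object here.
    print(read1_length(SimpleNamespace(sample="sampleB")))
```

Passing the function itself, rather than indexing the table with the literal string '{sample}', defers the lookup until the wildcard value is known.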

Unable to run ComBat script from R's sva library

Question: I am trying to run the ComBat script on a dataset with 2 batches, but I am getting errors, and I do not know how to inspect the code since I am an R newbie. I am running the ComBat method this way:
    # Load sva
    library(sva)
    # Read expression values
    dat = read.table('dataset.xls', header=TRUE, sep='\t')
    # Read sample information file about batches
    sif = read.delim('sif.tsv', header=TRUE, sep='\t')
    # Call ComBat
    ComBat(dat=dat, batch=sif$Batch, mod=NULL)
Anyway, my output is:
    Found 2 batches
    Found 0

Algorithm to Cluster Similar Strings in Python?

Question: I'm working on a script that currently contains multiple lists of DNA sequences (each list has a varying number of DNA sequences), and I need to cluster the sequences in each list based on Hamming distance similarity. My current implementation of this (very crude at the moment) extracts the first sequence in the list and calculates the Hamming distance to each subsequent sequence. If it's within a certain Hamming distance, it appends it to a new list which later is used to remove sequences
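The greedy approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's code: all sequences are assumed to have equal length, and the names hamming, greedy_cluster, and max_dist are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(sequences, max_dist):
    """Repeatedly take the first remaining sequence as a seed and group
    every remaining sequence within max_dist of that seed."""
    remaining = list(sequences)
    clusters = []
    while remaining:
        seed = remaining[0]
        cluster = [s for s in remaining if hamming(seed, s) <= max_dist]
        clusters.append(cluster)
        remaining = [s for s in remaining if hamming(seed, s) > max_dist]
    return clusters


if __name__ == "__main__":
    reads = ["AAAA", "AAAT", "GGGG", "GGGC", "AATT"]
    print(greedy_cluster(reads, max_dist=1))
```

The result depends on input order, because each cluster is defined relative to its seed rather than to all of its members.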

Selecting entries that are numerically close to each other in a database

Question: Let's say I have a table called ABC in an MS Access database. There are several columns in this table, but only two are of interest for this question: "Hugo_Symbol" and "Start_position". "Hugo_Symbol" has gene names, and several lines can have the same Hugo_Symbol, meaning this column has duplicate entries. "Start_position" has numbers, anything from 1000 to 100000000. I want to build a query that returns lines from table ABC that 1) have the same Hugo_Symbol AND 2) Start_position is
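The second condition is cut off above, so the following is only a sketch of the general self-join idea, assuming "close" means "within some maximum gap". It uses Python's built-in sqlite3 module rather than MS Access, so the SQL dialect may need adjusting; close_pairs and max_gap are hypothetical names, and the rows are made-up illustration data.

```python
import sqlite3


def close_pairs(rows, max_gap):
    """Return (symbol, position_a, position_b) for rows that share a
    Hugo_Symbol and whose Start_position values differ by at most max_gap."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ABC (Hugo_Symbol TEXT, Start_position INTEGER)")
    conn.executemany("INSERT INTO ABC VALUES (?, ?)", rows)
    query = """
        SELECT a.Hugo_Symbol, a.Start_position, b.Start_position
        FROM ABC AS a
        JOIN ABC AS b
          ON a.Hugo_Symbol = b.Hugo_Symbol
         AND a.Start_position < b.Start_position
         AND b.Start_position - a.Start_position <= ?
        ORDER BY a.Hugo_Symbol, a.Start_position, b.Start_position
    """
    result = conn.execute(query, (max_gap,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    data = [("TP53", 1000), ("TP53", 1005), ("TP53", 5000), ("KRAS", 1002)]
    print(close_pairs(data, max_gap=10))
```

The strict `<` comparison reports each pair once, but it also skips rows with identical positions; whether that is wanted depends on the truncated requirement.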

Issue with lapply using biomart

Question: I am trying to use lapply to change the species name when extracting all the human genes. I'm still learning how to use lapply, and I can't work out what I'm doing wrong. So far I have:
    library(biomaRt)
I create the marts:
    ensembl_hsapiens <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    ensembl_mmusculus <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
    ensembl_ggallus <- useMart("ensembl", dataset = "ggallus_gene_ensembl")
Set the species:
    species <- c("hsapiens", "mmusculus",

Optimization of an all-paths algorithm

Question: I've been successful using the following algorithm to compute all-paths data up to a path length of 10 on graphs of ~900 nodes. However, I want to scale it up to larger graphs, and I'm wondering if there are further optimizations I can make. So far I have: After a node has completed its DFS, the paths are saved to a hash table. Should said node be encountered again, paths from the hash table are appended so work is not repeated. Nodes are sorted by their degree (highest first). This way nodes most
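For orientation, a baseline depth-limited DFS that enumerates simple paths might look like the sketch below. It deliberately leaves out the hash-table caching and degree ordering mentioned above, since their details are cut off; the adjacency-dict representation and the names simple_paths and max_edges are assumptions.

```python
def simple_paths(graph, start, max_edges):
    """Yield every simple path (no repeated node) that begins at start
    and uses between 1 and max_edges edges."""

    def extend(path, visited):
        if len(path) > 1:
            yield list(path)
        if len(path) - 1 == max_edges:
            return  # depth limit reached, do not go deeper
        for neighbour in graph.get(path[-1], ()):
            if neighbour not in visited:
                path.append(neighbour)
                visited.add(neighbour)
                yield from extend(path, visited)
                path.pop()
                visited.remove(neighbour)

    yield from extend([start], {start})


if __name__ == "__main__":
    g = {"a": ["b", "c"], "b": ["c"], "c": []}
    for p in simple_paths(g, "a", max_edges=2):
        print(p)
```

Using a generator keeps memory proportional to the current path rather than to the number of paths, which can matter when scaling to larger graphs.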

Unstable output values from ANN and improving accuracy

Question: I am trying to develop an Artificial Neural Network using PyBrain to model biological data. My ANN compiles and runs, but its accuracy is very low, never surpassing ~62%. From a coding perspective, how can I improve the ANN's accuracy? I also noticed that the outputs of the ANN differ from run to run, even though the test data set doesn't change. Is there a reason the ANN is acting so unstably, and how can I improve this? Thank you! :)
Answer 1: If you are creating new
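The answer is cut off above, so the following is only a sketch of one common cause of run-to-run variation, assumed rather than confirmed for this PyBrain setup: weights that are randomly initialised each time a network is built. It uses Python's standard random module instead of PyBrain, and init_weights is a hypothetical helper.

```python
import random


def init_weights(n_weights, seed=None):
    """Draw n_weights starting weights uniformly from [-1, 1].

    With seed=None the values differ on every call; with a fixed seed
    the same values are produced each time."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_weights)]


if __name__ == "__main__":
    print(init_weights(3))            # changes from run to run
    print(init_weights(3, seed=42))   # identical on every run
    print(init_weights(3, seed=42))
```

If training starts from different weights each time, the trained network and its outputs will generally differ too; fixing the seed makes runs comparable but does not by itself improve accuracy.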

Rosalind “Mendel's First Law” IPRB

Question: As preparation for an upcoming bioinformatics course, I am doing some assignments from rosalind.info. I am currently stuck on the assignment "Mendel's First Law". I think I could brute-force my way through this, but I suspect my thinking is too convoluted. My approach would be this: build a tree of probabilities which has three levels. There are two creatures that mate, creature A and creature B. The first level is: what is the probability of picking, as creature A, a homozygous dominant (k)
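The three-level tree described above can be walked directly in code. This is a sketch, not a reference solution: it assumes the Rosalind convention that k, m, and n count homozygous dominant, heterozygous, and homozygous recessive individuals, that two distinct individuals are drawn without replacement, and that the goal is the probability of an offspring carrying at least one dominant allele. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def dominant_probability(k, m, n):
    """Probability that the offspring of two randomly chosen individuals
    carries at least one dominant allele (needs k + m + n >= 2)."""
    counts = {"AA": k, "Aa": m, "aa": n}
    # Chance that a parent of each genotype passes on the recessive allele.
    pass_recessive = {"AA": Fraction(0), "Aa": Fraction(1, 2), "aa": Fraction(1)}
    total = k + m + n
    probability = Fraction(0)
    # Level 1: pick creature A. Level 2: pick creature B from the rest.
    for first, second in product(counts, repeat=2):
        p_first = Fraction(counts[first], total)
        left = counts[second] - (1 if first == second else 0)
        p_second = Fraction(left, total - 1)
        # Level 3: the offspring is recessive only if both pass "a".
        p_dominant = 1 - pass_recessive[first] * pass_recessive[second]
        probability += p_first * p_second * p_dominant
    return probability


if __name__ == "__main__":
    result = dominant_probability(2, 2, 2)
    print(result, float(result))
```

Exact fractions avoid rounding questions while checking the tree; the value can be converted to a float at the end.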