bioinformatics | 易学教程

Examples for Topological Sorting on Large DAGs

Question: I am looking for real-world applications where topological sorting is performed on large graphs. Some fields where I imagine you could find such instances are bioinformatics, dependency resolution, databases, hardware design, data warehousing... but I hope some of you may have encountered or heard of specific algorithms/projects/applications/datasets that require topsort. Even if the data/project may not be publicly accessible any hints (and estimates on the order of magnitude of
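For readers unfamiliar with the operation the question refers to, here is a minimal sketch of topological sorting using Kahn's algorithm. The function name `topological_sort` and the edge-list input format are assumptions made for illustration; they do not come from the question. The sketch runs in time linear in the number of nodes and edges, which is why the approach remains usable on large DAGs.

```python
from collections import deque


def topological_sort(edges, nodes=()):
    """Order nodes so that every edge (u, v) has u before v (Kahn's algorithm).

    `edges` is an iterable of (u, v) pairs; `nodes` optionally lists
    isolated nodes that appear in no edge. Raises ValueError on a cycle.
    """
    successors = {}
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        indegree.setdefault(u, 0)
        indegree[v] = indegree.get(v, 0) + 1

    # Start with every node that has no incoming edge.
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        # Removing n from the graph may free up its successors.
        for m in successors.get(n, ()):
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle, no topological order exists")
    return order
```

A dependency-resolution use would pass pairs such as `("libA", "app")`, meaning `libA` must be processed before `app`.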

Subset a file by row and column numbers

Question: We want to subset a text file by rows and columns, where the row and column numbers are read from a file, excluding the header (row 1) and rownames (col 1).
inputFile.txt: tab-delimited text file
    header  62  9  3  54  6  1
    25      1   2  3  4   5  6
    96      1   1  1  1   0  1
    72      3   3  3  3   3  3
    18      0   1  0  1   1  0
    82      1   0  0  0   0  1
    77      1   0  1  0   1  1
    15      7   7  7  7   7  7
    82      0   0  1  1   1  0
    37      0   1  0  0   1  0
    18      0   1  0  0   1  0
    53      0   0  1  0   0  0
    57      1   1  1  1   1  1
subsetCols.txt: comma-separated with no spaces, one row, numbers ordered. In real data we have 500K columns,

Snakemake: unknown output/input files after splitting by chromosome

Question: To speed up a certain Snakemake step I would like to:
1. split my BAM file per chromosome using bamtools split -in sample.bam --reference, which results in files named sample.REF_{chromosome}.bam
2. perform variant calling on each, resulting in e.g. sample.REF_{chromosome}.vcf
3. recombine the obtained VCF files using vcf-concat (VCFtools): vcf-concat file1.vcf file2.vcf file3.vcf > sample.vcf
The problem is that I don't know a priori which chromosomes may be in my BAM file. So I cannot specify

Bash: replace part of filename

Question: I have a command I want to run on all of the files in a folder, and the command's syntax looks like this: tophat -o <output_file> <input_file>
What I would like is a script that loops over all the files in an arbitrary folder and uses the input file names to create similar, but different, output file names. The file names look like this:
    input name               desired output name
    path/to/sample1.fastq    path/to/sample1.bam
    path/to/sample2.fastq    path/to/sample2.bam
Getting the input to work
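The renaming described in the question can be sketched as follows. This is an illustration in Python rather than the Bash the asker wants, and the helper names `output_name` and `tophat_command` are hypothetical. The sketch only builds the argument list for the `tophat -o <output_file> <input_file>` syntax quoted in the question; it does not run the tool.

```python
from pathlib import PurePosixPath


def output_name(input_path):
    """Swap the final extension for .bam, keeping directory and stem."""
    return str(PurePosixPath(input_path).with_suffix(".bam"))


def tophat_command(input_path):
    """Build the argument list for: tophat -o <output_file> <input_file>."""
    return ["tophat", "-o", output_name(input_path), input_path]


if __name__ == "__main__":
    for name in ["path/to/sample1.fastq", "path/to/sample2.fastq"]:
        print(" ".join(tophat_command(name)))
```

To process a real folder, the list of names could come from `pathlib.Path(folder).glob("*.fastq")`, and each argument list could be handed to `subprocess.run`.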

R Bioconductor installation error - Line starting '< DOCTYPE html PUBLI …' is malformed

Question: I'm having trouble installing Bioconductor packages in R. This is on Mac OS X, a fresh install of R 2.15, using BiocInstaller 1.4.4. Transcript follows:
    > source("http://bioconductor.org/biocLite.R")
    BiocInstaller version 1.4.4, ?biocLite for help
    > biocLite("Biobase")
    BioC_mirror: http://bioconductor.org
    Using R version 2.15, BiocInstaller version 1.4.4.
    Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.15
    Installing

SMILES from graph

Question: Is there a method or package that converts a graph (or adjacency matrix) into a SMILES string? For instance, I know the atoms are [6 6 7 6 6 6 6 8] ([C C N C C C C O]), and the adjacency matrix is
    [[ 0., 1., 0., 0., 0., 0., 0., 0.],
     [ 1., 0., 2., 0., 0., 0., 0., 1.],
     [ 0., 2., 0., 1., 0., 0., 0., 0.],
     [ 0., 0., 1., 0., 1., 0., 0., 0.],
     [ 0., 0., 0., 1., 0., 1., 0., 0.],
     [ 0., 0., 0., 0., 1., 0., 1., 1.],
     [ 0., 0., 0., 0., 0., 1., 0., 0.],
     [ 0., 1., 0., 0., 0., 1., 0., 0.]]
I need some

Split a column to multiple columns

Question: I have a table whose first column is:
    chr10:100002872-100002872
    chr10:100003981-100003981
    chr10:100004774-100004774
    chr10:100005285-100005285
    chr10:100007123-100007123
I want to convert it to 3 separate columns, but I couldn't work out how to specify both ":" and "-" as separators for the strsplit command. What should I do?
Answer 1: Here's one way:
    library(data.table)
    DF[, paste0("V1.",1:3) ] <- tstrsplit(DF$V1, ":|-")
    #                          V1  V1.1      V1.2      V1.3
    # 1 chr10:100002872-100002872 chr10 100002872 100002872
    # 2 chr10:100003981-100003981 chr10
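The same idea, splitting on either ":" or "-" with a single regular expression, can be sketched in Python. This is an illustrative translation, not part of the original answer, and the function name `split_region` is hypothetical. It assumes chromosome names contain no hyphen or colon of their own.

```python
import re


def split_region(region):
    """Split 'chr:start-end' into (chromosome, start, end).

    The character class [:-] matches either separator, mirroring the
    ":|-" pattern used in the R answer.
    """
    chrom, start, end = re.split(r"[:-]", region)
    return chrom, int(start), int(end)


if __name__ == "__main__":
    for region in ["chr10:100002872-100002872", "chr10:100003981-100003981"]:
        print(split_region(region))
```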

Processing the input file based on range overlap

Question: I have a huge input file (a representative sample of which is shown below as input):
    > input
      CT1           CT2            CT3
    1 chr1:200-400  chr1:250-450   chr1:400-800
    2 chr1:800-970  chr2:200-500   chr1:700-870
    3 chr2:300-700  chr2:600-1000  chr2:700-1400
I want to process it by following some rules (described below) so that I get an output like:
    > output
                   CT1 CT2 CT3
    chr1:200-400   1   1   0
    chr1:800-970   1   0   0
    chr2:300-700   1   1   0
    chr1:250-450   1   1   0
    chr2:200-500   1   1   0
    chr2:600-1000  0   1   1
    chr1:400-800   0   0   1
    chr1:700-870   0   1   1

Find overlapping regions and extract respective value

Question: How do you find the overlapping coordinates and extract the respective seg.mean values for the overlapping region?
data1
    Rl  pValue    chr  start      end        CNA
    2   2.594433  6    129740000  129780000  gain
    2   3.941399  6    130080000  130380000  gain
    1   1.992114  10   80900000   81100000   gain
    1   7.175750  16   44780000   44920000   gain
data2
    ID    chrom  loc.start  loc.end    num.mark  seg.mean
    8410  6      129750000  129760000  8430      0.0039
    8410  10     80907000   81000000   5         -1.7738
    8410  16     44790000   44910000   12        0.0110
dataoutput
    Rl pValue chr start end
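One way to approach the overlap question is sketched below. The desired output is cut off in the excerpt, so the result format here (chromosome, overlap start, overlap end, seg.mean) is an assumption, as are the function name `find_overlaps`, the tuple layout, and the treatment of coordinates as closed intervals. Two intervals on the same chromosome overlap when each one starts no later than the other ends.

```python
def find_overlaps(regions, segments):
    """Pair each region with every segment it overlaps on the same chromosome.

    regions:  iterable of (chrom, start, end)
    segments: iterable of (chrom, start, end, seg_mean)
    Returns a list of (chrom, overlap_start, overlap_end, seg_mean).
    """
    results = []
    for chrom, start, end in regions:
        for seg_chrom, seg_start, seg_end, seg_mean in segments:
            # Closed-interval overlap test (an assumption about the data).
            if chrom == seg_chrom and seg_start <= end and start <= seg_end:
                results.append(
                    (chrom, max(start, seg_start), min(end, seg_end), seg_mean)
                )
    return results


if __name__ == "__main__":
    # Coordinates copied from data1 and data2 in the question.
    data1 = [
        ("6", 129740000, 129780000),
        ("6", 130080000, 130380000),
        ("10", 80900000, 81100000),
        ("16", 44780000, 44920000),
    ]
    data2 = [
        ("6", 129750000, 129760000, 0.0039),
        ("10", 80907000, 81000000, -1.7738),
        ("16", 44790000, 44910000, 0.0110),
    ]
    for row in find_overlaps(data1, data2):
        print(row)
```

The nested loop is quadratic in the number of rows. For large inputs a sorted sweep or an interval tree would be the usual replacement, at the cost of more code.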

Using the reserved word “class” as field name in Django and Django REST Framework

Question: Description of the problem: Taxonomy is the science of defining and naming groups of biological organisms on the basis of shared characteristics. Organisms are grouped together into taxa (singular: taxon) and these groups are given a taxonomic rank. The principal ranks in modern use are domain, kingdom, phylum, class, order, family, genus and species. More information is available in the Wikipedia articles on Taxonomy and Taxonomic rank. Following the example for the red fox in the article Taxonomic rank in Wikipedia