bioinformatics | 易学教程

Improving clojure lazy-seq usage for iterative text parsing

Question: I'm writing a Clojure implementation of this coding challenge, which asks for the average length of sequence records in FASTA format:

    >1
    GATCGA
    GTC
    >2
    GCA
    >3
    AAAAA

For more background, see this related StackOverflow post about an Erlang solution. My beginner Clojure attempt uses lazy-seq to read the file one record at a time so that it scales to large files. However, it is fairly memory-hungry and slow, so I suspect it is not implemented optimally. Here is a solution using
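
The record-at-a-time idea in the question can be sketched in Python (not the asker's Clojure, and not the solution the truncated snippet refers to). This is a minimal illustration that assumes header lines start with `>` and that a record's sequence may span several lines; the names `record_lengths` and `average_length` are hypothetical.

```python
import io
from typing import Iterable, Iterator


def record_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each FASTA record, one record at a time."""
    length = None  # None until the first header line has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # the previous record is complete
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length  # the last record has no following header


def average_length(lines: Iterable[str]) -> float:
    """Average record length, holding only running totals in memory."""
    total = count = 0
    for n in record_lengths(lines):
        total += n
        count += 1
    return total / count if count else 0.0


# The sample from the question: record lengths 9, 3 and 5.
sample = io.StringIO(">1\nGATCGA\nGTC\n>2\nGCA\n>3\nAAAAA\n")
print(average_length(sample))
```

Because a file object is itself an iterable of lines, passing `open(path)` instead of the `StringIO` sample would process a large file without loading it whole.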

Using pseudocolour in ggplot2 scatter plot to indicate density

Question: Does anyone know how to create a graph like the one in the screenshot? I've tried to get a similar effect by adjusting alpha, but this makes outliers almost invisible. I know this type of graph only from a program called FlowJo, where it is referred to as a "pseudocolored dot plot". I'm not sure whether this is an official term. I'd like to do it specifically in ggplot2, as I need the faceting option. I attached another screenshot of one of my graphs. The vertical lines depict clusters of mutations at

Reverse complement of DNA strand using Python

Question: I have a DNA sequence and would like to get its reverse complement using Python. It is in one of the columns of a CSV file, and I'd like to write the reverse complement to another column in the same file. The tricky part is that a few cells contain something other than A, T, G and C. I was able to get the reverse complement with this piece of code:

    def complement(seq):
        complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
        bases = list(seq)
        bases = [complement[base] for base in bases]
        return '
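
One way to tolerate cells that contain characters other than A, T, G and C is a translation table, which leaves unmapped characters unchanged instead of raising `KeyError`. The sketch below is an assumption about what would suit the asker, not the answer from the original thread; the column names `sequence` and `reverse_complement` and the helper `add_reverse_complement` are hypothetical.

```python
import csv
import io

# Characters missing from the table (for example N or "-") pass through as-is.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def add_reverse_complement(src, dst, column="sequence",
                           new_column="reverse_complement"):
    """Copy CSV rows from src to dst, appending a reverse-complement column."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [new_column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row[new_column] = reverse_complement(row[column])
        writer.writerow(row)


# In-memory demonstration; real files opened with newline="" work the same way.
source = io.StringIO("id,sequence\n1,AACG\n2,ATGCN\n")
target = io.StringIO()
add_reverse_complement(source, target)
print(target.getvalue())
```

Writing to a new file rather than editing the input in place keeps the original data intact if something goes wrong partway through.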

Complement a DNA sequence

Question: Suppose I have a DNA sequence and want to get its complement. I used the following code, but it does not work. What am I doing wrong?

    s=readline()
    ATCTCGGCGCGCATCGCGTACGCTACTAGC
    p=unlist(strsplit(s,""))
    h=rep("N",nchar(s))
    unlist(lapply(p,function(d){
    for b in (1:nchar(s)) {
        if (p[b]=="A") h[b]="T"
        if (p[b]=="T") h[b]="A"
        if (p[b]=="G") h[b]="C"
        if (p[b]=="C") h[b]="G"
    }

Answer 1: Use chartr, which is built for this purpose:

    > s
    [1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
    > chartr("ATGC","TACG",s)
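
For comparison, the one-to-one character mapping that `chartr` performs has a close Python counterpart in `str.maketrans` and `str.translate`. This is only an illustration of the same technique in another language, not part of the original R answer; the name `complement` is hypothetical.

```python
# Same mapping as chartr("ATGC", "TACG", s): A->T, T->A, G->C, C->G.
TABLE = str.maketrans("ATGC", "TACG")


def complement(seq: str) -> str:
    """Return the complement of seq; characters outside ATGC are left unchanged."""
    return seq.translate(TABLE)


s = "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
print(complement(s))
```

As with `chartr`, the whole string is mapped in a single call, so no explicit loop over positions is needed.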

Recursive Generators in Python

Question: I wrote a function to return a generator containing every unique combination of substrings of a given length that contains more than n elements from a primary string. As an illustration: if I have 'abcdefghi', a probe length of two, and a threshold of 4 elements per list, I'd like to get:

    ['ab', 'cd', 'ef', 'gh']
    ['ab', 'de', 'fg', 'hi']
    ['bc', 'de', 'fg', 'hi']

My first attempt at this problem involved returning a list of lists. This ended up overflowing the memory of the computer. As a

How to run binary executables in multi-thread HPC cluster?

Question: I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a high-performance computing (HPC) cluster. I tried to run the job with more than 50 cores and 250 GB of memory allocated, but it uses only one core and limits its memory to less than 2 GB. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster so that they use all the allocated memory?

Answer 1: The scheduler just runs the binary

compare multiple vectors of different lengths, count the elements that are the same, and print out those that are the same and different

Question: I have five vectors with the following format and of varying lengths. They are vectors of single nucleotide polymorphisms (SNPs):

    A <- c("2179_39","2764_47","4521_24","9056_66")
    B <- c("2478_39","2734_47","4531_24","2178_39","2734_47","4521_24")

In R, I would like to: print out which SNPs match between the different vectors; count the number of SNPs that match; print out which SNPs do not match; and count the number of SNPs that do not match. I found the following script that prints out the locations
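
The four tasks map naturally onto set operations. The sketch below uses Python rather than R and makes one assumption the question does not settle: a SNP "matches" only if it appears in every vector. The function name `compare` is hypothetical.

```python
A = ["2179_39", "2764_47", "4521_24", "9056_66"]
B = ["2478_39", "2734_47", "4531_24", "2178_39", "2734_47", "4521_24"]


def compare(*vectors):
    """Return (matching, non_matching) SNP IDs as sorted lists.

    Assumption: an ID matches only when it occurs in all given vectors.
    Duplicates within a vector are collapsed by the conversion to sets.
    """
    sets = [set(v) for v in vectors]
    matching = set.intersection(*sets)
    non_matching = set.union(*sets) - matching
    return sorted(matching), sorted(non_matching)


matching, non_matching = compare(A, B)
print("match:", matching, "count:", len(matching))
print("no match:", non_matching, "count:", len(non_matching))
```

The function accepts any number of vectors, so all five can be passed in one call. If "match" should instead mean "shared by at least two vectors", the intersection step would need to be replaced by a count of how many sets contain each ID.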

How can I download the entire GenBank file with just an accession number?

Question: I've got an array full of accession numbers, and I'm wondering if there's a way to automatically save GenBank files using BioPerl. I know you can grab sequence information, but I want the entire GenBank record.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Bio::DB::GenBank;

    my @accession;
    open (REFINED, "./refine.txt") || die "Could not open: $!";
    while(<REFINED>){
        if(/^(\D+)\|(.*?)\|/){
            push(@accession, $2);
        }
    }
    close REFINED;

    foreach my $number(@accession){
        my $db_obj = Bio::DB::GenBank
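
As an alternative to BioPerl (and not a completion of the truncated script above), a full GenBank flat file can be requested from the NCBI E-utilities `efetch` endpoint with `rettype=gb` and `retmode=text`. The sketch below uses only the Python standard library and assumes the accessions are nucleotide records (`db=nuccore`); the helper names `genbank_url` and `save_genbank` and the sample accession are illustrative.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession: str) -> str:
    """Build an efetch URL that returns the record as GenBank flat text."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def save_genbank(accession: str, out_dir: str = ".") -> Path:
    """Download one record and save it as <accession>.gb (needs network access)."""
    path = Path(out_dir) / f"{accession}.gb"
    with urlopen(genbank_url(accession)) as response:
        path.write_bytes(response.read())
    return path


# Only the URL is built here; call save_genbank() to perform the download.
print(genbank_url("NC_000913.3"))
```

Looping `save_genbank` over the accession array would save one file per record. When fetching many records, it is worth pausing between requests to stay within NCBI's usage limits.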

exec() not returning process ID

Question: I'm using the PHP exec() function to run the Canu assembler, and I want to get its process ID within the same script. The problem is that exec() does not return any PID, even though the process runs successfully. The processes are started like this:

    $gnuplot_path = '/usr/bin/gnuplot';
    $command = 'nohup canu -d . -p E.coli gnuplot='.$gnuplot_path.' genomeSize=4.8m useGrid=false maxThreads=30 -pacbio-raw /path/to/p6.25x.fastq > /path/to/process.err 2>&1 &';

Currently, I try to determine if
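
For comparison only, here is how launching a background process and reading its PID looks in Python, where `subprocess.Popen` exposes the PID directly. This is not a fix for the PHP script; the helper `start_background` is hypothetical, and a short Python child process stands in for the long-running canu command.

```python
import os
import subprocess
import sys
import tempfile


def start_background(command, log_path):
    """Start command without waiting for it; send stdout and stderr to log_path."""
    with open(log_path, "wb") as log:
        # start_new_session detaches the child from this session (POSIX only),
        # roughly the role nohup plays in the shell command.
        proc = subprocess.Popen(command, stdout=log, stderr=subprocess.STDOUT,
                                start_new_session=True)
    return proc


# Stand-in for the real command: a child process that prints one line.
log_file = os.path.join(tempfile.mkdtemp(), "process.err")
proc = start_background([sys.executable, "-c", "print('hello')"], log_file)
print("started PID", proc.pid)

# poll() returns None while the child is running; wait() blocks until it exits.
exit_code = proc.wait()
print("exit code", exit_code)
```

Keeping the `Popen` object around also gives a way to check later whether the process is still running, without parsing process listings.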
