bioinformatics

Improving clojure lazy-seq usage for iterative text parsing

喜欢而已 submitted on 2020-01-13 12:15:10
Question: I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in FASTA format: >1 GATCGA GTC >2 GCA >3 AAAAA For more background see this related StackOverflow post about an Erlang solution. My beginner Clojure attempt uses lazy-seq to read in the file one record at a time so it will scale to large files. However, it is fairly memory hungry and slow, so I suspect that it's not implemented optimally. Here is a solution using
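The record-at-a-time idea described in the question can be sketched in Python with a generator, which keeps only one running length in memory. This is a rough illustration rather than the poster's Clojure code: the function names are hypothetical, and the sample input assumes that each header and each sequence line in the flattened example sits on its own line.

```python
import io


def record_lengths(lines):
    """Yield the sequence length of each FASTA record, one record at a time."""
    length = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # the previous record is complete
            length = 0
        elif line and length is not None:
            length += len(line)  # sequence lines may span several rows
    if length is not None:
        yield length  # last record in the input


def average_length(lines):
    """Average record length, consuming the input lazily."""
    total = count = 0
    for n in record_lengths(lines):
        total += n
        count += 1
    return total / count if count else 0.0


# Assumed layout of the example from the question.
sample = ">1\nGATCGA\nGTC\n>2\nGCA\n>3\nAAAAA\n"
print(average_length(io.StringIO(sample)))
```

Passing an open file handle instead of the StringIO object would stream a large file line by line in the same way.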

Using pseudocolour in ggplot2 scatter plot to indicate density

孤街浪徒 submitted on 2020-01-13 11:19:11
Question: Does someone know how to create a graph like the one in the screenshot? I've tried to get a similar effect by adjusting alpha, but this renders outliers almost invisible. I know this type of graph only from software called FlowJo, where it is referred to as a "pseudocolored dot plot". Not sure if this is an official term. I'd like to do it specifically in ggplot2, as I need the faceting option. I attached another screenshot of one of my graphs. The vertical lines depict clusters of mutations at

Reverse complement of DNA strand using Python

懵懂的女人 submitted on 2020-01-10 19:44:08
Question: I have a DNA sequence and would like to get the reverse complement of it using Python. It is in one of the columns of a CSV file and I'd like to write the reverse complement to another column in the same file. The tricky part is that a few cells contain something other than A, T, G and C. I was able to get the reverse complement with this piece of code: def complement(seq): complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'} bases = list(seq) bases = [complement[base] for base in bases] return '
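One possible way to deal with cells that contain characters other than A, T, G and C is to look each base up with a fallback instead of indexing the dictionary directly. The sketch below is an assumption about the desired behavior, since the excerpt is cut off before the full code and any answer: unknown characters are passed through unchanged, and the function name is illustrative.

```python
# Complement table taken from the dictionary in the question.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base.

    Characters that are not in COMPLEMENT (for example 'N') are kept
    as they are, which is an assumed policy, not the poster's.
    """
    return "".join(COMPLEMENT.get(base, base) for base in reversed(seq))


print(reverse_complement("ATGC"))
print(reverse_complement("AANT"))
```

Using dict.get with the base itself as the default avoids the KeyError that direct indexing raises on unexpected characters.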

Complement a DNA sequence

萝らか妹 submitted on 2020-01-10 19:33:27
Question: Suppose I have a DNA sequence. I want to get the complement of it. I used the following code but I am not getting it. What am I doing wrong? s=readline() ATCTCGGCGCGCATCGCGTACGCTACTAGC p=unlist(strsplit(s,"")) h=rep("N",nchar(s)) unlist(lapply(p,function(d){ for b in (1:nchar(s)) { if (p[b]=="A") h[b]="T" if (p[b]=="T") h[b]="A" if (p[b]=="G") h[b]="C" if (p[b]=="C") h[b]="G" } Answer 1: Use chartr, which is built for this purpose: > s [1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC" > chartr("ATGC","TACG",s)
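The chartr call in the answer maps every character through a fixed translation table in a single pass. For readers working in Python, a rough analogue (not part of the original answer) is str.translate with a table built by str.maketrans; the variable names below are illustrative.

```python
# Translation table equivalent to chartr("ATGC", "TACG", s) in R.
COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")

s = "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
complement = s.translate(COMPLEMENT_TABLE)
print(complement)
```

Characters that are not in the table are left untouched, and applying the same table twice returns the original sequence.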

Recursive Generators in Python

百般思念 submitted on 2020-01-10 14:20:34
Question: I wrote a function to return a generator containing every unique combination of substrings of a given length that contain more than n elements from a primary string. As an illustration: if I have 'abcdefghi', a probe length of two, and a threshold of 4 elements per list, I'd like to get: ['ab', 'cd', 'ef', 'gh'] ['ab', 'de', 'fg', 'hi'] ['bc', 'de', 'fg', 'hi'] My first attempt at this problem involved returning a list of lists. This ended up overflowing the memory of the computer. As a

How to run binary executables in multi-thread HPC cluster?

匆匆过客 submitted on 2020-01-08 02:32:23
Question: I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a high-performance computing (HPC) cluster. I tried to run the job allocating more than 50 cores and 250 GB of memory, but it only uses one core and limits the memory to less than 2 GB. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster so that they use all the allocated memory? Answer 1: The scheduler just runs the binary

compare multiple vectors of different lengths, count the elements that are the same, and print out those that are the same and different

你说的曾经没有我的故事 submitted on 2020-01-06 05:50:07
Question: I have five vectors with the following format, and of varying lengths. They are vectors of single nucleotide polymorphisms (SNPs): A <- c("2179_39","2764_47","4521_24","9056_66") B <- c("2478_39","2734_47","4531_24","2178_39","2734_47","4521_24") In R, I would like to: print out which SNPs match between the different vectors; count the number of SNPs that match; print out which SNPs do not match; and count the number of SNPs that do not match. I found the following script that prints out the locations
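The four tasks listed in the question map naturally onto set operations. The question asks for R and its accepted approach is not visible in this excerpt, so the Python sketch below is only an illustration of the idea. It uses the two vectors shown, and it assumes that "match" means present in every vector and "do not match" means present in some but not all of them.

```python
# The two example vectors from the question; the other three are not shown.
vectors = {
    "A": ["2179_39", "2764_47", "4521_24", "9056_66"],
    "B": ["2478_39", "2734_47", "4531_24", "2178_39", "2734_47", "4521_24"],
}

# Convert each vector to a set so duplicates within a vector count once.
sets = [set(values) for values in vectors.values()]

shared = set.intersection(*sets)          # SNPs found in every vector
not_shared = set.union(*sets) - shared    # SNPs missing from at least one

print("matching:", sorted(shared), len(shared))
print("not matching:", sorted(not_shared), len(not_shared))
```

The same two lines handle any number of vectors, because set.intersection and set.union accept an arbitrary number of sets.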

How can I download the entire GenBank file with just an accession number?

廉价感情. submitted on 2020-01-05 08:07:43
Question: I've got an array full of accession numbers, and I'm wondering if there's a way to automatically save GenBank files using BioPerl. I know you can grab sequence information, but I want the entire GenBank record. #!/usr/bin/env perl use strict; use warnings; use Bio::DB::GenBank; my @accession; open (REFINED, "./refine.txt") || die "Could not open: $!"; while(<REFINED>){ if(/^(\D+)\|(.*?)\|/){ push(@accession, $2); } } close REFINED; foreach my $number(@accession){ my $db_obj = Bio::DB::GenBank

exec() not returning process ID

时间秒杀一切 submitted on 2020-01-05 07:36:42
Question: I'm using the PHP exec() function to run the Canu assembler, and I want to get its process ID within the same script. The problem is that exec() does not return a PID, even though the process is running successfully. The processes are started like this: $gnuplot_path = '/usr/bin/gnuplot'; $command = 'nohup canu -d . -p E.coli gnuplot='.$gnuplot_path.' genomeSize=4.8m useGrid=false maxThreads=30 -pacbio-raw /path/to/p6.25x.fastq > /path/to/process.err 2>&1 &'; Currently, I try to determine if