snakemake

Snakemake and Pandas syntax: Getting sample specific parameters from the sample table

Submitted by 本小妞迷上赌 on 2019-12-11 11:53:48
Question: First of all, this could be a duplicate of Snakemake and pandas syntax. However, I'm still confused, so I'd like to explain again. In Snakemake I have loaded a sample table with several columns. One of the columns, 'Read1', contains sample-specific read lengths. I would like to get this value for every sample separately, as it may differ. What I would expect to work is this: rule mismatch_profile: input: rseqc_input_bam output: os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls
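A common pattern for this is to index the table by sample name and look the value up through an input/params function that receives the wildcards object. A minimal sketch, using a hypothetical two-sample table in place of the question's sample sheet:

```python
import pandas as pd
from io import StringIO
from types import SimpleNamespace

# Hypothetical sample table standing in for the question's sample sheet.
tsv = "SampleName\tRead1\nA\t100\nB\t150\n"
sample_table = pd.read_table(StringIO(tsv)).set_index("SampleName", drop=False)

def read1_length(wildcards):
    # Look up the per-sample 'Read1' value by the {sample} wildcard.
    return int(sample_table.loc[wildcards.sample, "Read1"])

# In a Snakefile this would typically be wired in via params, e.g.:
# params: read_len = lambda wildcards: sample_table.loc[wildcards.sample, "Read1"]

demo = read1_length(SimpleNamespace(sample="A"))
```

The key point is that `wildcards.sample` is only available inside a function (or lambda) evaluated per job, which is why a plain `sample_table.loc[...]` at the top of a rule does not work.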

Snakemake and pandas syntax

Submitted by 試著忘記壹切 on 2019-12-11 11:06:18
Question: I have an input file as follows: SampleName Run Read1 Read2 A run1 test/true_data/4k_R1.fq test/true_data/4k_R2.fq A run2 test/samples/A.fastq test/samples/A2.fastq B run1 test/samples/B.fastq test/samples/B2.fastq C run1 test/samples/C.fastq test/samples/C5.fastq D So I am getting all the indexes in an array: sample_table = pd.read_table('samples.tsv', sep=' ', lineterminator='\n') sample_table = sample_table.drop_duplicates(subset='SampleName', keep='first', inplace=False) sample_table = sample
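The excerpt's approach can be sketched end to end with a small in-memory stand-in for the space-separated samples.tsv: read the table, keep one row per sample, and index by sample name so that later wildcard lookups work.

```python
import pandas as pd
from io import StringIO

# Minimal stand-in for the question's space-separated samples.tsv.
data = StringIO(
    "SampleName Run Read1 Read2\n"
    "A run1 a1.fq a2.fq\n"
    "A run2 b1.fq b2.fq\n"
    "B run1 c1.fq c2.fq\n"
)
sample_table = pd.read_table(data, sep=" ")
# Keep one row per sample, then index by sample name for per-wildcard lookup.
sample_table = sample_table.drop_duplicates(subset="SampleName", keep="first")
sample_table = sample_table.set_index("SampleName", drop=False)
SAMPLES = list(sample_table.index)
```

`SAMPLES` can then drive `expand(...)` in a `rule all`, and `sample_table.loc[wildcards.sample, "Read1"]` retrieves the per-sample path.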

Snakemake: How to use config file efficiently

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 08:54:43
Question: I'm using the following config file format in Snakemake for some sequencing analysis practice (I have loads of samples, each containing 2 fastq files): samples: Sample1_XY: - fastq_files/SRR4356728_1.fastq.gz - fastq_files/SRR4356728_2.fastq.gz Sample2_AB: - fastq_files/SRR6257171_1.fastq.gz - fastq_files/SRR6257171_2.fastq.gz I'm using the following rules at the start of my pipeline to run fastqc and for alignment of the fastq files: import os # read config info into this namespace
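Once such a config.yaml is loaded, `config["samples"]` is just a dict mapping sample names to fastq pairs, so the sample list and per-sample inputs fall out directly. A sketch, assuming the question's config has been loaded into a `config` dict:

```python
# Sketch, assuming the question's config.yaml is loaded into `config` so that
# config["samples"] maps each sample name to its [R1, R2] fastq pair.
config = {
    "samples": {
        "Sample1_XY": ["fastq_files/SRR4356728_1.fastq.gz",
                       "fastq_files/SRR4356728_2.fastq.gz"],
        "Sample2_AB": ["fastq_files/SRR6257171_1.fastq.gz",
                       "fastq_files/SRR6257171_2.fastq.gz"],
    }
}

def fastqs_for(sample):
    # In a Snakefile: input: lambda wildcards: config["samples"][wildcards.sample]
    return config["samples"][sample]

SAMPLES = sorted(config["samples"])
```

`SAMPLES` feeds `expand(...)` in a `rule all`, and the lambda form of `fastqs_for` hands both fastq files of a sample to fastqc or the aligner without repeating paths in every rule.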

how to pass a function under snakemake run directive

Submitted by 大憨熊 on 2019-12-11 07:36:32
Question: I am building a workflow in Snakemake and would like to reuse one of the rules for two different input sources. The input sources could be either source1 or source1+source2, and depending on the input, the output directory would also vary. Since this was quite complicated to do in the same rule, and I didn't want to create a copy of the full rule, I would like to create two rules with different input/output but running the same command. Is it possible to make this work? I get the DAG resolved
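One way to share a command between two rules is to factor it into a plain Python function defined at the top of the Snakefile and call it from each rule's `run:` block. A minimal sketch; `mytool` and the paths are hypothetical placeholders for whatever command the rules actually run:

```python
import shlex

# Shared command builder: both rules call this from their run: blocks,
# so the command line is written only once.
def build_command(inputs, outdir):
    return "mytool -o {} {}".format(
        shlex.quote(outdir),
        " ".join(shlex.quote(str(i)) for i in inputs))

# In a Snakefile (sketch):
# rule from_source1:
#     run: shell(build_command(input, output[0]))
# rule from_both_sources:
#     run: shell(build_command(input, output[0]))

cmd = build_command(["source1.txt", "source2.txt"], "results/both")
```

Because `input` behaves like a list of paths inside `run:`, the same function handles the one-source and two-source cases without duplicating the rule body.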

Using multiple filenames as wildcards in Snakemake

Submitted by 回眸只為那壹抹淺笑 on 2019-12-11 06:29:13
Question: I am trying to create a rule in Snakemake that runs bedtools closest on a file against a bunch of files in another directory. What I have, under the /home/bedfiles directory, is 20 bed files: 1A.bed, 2B_83.bed, 3f_33.bed ... What I want, under the /home/bedfiles directory, is 20 modified bed files: 1A_modified, 2B_83_modified, 3f_33_modified ... So the bash command would be: filelist='/home/bedfiles/*.bed' for mfile in $filelist; do bedtools closest -a /home/other/merged.txt -b ${mfile} >
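The bash loop maps each `X.bed` to `X_modified`; in Snakemake that mapping is usually expressed with `glob_wildcards` plus a pattern rule, while the name transformation itself is plain string work. A sketch (the Snakefile portion is shown in comments, using the question's paths):

```python
import os

# The name mapping the rule needs: /home/bedfiles/X.bed -> X_modified.
def modified_name(bed_path):
    stem = os.path.splitext(os.path.basename(bed_path))[0]
    return stem + "_modified"

# In a Snakefile the usual pattern is glob_wildcards plus a pattern rule:
# NAMES, = glob_wildcards("/home/bedfiles/{name}.bed")
# rule all:
#     input: expand("/home/bedfiles/{name}_modified", name=NAMES)
# rule closest:
#     input: "/home/bedfiles/{name}.bed"
#     output: "/home/bedfiles/{name}_modified"
#     shell: "bedtools closest -a /home/other/merged.txt -b {input} > {output}"

names = [modified_name(p) for p in ["1A.bed", "2B_83.bed", "3f_33.bed"]]
```

Snakemake then runs one `closest` job per matched `{name}`, replacing the shell loop entirely.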

snakemake define folder as output

Submitted by 試著忘記壹切 on 2019-12-11 05:38:29
Question: I am trying to run prokka using Snakemake and a rule all, in which I define all the output folders that prokka will produce to write its results. Prokka requires a folder to be supplied as the output rather than a file. A simplified version of what I have is here: PATIENTID_ls = range(2) rule all: input: expand("results_{subjectID}_outputfolder",subjectID=PATIENTID_ls), rule prokka: input: "contigs/subject_{subjectID}/contigs.fasta", output: "results/subject_{subjectID}_outputfolder", shell:
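Snakemake can declare a folder as an output explicitly with the `directory()` flag (available since Snakemake 5.2). A sketch of the setup above; note that the excerpt's `rule all` pattern (`results_{subjectID}_outputfolder`) does not match the prokka rule's output path, so the sketch uses one consistent pattern, and the exact prokka flags are an assumption:

```
PATIENTID_ls = range(2)

rule all:
    input:
        expand("results/subject_{subjectID}_outputfolder", subjectID=PATIENTID_ls)

rule prokka:
    input:
        "contigs/subject_{subjectID}/contigs.fasta"
    output:
        directory("results/subject_{subjectID}_outputfolder")
    shell:
        "prokka --outdir {output} {input}"
```

Without `directory()`, Snakemake treats the output string as a file and the DAG never considers the folder "done".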

Can a snakemake input rule be defined with different paths/wildcards

Submitted by 江枫思渺然 on 2019-12-11 04:08:54
Question: I want to know if one can define an input rule that has dependencies on different wildcards. To elaborate, I am running this Snakemake pipeline on different fastq files using qsub, which submits each job to a different node:
1. fastqc on the original fastq - no downstream dependency on other jobs
2. adapter/quality trimming to generate trimmed fastq
3. fastqc_after on trimmed fastq (output from step 2), no downstream dependency
4. star-rsem pipeline on trimmed fastq (output from step 2 above)
5. rsem and
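Steps like these do not need any special wiring: Snakemake's DAG follows input/output paths, so branches with no downstream consumer (steps 1 and 3) are simply listed in `rule all` alongside the final targets. A sketch of the branching, with hypothetical paths and shell directives omitted:

```
SAMPLES = ["s1", "s2"]  # hypothetical

rule all:
    input:
        expand("qc_raw/{sample}_fastqc.html", sample=SAMPLES),
        expand("qc_trimmed/{sample}_fastqc.html", sample=SAMPLES),
        expand("rsem/{sample}.genes.results", sample=SAMPLES)

rule fastqc_raw:
    input: "fastq/{sample}.fastq.gz"
    output: "qc_raw/{sample}_fastqc.html"

rule trim:
    input: "fastq/{sample}.fastq.gz"
    output: "trimmed/{sample}.fastq.gz"

rule fastqc_after:
    input: "trimmed/{sample}.fastq.gz"
    output: "qc_trimmed/{sample}_fastqc.html"

rule star_rsem:
    input: "trimmed/{sample}.fastq.gz"
    output: "rsem/{sample}.genes.results"
```

`fastqc_raw` depends only on the raw fastq, while `fastqc_after` and `star_rsem` both hang off `trim`'s output; under a qsub profile each job still lands on its own node.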

Snakemake: rule for using many inputs for one output with multiple sub-groups

Submitted by 核能气质少年 on 2019-12-10 19:25:33
Question: I have a working pipeline I'm using for downloading, aligning and performing variant calling on public sequencing data. The problem is that it currently only works on a per-sample basis (i.e., treating each individual sequencing experiment as a sample). It doesn't work if I want to perform variant calling on a group of experiments (such as biological and/or technical replicates of a sample). I've tried solving it, but I couldn't get it to work. Here's a simplification of the alignment rule: rule
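The usual approach for many-inputs-per-output is a group-to-members mapping plus an input function that expands it, so one variant-calling job receives every member BAM. A sketch with a hypothetical `GROUPS` table and path pattern:

```python
# Hypothetical mapping from a group name to its member runs
# (biological and/or technical replicates of one sample).
GROUPS = {
    "sampleA": ["run1", "run2"],
    "sampleB": ["run1"],
}

def group_bams(group):
    # In a Snakefile: input: lambda wildcards: group_bams(wildcards.group)
    return ["aligned/{}_{}.bam".format(group, run) for run in GROUPS[group]]

bams = group_bams("sampleA")
```

The alignment rule keeps working per run, while the calling rule's `{group}` wildcard pulls in all of that group's BAMs at once.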

Default memory request with possibility of override in a Snakefile?

Submitted by 天涯浪子 on 2019-12-10 18:31:44
Question: I have a Snakefile with several rules, and only a few of them need more than 1 GB/core to run on a cluster. The resources directive is great for this, but I can't find a way of setting a default value. I would prefer not having to write resources: mem_per_cpu = 1024 for every rule that doesn't need more than the default. I realize that I could get what I want using __default__ in a cluster config file and overriding the mem_per_cpu value for specific rules. I hesitate to do this because the memory
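The fallback-with-override idea can also live in the Snakefile itself as a small lookup function; newer Snakemake releases (5.9+) offer the same effect built in via the `--default-resources` command-line option. A sketch with a hypothetical override table:

```python
# Default memory per core, with per-rule overrides for the few heavy rules.
DEFAULT_MEM_PER_CPU = 1024  # MB, applies unless a rule is listed below
MEM_OVERRIDES = {"align": 4096}  # hypothetical rule name

def mem_per_cpu(rule_name):
    return MEM_OVERRIDES.get(rule_name, DEFAULT_MEM_PER_CPU)

# In a rule: resources: mem_per_cpu = mem_per_cpu("align")
```

This keeps the memory requirements next to the rules rather than in a cluster config file, which is the concern the question raises about the `__default__` approach.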

Snakemake: How do I use a function that takes in a wildcard and returns a value?

Submitted by 巧了我就是萌 on 2019-12-10 10:45:29
Question: I have CRAM (BAM) files that I want to split by read group. This requires reading the header and extracting the read group IDs. I have this function, which does that in my Snakemake file: def identify_read_groups(cram_file): import subprocess command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" ' read_groups = subprocess.check_output(command, shell=True) read_groups = read_groups.split('\n')[:-1] return(read_groups) I have this rule all: rule all: input: expand(
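Note that under Python 3 the posted function breaks: `check_output` returns bytes, so `read_groups.split('\n')` raises a TypeError. A sketch of a fixed version that also moves the header parsing into pure Python, so only the `samtools` call needs the external tool:

```python
import subprocess

def parse_read_groups(header_text):
    # Collect the ID: field from every @RG line of a SAM header.
    read_groups = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            for field in line.split("\t"):
                if field.startswith("ID:"):
                    read_groups.append(field[3:])
    return read_groups

def identify_read_groups(cram_file):
    # text=True makes check_output return str instead of bytes,
    # which is what the original bytes.split('\n') stumbled over.
    header = subprocess.check_output(
        ["samtools", "view", "-H", cram_file], text=True)
    return parse_read_groups(header)

# Parsing demo on a tiny hand-written header (no samtools needed).
demo = parse_read_groups("@HD\tVN:1.6\n@RG\tID:rg1\tSM:A\n@RG\tID:rg2\tSM:A\n")
```

The returned list can then feed a function used by `expand(...)` in `rule all`, answering the wildcard-to-value question the title asks.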