snakemake

Snakemake: Error when trying to generate multiple output files

Submitted by £可爱£侵袭症+ on 2019-12-08 13:01:48
Question: I'm writing a Snakemake pipeline to take publicly available SRA files, convert them to fastq files, then run them through alignment, peak calling and LD score regression. I'm having an issue in the rule called SRA2fastq below, in which I use parallel-fastq-dump to convert SRA files to paired-end fastq files. This rule generates two outputs for each SRA file, SRRXXXXXXX_1 and SRRXXXXXXX_2. Here is my config file: samples: fullard2018_NpfcATAC_1: SRR5367824 fullard2018_NpfcATAC_2: SRR5367798
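A minimal sketch of a rule that declares both mates as outputs, assuming the accessions sit under config["samples"], that parallel-fastq-dump is allowed to fetch the accession itself, and that --split-files naming ({accession}_1/_2) applies; paths and the target list are illustrative, not the poster's actual workflow:

```
configfile: "config.yml"

# One instance of SRA2fastq produces both mate files for a given accession.
rule all:
    input:
        expand("fastq/{srr}_{read}.fastq.gz",
               srr=config["samples"].values(), read=["1", "2"])

rule SRA2fastq:
    output:
        "fastq/{srr}_1.fastq.gz",
        "fastq/{srr}_2.fastq.gz"
    threads: 4
    shell:
        "parallel-fastq-dump --sra-id {wildcards.srr} --threads {threads} "
        "--outdir fastq --split-files --gzip"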

Running parallel instances of a single job/rule on Snakemake

Submitted by 丶灬走出姿态 on 2019-12-08 05:26:25
问题 Unexperienced, self-tought "coder" here, so please be understanding :] I am trying to learn and use Snakemake to construct pipeline for my analysis. Unfortunatly, I am unable to run multiple instances of a single job/rule at the same time. My workstation is not a computing cluster, so I cannot use this option. I looked for an answer for hours, but either there is non, or I am not knowledgable enough to understand it. So: is there a way to run multiple instances of a single job/rule
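One common answer, sketched below with made-up file and tool names: give the workflow more cores than a single rule instance needs, and Snakemake schedules independent instances of the same rule in parallel.

```
SAMPLES = ["s1", "s2", "s3", "s4"]

rule all:
    input:
        expand("results/{sample}.done", sample=SAMPLES)

rule process:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}.done"
    threads: 1
    shell:
        "some_tool {input} > {output}"
```

With a hypothetical invocation like `snakemake --cores 4`, up to four independent instances of `process` (each asking for one thread) run at the same time on a single workstation.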

snakemake, how to build a loop for two independent parameters

Submitted by 放肆的年华 on 2019-12-08 02:38:26
I want to loop Snakemake over two different wildcards, which - I think - are somehow independent from each other. In case there is already a solved thread for this case, I would be happy for a hint, but so far I'm not sure what the correct terms are to look for what I want to do. Let's assume my pipeline has three steps. I have a set of samples which I process in each of those three steps, but in the second step I apply an extra parameter to every sample. In the third step I then have to iterate through the samples and their associated parameters. Because of this structure, I think it's not
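A minimal sketch of one way to handle this, assuming the per-sample parameter can be kept in a dictionary (all names below are made up): the third step keeps a single {sample} wildcard and looks the parameter up from it, so no second wildcard is needed.

```
SAMPLES = ["A", "B", "C"]
PARAM = {"A": 10, "B": 20, "C": 30}   # hypothetical per-sample parameter

rule all:
    input:
        expand("step3/{sample}.txt", sample=SAMPLES)

rule step3:
    input:
        "step2/{sample}.txt"
    output:
        "step3/{sample}.txt"
    params:
        p=lambda wildcards: PARAM[wildcards.sample]
    shell:
        "tool --param {params.p} {input} > {output}"
```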

Snakemake - Override LSF (bsub) cluster config in a rule-specific manner

Submitted by 我的梦境 on 2019-12-07 15:46:30
Is it possible to define default settings for memory and resources in a cluster config file, and then override them in a rule-specific manner when needed? Is the resources field in rules directly tied to the cluster config file, or is it just a fancier version of the params field for readability purposes? In the example below, how do I use the default cluster configs for rule a, but use custom changes (memory=40000 and rusage=15000) in rule b? cluster.json: { "__default__": { "memory": 20000, "resources": "\"rusage[mem=8000] span[hosts=1]\"", "output": "logs/cluster/{rule}.{wildcards}.out", "error": "logs/cluster/
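A sketch of how a per-rule block in cluster.json can override __default__. The keys follow the question; the error path and the bsub flags further down are illustrative assumptions. The resources keyword in a rule is a separate mechanism and is not read from this file unless the --cluster string references it.

```
{
    "__default__": {
        "memory": 20000,
        "resources": "\"rusage[mem=8000] span[hosts=1]\"",
        "output": "logs/cluster/{rule}.{wildcards}.out",
        "error": "logs/cluster/{rule}.{wildcards}.err"
    },
    "b": {
        "memory": 40000,
        "resources": "\"rusage[mem=15000] span[hosts=1]\""
    }
}
```

Snakemake would then be launched with something like `snakemake --cluster-config cluster.json --cluster "bsub -M {cluster.memory} -R {cluster.resources} -o {cluster.output} -e {cluster.error}"`, where rule b picks up its own block and every other rule falls back to __default__.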

snakemake: how to implement log directive when using run directive?

Submitted by 大城市里の小女人 on 2019-12-07 09:34:17
Snakemake allows creation of a log for each rule via the log directive, which specifies the name of the log file. It is relatively straightforward to pipe results from shell output to this log, but I am not able to figure out a way of logging the output of the run directive (i.e. Python code). One workaround is to save the Python code in a script and then run it from the shell, but I wonder if there is another way? I have some rules that use both the log and run directives. In the run directive, I "manually" open and write the log file. For instance: rule compute_RPM: input: counts_table = source_small_RNA
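A minimal sketch of that "manual" approach (rule and file names are illustrative, not the poster's): inside a run block, log behaves like input and output, so log[0] can be opened and written like any ordinary file.

```
rule compute_stats:
    input:
        counts_table="data/{sample}.counts"
    output:
        "results/{sample}.rpm"
    log:
        "logs/compute_stats/{sample}.log"
    run:
        with open(log[0], "w") as logfile:
            logfile.write("reading %s\n" % input.counts_table)
            # ... the actual computation would go here ...
            logfile.write("writing %s\n" % output[0])
```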

Snakemake: How do I use a function that takes in a wildcard and returns a value?

Submitted by 雨燕双飞 on 2019-12-06 05:49:23
I have CRAM (BAM) files that I want to split by read group. This requires reading the header and extracting the read group IDs. I have this function which does that in my Snakemake file: def identify_read_groups(cram_file): import subprocess command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" ' read_groups = subprocess.check_output(command, shell=True) read_groups = read_groups.split('\n')[:-1] return(read_groups) I have this rule all: rule all: input: expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))
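The core problem is that expand() runs while '{sample}' is still a literal string, so the helper never sees a real file. A hedged sketch of one workaround (the sample list and paths are assumed, not taken from the question): enumerate the concrete samples first and build the target list per sample.

```
import subprocess

SAMPLES = ["sampleA", "sampleB"]          # assumed sample names

def identify_read_groups(cram_file):
    command = ('samtools view -H ' + cram_file +
               ' | grep ^@RG | cut -f2 | cut -f2 -d":"')
    out = subprocess.check_output(command, shell=True, text=True)
    return out.splitlines()

def all_read_group_bams():
    targets = []
    for sample in SAMPLES:
        for rg in identify_read_groups("cram/%s.bam.cram" % sample):
            targets.append("cram/RG_bams/%s.RG%s.bam" % (sample, rg))
    return targets

rule all:
    input:
        all_read_group_bams()
```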

SnakeMake rule with Python script, conda and cluster

Submitted by 萝らか妹 on 2019-12-04 15:21:18
I would like to get Snakemake running a Python script with a specific conda environment via an SGE cluster. On the cluster I have Miniconda installed in my home directory. My home directory is mounted via NFS, so it is accessible to all cluster nodes. Because Miniconda is in my home directory, the conda command is not on the operating system path by default; i.e., to use conda I need to first explicitly add it to the path. I have a conda environment specification as a YAML file, which could be used with the --use-conda option. Will this work with the --cluster "qsub" option also? FWIW I also launch
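A sketch of how the pieces fit together, with made-up file names; nothing here is specific to the poster's project.

```
rule analyse:
    input:
        "data/{sample}.tsv"
    output:
        "results/{sample}.out"
    conda:
        "envs/myenv.yaml"
    script:
        "scripts/analysis.py"
```

The invocation would then be along the lines of `snakemake --use-conda --jobs 10 --cluster "qsub -V -cwd"`; qsub's -V flag exports the submitting shell's environment variables, including a PATH that already contains conda, to the compute nodes, which is one way around conda not being on the default path.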

How do I interpolate wildcards into a shell command?

Submitted by 邮差的信 on 2019-12-04 06:17:00
Question: I'm trying to build a Snakemake pipeline, but I'm confused why filename wildcards work for input and output, but not for shell. For example, the following works fine: samplelist=[ "aa_S1", "bb_S2"] rule all: input: expand("{sample}.out", sample=samplelist) rule align: input: "{sample}.txt" output: "{sample}.out" shell: "touch {output}" But let's say that the command I use for shell actually derives from a string I give it, so I can't name the output file directly in the shell command. Then
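In shell commands the wildcard is reached through the wildcards object, i.e. {wildcards.sample} rather than {sample}. A minimal sketch built on the question's example (the command itself is a placeholder):

```
samplelist = ["aa_S1", "bb_S2"]

rule all:
    input:
        expand("{sample}.out", sample=samplelist)

rule align:
    input:
        "{sample}.txt"
    output:
        "{sample}.out"
    shell:
        "some_command --name {wildcards.sample} {input} > {output}"
```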

Snakemake: How to save and access sample details in config.yml file?

Submitted by 强颜欢笑 on 2019-12-04 01:58:01
Question: Can anybody help me understand whether it is possible to access sample details from a config.yml file when the sample names are not written into the Snakemake workflow itself? This is so I can re-use the workflow for different projects and only adjust the config file. Let me give you an example: I have four samples that belong together and should be analyzed together. They are called sample1-4. Every sample comes with some more information, but to keep it simple here let's say it's just a name tag such as S1,
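One hedged sketch of how this is often done (the config layout below is an assumption, not the poster's actual file): list the samples and their tags in config.yml and derive everything in the Snakefile from config, so no sample name appears in the workflow itself.

```
# config.yml (assumed layout)
#   samples:
#     sample1: {tag: S1}
#     sample2: {tag: S2}
#     sample3: {tag: S3}
#     sample4: {tag: S4}
configfile: "config.yml"

SAMPLES = list(config["samples"])

rule all:
    input:
        expand("results/{sample}.txt", sample=SAMPLES)

rule tag_sample:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.txt"
    params:
        tag=lambda wc: config["samples"][wc.sample]["tag"]
    shell:
        "echo {params.tag} > {output}"
```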

snakemake wildcards or expand command

Submitted by 牧云@^-^@ on 2019-12-03 21:55:05
Question: I want a rule to perform realignment between normal and tumor samples. The main problem is that I don't know how to approach it: is the wildcard or the expand function the answer to my problem? This is my list of samples: conditions: pair1: tumor: "432" normal: "433" So the rule needs to be something like this: rule gatk_RealignerTargetCreator: input: expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",sample=config['conditions']['pair1']['tumor']), "mapped_reads/merged_samples/{sample}
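A minimal sketch of one way to pair tumor and normal through a {pair} wildcard and input functions, assuming the config layout from the question; the output path and the GATK command line are illustrative (the reference genome and other required options are omitted).

```
def pair_bam(role):
    """Return an input function giving the BAM for 'tumor' or 'normal'."""
    def path(wc):
        sample = config["conditions"][wc.pair][role]
        return "mapped_reads/merged_samples/%s.sorted.dup.reca.bam" % sample
    return path

rule gatk_RealignerTargetCreator:
    input:
        tumor=pair_bam("tumor"),
        normal=pair_bam("normal")
    output:
        "realign/{pair}.intervals"
    shell:
        "java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator "
        "-I {input.tumor} -I {input.normal} -o {output}"
```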