Snakemake: Error when trying to generate multiple output files

Question

I'm writing a Snakemake pipeline that takes publicly available SRA files, converts them to FASTQ files, and then runs them through alignment, peak calling, and LD score regression.

I'm having an issue with the rule SRA2fastq below, in which I use parallel-fastq-dump to convert SRA files to paired-end FASTQ files. This rule generates two outputs for each SRA file: SRRXXXXXXX_1 and SRRXXXXXXX_2.

Here is my config file:

samples:
    fullard2018_NpfcATAC_1: SRR5367824
    fullard2018_NpfcATAC_2: SRR5367798
    fullard2018_NpfcATAC_3: SRR5367778
    fullard2018_NpfcATAC_4: SRR5367754
    fullard2018_NpfcATAC_5: SRR5367729

And here are the first few rules of my Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])

rule all:
    input:
        expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        "FastQC/fastq_multiqc.html",
        expand("peak_files/{sample}_peaks.blrm.narrowPeak", sample=config['samples']),
        "peak_files/Fullard2018_peaks.mrgd.blrm.narrowPeak",
        expand("LD_annotation_files/Fullard_2018.{chr}.l2.ldscore.gz", chr=range(1,23))

rule SRA_prefetch:
    params:
        SRA="{SRA}"
    output:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    log:
        "logs/prefetch/{SRA}.log"
    shell:
        "prefetch {params.SRA}"

rule SRA2fastq:
    input:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    output:
        "fastq_files/{SRA}_1.fastq.gz",
        "fastq_files/{SRA}_2.fastq.gz"
    log:
        "logs/SRA2fastq/{SRA}.log"
    shell:
        """
        parallel-fastq-dump --sra-id {input} --threads 8 \
        --outdir fastq_files --split-files --gzip
        """

rule fastqc:
    input:
        rules.SRA2fastq.output
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{SRA}_{num}_fastqc.html"
    log:
        "logs/FASTQC/{SRA}_{num}.log"
    wrapper:
        "0.27.1/bio/fastqc"

rule multiqc_fastq:
    input:
        lambda wildcards: expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.27.1/bio/multiqc"

rule bowtie2:
    input:
        sample=lambda wildcards: expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=config['samples'][wildcards.sample], num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.27.1/bio/bowtie2/align"

However, when I run the Snakefile I get the following error:

Error in job SRA2fastq while creating output files fastq_files/SRR5367754_1.fastq.gz, fastq_files/SRR5367754_2.fastq.gz

I've seen this error many times before. It is usually caused by the name of the output file generated by the program not exactly matching the output file name specified in the corresponding Snakemake rule. That is not the case here: if I run the command that Snakemake generates for this rule separately, the files are created as expected and the file names match. Here is one instance of the rule, taken from the output of snakemake -np:

rule SRA2fastq:
    input: /home/c1477909/ncbi/public/sra/SRR5367779.sra
    output: fastq_files/SRR5367779_1.fastq.gz, fastq_files/SRR5367779_2.fastq.gz
    log: logs/SRA2fastq/SRR5367779.log
    jobid: 18
    wildcards: SRA=SRR5367779

    parallel-fastq-dump --sra-id /home/c1477909/ncbi/public/sra/SRR5367779.sra --threads 8 --outdir fastq_files --split-files --gzip
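
The rule declares a log file, but the shell command as written never redirects anything to it, so whatever parallel-fastq-dump prints is not saved there. A minimal Python sketch for reproducing the command by hand while keeping its output follows. It assumes parallel-fastq-dump is on the PATH; the helper names build_command and run_with_log are hypothetical, and only the flags already used in the rule appear.

```python
import subprocess
from pathlib import Path


def build_command(sra_path, outdir="fastq_files", threads=8):
    """Return the argument list used by the SRA2fastq rule."""
    return [
        "parallel-fastq-dump",
        "--sra-id", str(sra_path),
        "--threads", str(threads),
        "--outdir", str(outdir),
        "--split-files",
        "--gzip",
    ]


def run_with_log(command, log_path):
    """Run command, send stdout and stderr to log_path, return the exit code."""
    log_path = Path(log_path)
    # Create the log directory if it does not exist yet.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Hypothetical usage, mirroring the job shown above:
# run_with_log(
#     build_command("/home/c1477909/ncbi/public/sra/SRR5367779.sra"),
#     "logs/SRA2fastq/SRR5367779.log",
# )
```

A non-zero return code, together with the contents of the log file, may show whether the tool itself fails when it is started this way.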

Note that the output files generated by running the parallel-fastq-dump command separately (i.e. outside Snakemake) are named as specified in the SRA2fastq rule:

ls fastq_files
SRR5367729_1.fastq.gz  SRR5367729_2.fastq.gz
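
The comparison between the declared outputs and the files actually present can also be scripted. This is a minimal sketch, assuming the naming pattern from the SRA2fastq rule and paths relative to the directory the pipeline is run from. The function name missing_outputs is hypothetical, and this is not how Snakemake itself performs the check.

```python
from pathlib import Path


def missing_outputs(sra_ids, outdir="fastq_files"):
    """Return the declared SRA2fastq outputs that are not present in outdir."""
    expected = [
        Path(outdir) / f"{sra}_{num}.fastq.gz"
        for sra in sra_ids
        for num in (1, 2)  # the two mate files written by --split-files
    ]
    return [str(path) for path in expected if not path.is_file()]
```

After a manual run, missing_outputs(["SRR5367729"]) returns an empty list only if both mate files were written where the rule expects them.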

I'm a bit stumped, as this error is usually easy to fix, but I can't work out what the issue is. I've tried changing the output section of the SRA2fastq rule to:

    output:
        file1="fastq_files/{SRA}_1.fastq.gz",
        file2="fastq_files/{SRA}_2.fastq.gz"

However, this throws the same error. I've also tried specifying just one output file, but that breaks the bowtie2 rule later on, which then fails with a missing input files error.
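
The missing input error is consistent with the bowtie2 rule's input function, which asks for both mate files of a sample. A minimal sketch in plain Python, mimicking the expand call rather than invoking Snakemake (the names SAMPLES, bowtie2_inputs, and unproduced are hypothetical):

```python
# Sample-to-accession mapping copied from the config file above.
SAMPLES = {
    "fullard2018_NpfcATAC_1": "SRR5367824",
    "fullard2018_NpfcATAC_2": "SRR5367798",
    "fullard2018_NpfcATAC_3": "SRR5367778",
    "fullard2018_NpfcATAC_4": "SRR5367754",
    "fullard2018_NpfcATAC_5": "SRR5367729",
}


def bowtie2_inputs(sample, samples=SAMPLES):
    """List the FASTQ files the bowtie2 rule requests for one sample."""
    sra = samples[sample]
    return [f"fastq_files/{sra}_{num}.fastq.gz" for num in (1, 2)]


def unproduced(sample, declared_suffixes, samples=SAMPLES):
    """Return requested inputs that no declared SRA2fastq output would cover."""
    sra = samples[sample]
    declared = {f"fastq_files/{sra}_{s}.fastq.gz" for s in declared_suffixes}
    return [p for p in bowtie2_inputs(sample, samples) if p not in declared]


# With only the _1 file declared as an output, the _2 file has no producer:
print(unproduced("fullard2018_NpfcATAC_5", declared_suffixes=[1]))
# prints ['fastq_files/SRR5367729_2.fastq.gz']
```

With both suffixes declared, unproduced returns an empty list, so dropping one output only moves the failure downstream rather than addressing the original error.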

Any ideas what's going on here? Is there something I'm missing when declaring multiple output files in a single rule?

Many Thanks

Source: https://stackoverflow.com/questions/51946235/snakemake-error-when-trying-to-generate-multiple-output-files

Tags

output

bioinformatics

snakemake