问题
I'm writing a snakemake pipeline to take publicly available sra files, convert them to fastq files then run them through alignment, peak calling and LD score regression.
I'm having an issue in the rule called SRA2fastq
below in which I use parallel-fastq-dump
to convert SRA files to paired end fastq files. This rule generates two outputs for each SRA file, SRRXXXXXXX_1
, and SRRXXXXXXX_2
.
Here is my config file:
samples:
fullard2018_NpfcATAC_1: SRR5367824
fullard2018_NpfcATAC_2: SRR5367798
fullard2018_NpfcATAC_3: SRR5367778
fullard2018_NpfcATAC_4: SRR5367754
fullard2018_NpfcATAC_5: SRR5367729
And here are the first few rules of my Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
"FastQC/fastq_multiqc.html",
expand("peak_files/{sample}_peaks.blrm.narrowPeak", sample=config['samples']),
"peak_files/Fullard2018_peaks.mrgd.blrm.narrowPeak",
expand("LD_annotation_files/Fullard_2018.{chr}.l2.ldscore.gz", chr=range(1,23))
rule SRA_prefetch:
params:
SRA="{SRA}"
output:
"/home/c1477909/ncbi/public/sra/{SRA}.sra"
log:
"logs/prefetch/{SRA}.log"
shell:
"prefetch {params.SRA}"
rule SRA2fastq:
input:
"/home/c1477909/ncbi/public/sra/{SRA}.sra"
output:
"fastq_files/{SRA}_1.fastq.gz",
"fastq_files/{SRA}_2.fastq.gz"
log:
"logs/SRA2fastq/{SRA}.log"
shell:
"""
parallel-fastq-dump --sra-id {input} --threads 8 \
--outdir fastq_files --split-files --gzip
"""
rule fastqc:
input:
rules.SRA2fastq.output
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{SRA}_{num}_fastqc.html"
log:
"logs/FASTQC/{SRA}_{num}.log"
wrapper:
"0.27.1/bio/fastqc"
rule multiqc_fastq:
input:
lambda wildcards: expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2])
output:
"FastQC/fastq_multiqc.html"
wrapper:
"0.27.1/bio/multiqc"
rule bowtie2:
input:
sample=lambda wildcards: expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=config['samples'][wildcards.sample], num=[1,2])
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.27.1/bio/bowtie2/align"
However, when I run the Snakefile I get the following error:
Error in job SRA2fastq while creating output files fastq_files/SRR5367754_1.fastq.gz, fastq_files/SRR5367754_2.fastq.gz
I've seen this error many times before and it's usually caused when the name of output file generated by the program does not exactly match the output file name you specify in the corresponding snakemake rule. However, this is not the case here as if I run the command snakemake generates for this particular rule separately the files are created as expected and the file names match. Here is an example of one instance of the rule taken after running snakemake -np
:
rule SRA2fastq:
input: /home/c1477909/ncbi/public/sra/SRR5367779.sra
output: fastq_files/SRR5367779_1.fastq.gz, fastq_files/SRR5367779_2.fastq.gz
log: logs/SRA2fastq/SRR5367779.log
jobid: 18
wildcards: SRA=SRR5367779
parallel-fastq-dump --sra-id /home/c1477909/ncbi/public/sra/SRR5367779.sra --threads 8 --outdir fastq_files --split-files --gzip
Note the output files generated by the parallel-fastq-dump
command run separately (i.e. not using snakemake) are named as specified in the SRA2fastq
rule:
ls fastq_files
SRR5367729_1.fastq.gz SRR5367729_2.fastq.gz
I'm a bit stumped by this as this error is usually easily rectified but I can't work out what the issue is. I've tried changing the output section of the SRA2fastq
to:
output:
file1="fastq_files/{SRA}_1.fastq.gz",
file2="fastq_files/{SRA}_2.fastq.gz"
However, this throws the same error. I've also tried just specifying one output file but this affects the bowtie2
rule later on as I get an input files missing
error.
Any ideas what's going on here? Is there something I'm missing when trying to look for multiple output files in a single rule?
Many Thanks
来源:https://stackoverflow.com/questions/51946235/snakemake-error-when-trying-to-generate-multiple-output-files