Including unforeseen file names as wildcards in Snakemake

Question

gdc-fastq-splitter splits FASTQ files by read group. For instance, if dummy.fq.gz contains three different read groups, three FASTQ files are generated: dummy_readgroup_1.fq.gz, dummy_readgroup_2.fq.gz, dummy_readgroup_3.fq.gz. Because each original FASTQ file is in a different folder and contains a different number of read groups, the resulting files cannot easily be passed to the following step as wildcards.

Given that I do not know the exact names or the number of the resulting files in advance, is there a way to use the output of one rule as wildcards for the next one? An alternative would be to list all the generated files and provide them as a list in a separate Snakefile. I am hoping for a more elegant solution.

This is my first ever question on StackOverflow, and I tried to check all the existing questions. Please be kind if this question sounds silly or has already been answered :-)

Answer 1:

It is not the prettiest, but this is the way it needs to be done:

import random
from pathlib import Path


SAMPLES = ['dummy', 'dommy']
rule all:
    input:
        [f"do_all_{sample}.out" for sample in SAMPLES]


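def aggregate(wildcards):
    # Request the checkpoint result; this function is re-evaluated once the
    # checkpoint for this sample has finished.
    checkpoints.fastq_splitter.get(sample=wildcards.sample)
    # Discover which read-group files the checkpoint actually produced.
    read_groups = glob_wildcards(f"{wildcards.sample}_{{read_group}}.fastq.gz").read_group
    return [f"bam/{wildcards.sample}_{read_group}.bam" for read_group in read_groups]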


rule do_everything:
    input:
        aggregate
    output:
        touch("do_all_{sample}.out")


rule do_sth_splitted:
    input:
        "{sample}_{read_group}.fastq.gz"
    output:
        touch("bam/{sample}_{read_group}.bam")



# Stand-in for gdc-fastq-splitter: creates between 1 and 5 read-group files.
checkpoint fastq_splitter:
    input:
        "{sample}.fastq.gz"
    output:
        touch("{sample}.done")
    run:
        for i in range(random.randint(1, 5)):
            Path(f'{wildcards.sample}_{i}.fastq.gz').touch()

Before you run it, make sure the sample input files exist: touch d{u,o}mmy.fastq.gz.

In the checkpoint fastq_splitter we generate a random number of "fastq" files. In the rule do_sth_splitted we pretend to align each of them against a genome and get a BAM file for each read group. The rule do_everything is there to inspect the output of the checkpoint fastq_splitter; its input function aggregate is only evaluated after fastq_splitter is done. The rule all makes sure everything is run for all samples.
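To make the discovery step easier to follow outside of Snakemake, here is a minimal plain-Python sketch of what the aggregate function does after the checkpoint has run: it looks for files named {sample}_{read_group}.fastq.gz and maps each one to a BAM target. The helper names discover_read_groups and bam_targets are hypothetical and not part of Snakemake; inside a Snakefile, glob_wildcards plays the role of the regular expression used here.

```python
import re
import tempfile
from pathlib import Path


def discover_read_groups(directory, sample):
    """Return the read-group labels of files named <sample>_<label>.fastq.gz."""
    # Hypothetical helper: mimics what glob_wildcards extracts in the Snakefile.
    pattern = re.compile(rf"^{re.escape(sample)}_(.+)\.fastq\.gz$")
    groups = []
    for path in sorted(Path(directory).iterdir()):
        match = pattern.match(path.name)
        if match:
            groups.append(match.group(1))
    return groups


def bam_targets(directory, sample):
    """Build one BAM target per discovered read group, as aggregate does."""
    return [f"bam/{sample}_{rg}.bam" for rg in discover_read_groups(directory, sample)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate the splitter output for two samples.
        for name in ("dummy.fastq.gz", "dummy_0.fastq.gz",
                     "dummy_1.fastq.gz", "dommy_0.fastq.gz"):
            (Path(tmp) / name).touch()
        print(bam_targets(tmp, "dummy"))  # ['bam/dummy_0.bam', 'bam/dummy_1.bam']
```

Note that the unsplit input dummy.fastq.gz is not picked up, because the pattern requires an underscore followed by a read-group label.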

Take a look at checkpoints in the Snakemake documentation for a more thorough explanation.

Source: https://stackoverflow.com/questions/58747002/including-unforeseen-file-names-as-wildcards-in-snakemake

Tags

snakemake