Snakemake: confusion on how to access config files properly

问题

This question follows on from a question I asked previously and it regards understanding how to access config files correctly using Snakemake. I have a specific problem I need to address which I'll ask first and a general problem understanding how indexing works which I'll ask second.

I'm using snakemake to run and ATAC-seq pipeline from Alignment/QC through to motif analysis.

A: Specific Question

I'm trying to add a rule called trim_galore_pe to trim adapters from my fastq files before alignment and an error statement is thrown from snakemake as the names of the output files generated by trim galore do not match what is expected by snakemake. This is because I cannot work out how to write the output file statement correctly in my snakemake file to make the names match.

An example of the names generated by TRIM GALORE contain SRA numbers, for example:

trimmed_fastq_files/SRR2920475_1_val_1.fq.gz

Whereas the file expected by snakemake contain sample references and should read:

trimmed_fastq_files/Corces2016_4983.7B_Mono_1_val_1.fq.gz

This also affects the subsequent rules after the trim_galore_pe rule. I need to work out a way to use the info in my config file to generate the output files required.

For all rules after those shown in the Snakefile I need files to be named by sample name i.e. Corces2016_4983.7A_Mono. It would also be useful for all the FAST_QC and MULTIQC rules shown in the Snakefile below to have the sample names in the output file name structure, which they all already do in the current Snakefile.

However, inputs for Bowtie2, the FASTQC rules and input and output of the trim_galore_pe rules need to contain the SRA numbers. The problem starts at the output of trim_galore and influences all downstream rules.

Although I have extracted SRA numbers in previous rules, I'm not sure how to do this when not using the fastq_files folder which is explicitly stated in the config file. By introducing the trim_galore_pe rule I have effectively moved a new set of SRA files into the new trimmed_fastq_files folder. How to extract only the SRA number from the list of SRA files config file containing the old folder names whilst referencing the new trimmed_fastq_files folder in the Snakefile is the crux of my issue.

I hope this is clear.

Here is my config file:

samples:
    Corces2016_4983.7A_Mono: fastq_files/SRR2920475
    Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19

Here is my Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/PRETRIM/fastq_multiqc.html",
        "FastQC/POSTTRIM/fastq_multiqc.html"

rule fastqc_pretrim:
    input:
        sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_pretrim:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/PRETRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz",
        "trimmed_fastq_files/{sample}_1.fastq.gz_trimming_report.txt",
        "trimmed_fastq_files/{sample}_2_val_2.fq.gz",
        "trimmed_fastq_files/{sample}_2.fastq.gz_trimming_report.txt"
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

rule fastqc_posttrim:
    input:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_posttrim:
    input:
        expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/POSTTRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule bowtie2:
    input:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
       index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
       extra=""
    threads: 8
    wrapper:
        "0.23.1/bio/bowtie2/align"

This currently runs, and give a full job list using snakemake -np, but throws the error mentioned above.

B: General question

Is there an online resource that explains succinctly how to reference a config file using python, particularly with reference to snakemake? The online docs are pretty insufficient and assume prior knowledge of python.

My programming experience is mainly in bash and R but I like Snakemake and do generally understand how dictionaries and lists work in python and how to reference items stored within them. However I find the complex use of bracketing, wildcards and inverted commas in some of the Snakemake rules above confusing so tend to struggle when trying to reference different parts of file names in the config file. I want to understand fully how to utilise these elements.

For example, in a rule such as this from the Snakefile posted above:

sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

What is actually happening in this command? My understanding is that we are accessing the config file using config['samples'] and we are using the [wildcards.sample] part to explicitly access the fastq_files/SRR2920475 part of the config file. The expand allows us to iterate through each item in the config file that fit the parameters in the command, i.e all the SRA files, and the lambda wildcards is needed to use the wildcards call in the command. What I'm uncertain about is:

What does the f do just after the expand and why is it needed?
Why does config['samples'] contain inverted commas within the square brackets but but inverted commas are not needed around [wildcards.sample]?
Why are the single and double curly brackets used?
Looking at the Snakefile above, a few of the rules contain parts assigning a sequence of numbers to num, but again these numbers are sometimes enclosed around inverted commas and sometimes not...why?

Any advice, tips, pointers would be greatly appreciated.

C: Clarification on suggestions made below by @bli

I have edited my config file as you suggested in your comment and omitted the folder names leaving only the SRA numbers. This make sense to me, but I have a couple of other issues preventing me getting this Snakefile running.

New config file:

samples:
    Corces2016_4983.7A_Mono: SRR2920475
    Corces2016_4983.7B_Mono: SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/c1477909/genomes_and_index_files/hg19

New Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/PRETRIM/fastq_multiqc.html",
        "FastQC/POSTTRIM/fastq_multiqc.html",

rule fastqc_pretrim:
    input:
      lambda wildcards: f"fastq_files/{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_pretrim:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/PRETRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule trim_galore_pe:
    input:
        lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "trimmed_fastq_files/{wildcards.sample}_1_val_1.fq.gz",
        "trimmed_fastq_files/{wildcards.sample}_1.fastq.gz_trimming_report.txt",
        "trimmed_fastq_files/{wildcards.sample}_2_val_2.fq.gz",
        "trimmed_fastq_files/{wildcards.sample}_2.fastq.gz_trimming_report.txt"
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

rule fastqc_posttrim:
    input:
        lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_posttrim:
    input:
        expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/POSTTRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule bowtie2:
    input:
        lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.23.1/bio/bowtie2/align"

Using these new files everything worked fine initially, a partial job list was created by snakemake -np. However, this is because half of the complete job list had already been run; that is the trimmed_fastq_files folder was generated and the correctly named trimmed fastq files were in place within it. When I deleted all the previously created files to see if the entire new version of the Snakefile would work properly, snakemake -np failed, stating that there were missing input files for the rules downstream of the trim_galore_pe rule.

As you can see I'm trying to call the {wildcard.sample} variable set in the input section of the trim_galore_pe rule in the output section, but snakemake doesn't like this. Is is possible to do this?

I also tried this using the tips from the answers below but this didn't work either:

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2]),
        expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz_trimming_report.txt", num=[1,2])
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

The error then stated wildcards not defined. So, logically I tried putting lambda wildcards: in front of the two expand statements of the output section in an attempt to define the wildcards, but this threw a syntax error, Only input files can be specified as functions. I also tried using some of the indexing suggestions below but couldn't get the right combination.

This is probably caused by another thing I'm unsure about regarding Snakefiles and that is how scoping works.

If I define a variable in rule all can all other rules access it?
If I define a variable in the input section of a rule, is it available to all other sections of that rule (i.e. output, shell command etc.), but only that rule?
If yes, why can't I access the {wildcard.sample} variable if I defined it in the input section? Is that because that variable is enclosed within a 'closed' scope lambda function?

Any (further) suggestions would be greatly appreciated.

回答1:

I'll try to answer your question B, and give extra details that I hope can be useful for you and others.

Edit: I added some attempts at answering question C at the end.

Regarding quotes

First, regarding what you call "inverted" commas, they are usually called "single quotes", and they are used in python to build strings. Double quotes can also be used for the same purpose. The main difference is when you try to create strings that contain quotes. Using double quotes allows you to create strings containing single quotes, and vice-versa. Otherwise, you need to "escape" the quote using backslashes ("\"):

s1 = 'Contains "double quotes"'
s1_bis = "Contains \"double quotes\""
s2 = "Contains 'single quotes'"
s2_bis = 'Contains \'single quotes\''

(I tend to prefer double quotes, that's just a personal taste.)

Decomposing the example

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

You are assigning a function (lambda wildcards: ...) to a variable (sample), which happens to belong to the input section of a rule.

This will cause snakemake to use this function when it comes to determine the input of a particular instance of the rule, based on the current values of the wildcards (as inferred from the current value of the output it wants to generate).

For clarity, one could very likely rewrite this by separating the function definition from the rule declaration, without using the lambda construct, and it would work identically:

def determine_sample(wildcards):
    return expand(
        f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
        num=[1,2])

rule trim_galore_pe:
    input:
        sample = determine_sample

expand is a snakemake-specific function (but you can import it in any python program or interactive interpreter with from snakemake.io import expand), that makes it easier to generate lists of strings. In the following interactive python3.6 session we will try to reproduce what happens when you use it, using different native python constructs.

Accessing the configuration

# We'll try to see how `expand` works, we can import it from snakemake
from snakemake.io import expand
    
# We want to see how it works using the following example
# expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

# To make the example work, we will first simulate the reading
# of a configuration file
import yaml

config_text = """
samples:
    Corces2016_4983.7A_Mono: fastq_files/SRR2920475
    Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19
"""
# Here we used triple quotes, to have a readable multi-line string.

# The following is equivalent to what snakemake does with the configuration file:
config = yaml.load(config_text)
config

Output:

{'cell_types': {'Mono': ['Corces2016_4983.7A']},
 'index': '/home/genomes_and_index_files/hg19',
 'samples': {'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
  'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}}

We obtained a dictionary in which the key "samples" is associated with a nested dictionary.

# We can access the nested dictionary as follows
config["samples"]
# Note that single quotes could be used instead of double quotes
# Python interactive interpreter uses single quotes when it displays strings

Output:

{'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
 'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}

# We can access the value corresponding to one of the keys
# again using square brackets
config["samples"]["Corces2016_4983.7A_Mono"]

Output:

'fastq_files/SRR2920475'

# Now, we will simulate a `wildcards` object that has a `sample` attribute
# We'll use a namedtuple for that
# https://docs.python.org/3/library/collections.html#collections.namedtuple
from collections import namedtuple
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
wildcards.sample

Output:

'Corces2016_4983.7A_Mono'

Edit (15/11/2018): I found out a better way of creating wildcards:

from snakemake.io import Wildcards
wildcards = Wildcards(fromdict={"sample": "Corces2016_4983.7A_Mono"})

# We can use this attribute as a key in the nested dictionary
# instead of using directly the string
config["samples"][wildcards.sample]
# No quotes here: `wildcards.sample` is a string variable

Output:

'fastq_files/SRR2920475'

Deconstructing the `expand`

# Now, the expand of the example works, and it results in a list with two strings
expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])    
# Note: here, single quotes are used for the string "sample",
# in order not to close the opening double quote of the whole string

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

# Internally, I think what happens is something similar to the following:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"

# This template is then used for each element of this "list comprehension"    
[filename_template.format(num=num) for num in [1, 2]]

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

# This is equivalent to building the list using a for loop:
filenames = []
for num in [1, 2]:
    filename = filename_template.format(num=num)
    filenames.append(filename)
filenames

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

String templates and formatting

# It is interesting to have a look at `filename_template`    
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# The part between curly braces can be substituted
# during a string formatting operation:
"fastq_files/SRR2920475_{num}.fastq.gz".format(num=1)

Output:

'fastq_files/SRR2920475_1.fastq.gz'

Now let's further show how string formatting can be used.

# In python 3.6 and above, one can create formatted strings    
# in which the values of variables are interpreted inside the string    
# if the string is prefixed with `f`.
# That's what happens when we create `filename_template`:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"    
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

Two substitutions happened during the formatting of the string:

The value of config['samples'][wildcards.sample] was used to make the first part of the string. (Single quotes were used around sample because this python expression was inside a string built with double quotes.)
The double brackets around num were reduced to single ones as part of the formatting operation. That's why we can then use this again in further formatting operations involving num.

# Equivalently, without using 3.6 syntax:    
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix = config["samples"][wildcards.sample])
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# We could achieve the same by first extracting the value
# from the `config` dictionary    
filename_prefix = config["samples"][wildcards.sample]
filename_template = f"{filename_prefix}_{{num}}.fastq.gz"
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# Or, equivalently:
filename_prefix = config["samples"][wildcards.sample]
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix=filename_prefix)
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# We can actually perform string formatting on several variables
# at the same time:
filename_prefix = config["samples"][wildcards.sample]
num = 1
"{filename_prefix}_{num}.fastq.gz".format(
    filename_prefix=filename_prefix, num=num)

Output:

'fastq_files/SRR2920475_1.fastq.gz'

# Or, using 3.6 formatted strings
filename_prefix = config["samples"][wildcards.sample]
num = 1
f"{filename_prefix}_{num}.fastq.gz"

Output:

'fastq_files/SRR2920475_1.fastq.gz'

# We could therefore build the result of the expand in a single step:
[f"{config['samples'][wildcards.sample]}_{num}.fastq.gz" for num in [1, 2]]

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

Comments about question C

The following is a bit complex, in terms of how Python will build the string:

input:
    lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

But it should work, as we can see in the following simulation:

from collections import namedtuple
from snakemake.io import expand

Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
config = {"samples": {
    "Corces2016_4983.7A_Mono": "SRR2920475",
    "Corces2016_4983.7B_Mono": "SRR2920476"}}
expand(
    f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
    num=[1,2])

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

The problem in the trim_galore_pe rule is actually in its output section: You shouldn't use {wildcards.sample} there, but just {sample}.

The output section of a rule is where you inform snakemake of what the wildcards attributes will be for a given instance of the rule, by matching the file it wants to obtain with the patterns given. The parts matching the curly braces will be used to set the values of the corresponding attribute name.

For instance, if snakemake wants a file called "trimmed_fastq_files/Corces2016_4983.7A_Mono_1_val_1.fq.gz", it will try to match this against all patterns present in all rule's output sections and eventually find this one: "trimmed_fastq_files/{sample}_1_val_1.fq.gz"

Luckily, it will be able to match the filename with the pattern by establishing a correspondence between Corces2016_4983.7A_Mono and the {sample} part. It will then put a sample attribute in the local wildcards instance, a bit like if I was manually doing the following:

Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")

I don't know what happens exactly in snakemake if you use {wildcards.sample} instead of {wildcards}, but let's try with my simulation framework:

Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(wildcards.sample="Corces2016_4983.7A_Mono")
  File "<ipython-input-12-c02ce12bff85>", line 1
    wildcards = Wildcards(wildcards.sample="Corces2016_4983.7A_Mono")
                         ^
SyntaxError: keyword can't be an expression

What about your following attempt?

output:
    expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2]),

Here, my understanding is that Python first tries to apply the f string formatting on f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz". To do that, it will need to be able to evaluate config['samples'][wildcards.sample], but the wildcards object does not exist yet. Hence the wildcards not defined. wildcards would be generated only after matching the name of a file needed by a "downstream" rule with a string containing {attribute_name} patterns. But this is the string that snakemake is currently trying to build.

Here are some important points to remember:

wildcards actually exist only locally, in an instance of a rule, after having matched its output with a file required by another "downstream" rule instance.
You don't define variables in input sections. You use variables to build the concrete names of the files that the rule instance will need (or, more precisely, that you say you want to exist before the rule instance can be run: the rule does not need to actually use those files). Those variables are those defined outside the scope of the rules, directly at the ground level of the snakefile, in pure Python mode, and the local wildcards object. By default, the {attribute_name} placeholders will be substituted by the attributes of the local wildcards object ("{sample}" becomes "Corces2016_4983.7A_Mono"), but if you want to do more complicated stuff to build the file names, you need to do this via a function that will have to explicitly handle this wildcards object (lambda wildcards: f"{wildcards.sample}" becomes "Corces2016_4983.7A_Mono").

回答2:

For the A part of your question, I would suggest using tab-delimited files in order to store sample information, ID, etc, whatever amount of samples you have. You can then store the path to this file in your config file, and access it by using column identifiers with the pandas library. See this thread for more detail.

As for part B:

Looks like it's a new way to format string in python, see this link for ex.
config['samples'] is intended to access the config file's key "samples", while [wildcards.sample] is intended to access the wildcard {sample} as defined in your workflow (in rule all).
Double brackets are generally used to escape simple brackets. Here it avoids expanding {{num}} (can't put my hands on the example from snakemake's documentation).
num=['1', '2'] is assigning a list of values to the wildcard {num}, so later it can be accessed by using {wildcards.num}. If all rules' inputs and outputs are linked you shouldn't have to define these values more than once.