Question
I want a rule that performs realignment between a normal and a tumor sample, but I don't know how to set it up. Is the answer wildcards or expand?
This is my list of samples:
conditions:
  pair1:
    tumor: "432"
    normal: "433"
So the rule needs to be something like this:
rule gatk_RealignerTargetCreator:
    input:
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam", sample=config['conditions']['pair1']['tumor']),
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam", sample=config['conditions']['pair1']['normal']),
    output:
        "mapped_reads/merged_samples/{pair1}.realign.intervals"
How can I do this for every key under conditions? (I expect to have more than one pair.)
I have tried this code:
    input:
        lambda wildcards: config["conditions"][wildcards.condition],
        tumor = expand("mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam", tumor=config['conditions'][wildcards.condition]['tumor']),
        normal = expand("mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam", normal=config['conditions'][wildcards.condition]['normal']),
    output:
        "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
It fails with this error:
name 'wildcards' is not defined
Answer 1:
wildcards is not "directly" defined in the input of a rule; you need to pass a function that takes wildcards as its argument instead. I'm not sure I understand exactly what you want to do, but you could try something like this:
def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...
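To see what the two input functions resolve to, here is a minimal sketch in plain Python that mimics the lookup outside Snakemake. The `config` dict mirrors the question's YAML with a second, made-up pair added; `bam_for` and `all_targets` are hypothetical helper names, and `SimpleNamespace` only stands in for the `wildcards` object that Snakemake would pass.

```python
from types import SimpleNamespace

# Mirrors the question's config; "pair2" and its sample IDs are made up.
config = {
    "conditions": {
        "pair1": {"tumor": "432", "normal": "433"},
        "pair2": {"tumor": "500", "normal": "501"},
    }
}

BAM = "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam"


def bam_for(wildcards, role):
    """Return the BAM path for the tumor or normal sample of one condition."""
    sample = config["conditions"][wildcards.condition][role]
    return BAM.format(sample=sample)


def all_targets():
    """Build one .realign.intervals target per key under 'conditions'."""
    return [
        f"mapped_reads/merged_samples/{condition}.realign.intervals"
        for condition in config["conditions"]
    ]


# Stand-in for the wildcards object Snakemake passes to an input function.
wc = SimpleNamespace(condition="pair1")
print(bam_for(wc, "tumor"))
print(bam_for(wc, "normal"))
print(all_targets())
```

In a Snakefile, a list like the one built by `all_targets` would typically be requested from the input of `rule all`, so that every pair under `conditions` is processed rather than only `pair1`.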
Answer 2:
DISCLAIMER: You want to read your pairings from a YAML file, but I advise against it. I couldn't figure out how to do it elegantly with YAML formatting. I do have an ad-hoc way of doing it to pair my SNP and INDEL annotations, but it takes a lot of boilerplate code just to read the pairing from the YAML. That was acceptable in my case because the YAML variable is unlikely ever to be edited, so keeping a pedantically formatted string maintainable is not a concern.
I think the code you tried is just about right. What I think is missing is a way to "request" the correct pairings in the input of your "rule all". I personally prefer to do this with pandas. It is listed on the homepage of the Python Software Foundation, so it's a robust choice.
The pandas setup is very easy to maintain: it's a single tab- or space-separated file, which is easier for the end user than formatting nested YAML files (which I think a YAML-based setup would require). This is how I do it in my system, and it scales to any number of pairs. Accessing the pandas object is admittedly a bit tricky, but I've provided the code for you. iterrows() yields (index, row) tuples, so in the 'sample[1][tumor]' call the [1] selects the row itself, while [0] is just the row's index label, which is not needed here and can be ignored.
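The `{sample[1][tumor]}` field relies on Python's format-string indexing, which can be tried without Snakemake or pandas. In this minimal sketch, `csv.DictReader` combined with `enumerate` is only a stand-in that produces `(index, row)` tuples of the same shape as `DataFrame.iterrows()`. The table contents are copied from `sampleFILEpair.txt`, and `PAIR_TABLE` and `pair_targets` are hypothetical names.

```python
import csv
import io

# Contents of sampleFILEpair.txt, inlined so the sketch is self-contained.
PAIR_TABLE = "tumor normal\nTimTum TimNorm\nXYZ ABC\n"


def pair_targets(table_text, path_dir):
    """Build one '<dir>/<tumor>_<normal>.bam' target per row of the table."""
    rows = csv.DictReader(io.StringIO(table_text), delimiter=" ")
    pattern = "{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam"
    # enumerate() yields (index, row) tuples, the same shape iterrows() yields,
    # so sample[1] is the row and sample[1][tumor] is that row's tumor column.
    return [pattern.format(pathDIR=path_dir, sample=item) for item in enumerate(rows)]


print(pair_targets(PAIR_TABLE, "output"))
```

Because each tumor name comes from the same row as its normal, the pairs stay together; passing the two columns to `expand` as separate lists would instead produce every tumor/normal combination.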
tree structure of workspace
(CentOS5-Compatible) [tboyarski@login3 Test]$ tree
.
|-- [tboyarsk 620 Aug 4 10:57] Snakefile
|-- [tboyarsk 47 Aug 4 10:52] config.yaml
|-- [tboyarsk 512 Aug 4 10:57] output
|   |-- [tboyarsk 0 Aug 4 10:54] ABC.bam
|   |-- [tboyarsk 0 Aug 4 10:53] TimNorm.bam
|   |-- [tboyarsk 0 Aug 4 10:53] TimTum.bam
|   `-- [tboyarsk 0 Aug 4 10:57] XYZ.bam
`-- [tboyarsk 36 Aug 4 10:49] sampleFILEpair.txt
sampleFILEpair.txt (Proof the sample names can be unrelated)
tumor normal
TimTum TimNorm
XYZ ABC
config.yaml
pathDIR: output
sampleFILE: sampleFILEpair.txt
Snakefile
from pandas import read_table
configfile: "config.yaml"
rule all:
    input:
        # iterrows() yields (index, row) tuples, so sample[1] is the row.
        expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam",
               pathDIR=config["pathDIR"],
               sample=read_table(config["sampleFILE"], sep=" ").iterrows())
rule gatk_RealignerTargetCreator:
    input:
        "{pathGRTC}/{normal}.bam",
        "{pathGRTC}/{tumor}.bam",
    output:
        "{pathGRTC}/{tumor}_{normal}.bam"
    # wildcard_constraints:
    #     tumor = '[^_|-|\/][0-9a-zA-Z]*',
    #     normal = '[^_|-|\/][0-9a-zA-Z]*'
    run:
        # Placeholder for the real GATK command: just create the output file.
        shell("touch {output}")
When wildcards are merged like this, I have found in the past that they can become a source of cyclic dependencies, so I always include wildcard_constraints when merging (which is essentially what we're doing here). They aren't actually necessary in this case: "rule all" contains no wildcards, and it is what calls "gatk", so in this exact example there is no room for ambiguity. But if this rule connects to other rules that use wildcards, it can generate some funky DAGs.
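To illustrate the ambiguity that constraints remove: an unconstrained Snakemake wildcard matches the regular expression `.+`, so a pattern such as `{tumor}_{normal}.bam` can split a name containing several underscores in an unintended way. The sketch below reproduces that with Python's `re` module only. It is not Snakemake's own matching code, and the `[0-9a-zA-Z]+` constraint, the helper `split_pair`, and the file names are assumptions chosen for illustration.

```python
import re

# Unconstrained wildcards behave like '.+'.
LOOSE = re.compile(r"(?P<tumor>.+)_(?P<normal>.+)\.bam")
# Hypothetical constraint: sample names are alphanumeric only.
STRICT = re.compile(r"(?P<tumor>[0-9a-zA-Z]+)_(?P<normal>[0-9a-zA-Z]+)\.bam")


def split_pair(pattern, filename):
    """Return (tumor, normal) if the whole filename matches, else None."""
    match = pattern.fullmatch(filename)
    return (match["tumor"], match["normal"]) if match else None


# Greedy '.+' takes everything up to the last underscore as the tumor name.
print(split_pair(LOOSE, "TimTum_TimNorm_realigned.bam"))
# With the constraint, the same name no longer matches at all,
print(split_pair(STRICT, "TimTum_TimNorm_realigned.bam"))
# while a plain pair still splits the intended way.
print(split_pair(STRICT, "TimTum_TimNorm.bam"))
```

With the loose pattern, a downstream file such as `TimTum_TimNorm_realigned.bam` would itself look like a tumor/normal pair, which is the kind of match that can lead to unexpected or cyclic dependencies once more rules are added.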
Source: https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command