snakemake wildcards or expand command

Posted by 牧云@^-^@ on 2019-12-03 21:55:05

Question


I want a rule that performs realignment between a normal and a tumor sample, but I don't know how to set it up. Should I use wildcards or expand?

This is my list of samples:

conditions:
   pair1:
        tumor: "432"
        normal: "433"

So the rule needs to be something like this:

rule gatk_RealignerTargetCreator:
    input:
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",sample=config['conditions']['pair1']['tumor']),
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",sample=config['conditions']['pair1']['normal']),

    output:
        "mapped_reads/merged_samples/{pair1}.realign.intervals"

How can I do this for every key in conditions? (I expect to have more than one pair.)

I have tried this code:

    input:
        lambda wildcards: config["conditions"][wildcards.condition],
        tumor= expand("mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",tumor=config['conditions'][wildcards.condition]['tumor']),
        normal = expand("mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",normal=config['conditions'][wildcards.condition]['normal']),

    output:
        "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"

It fails with this error:

name 'wildcards' is not defined


Answer 1:


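wildcards is not directly available inside the input section of a rule. There it exists only as the argument of a function that Snakemake calls once the wildcard values are known, so you need to use an input function instead. I'm not sure I understand exactly what you want to do, but you could try something like this: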

def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,    
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...
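
To build this rule for every key in conditions, one option is to request one {condition}.realign.intervals target per key, for example from the input of a rule all. The sketch below is plain Python rather than a Snakefile, so it runs on its own. It rests on several assumptions that are not in the original: config is a hand-written dict with a hypothetical second pair (pair2), SimpleNamespace stands in for the wildcards object Snakemake passes to an input function, and a list comprehension takes the place of expand().

```python
from types import SimpleNamespace

# Stand-in for the parsed config file; "pair2" and its sample IDs are hypothetical.
config = {
    "conditions": {
        "pair1": {"tumor": "432", "normal": "433"},
        "pair2": {"tumor": "500", "normal": "501"},
    }
}

BAM = "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam"


def condition2samples(wildcards, kind):
    """Return the BAM path(s) for one side ("tumor" or "normal") of a pair."""
    samples = config["conditions"][wildcards.condition][kind]
    # Accept either a single sample ID or a list of IDs.
    if isinstance(samples, str):
        samples = [samples]
    return [BAM.format(sample=s) for s in samples]


# One target per key of "conditions"; in a Snakefile this list
# could serve as the input of rule all.
targets = [
    "mapped_reads/merged_samples/{}.realign.intervals".format(condition)
    for condition in config["conditions"]
]

if __name__ == "__main__":
    print(targets)
    # Snakemake would derive wildcards.condition from the requested output name.
    wc = SimpleNamespace(condition="pair1")
    print(condition2samples(wc, "tumor"), condition2samples(wc, "normal"))
```

In a real Snakefile the same list could probably be written as expand("mapped_reads/merged_samples/{condition}.realign.intervals", condition=list(config["conditions"])).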



Answer 2:


DISCLAIMER: You want to read your pairings from a YAML file, but I advise against it. I couldn't figure out how to do it elegantly with YAML formatting. I do have an ad hoc way of pairing my SNP and INDEL annotations from YAML, but it takes a lot of boilerplate code just to read the pairs from the YAML file. That was acceptable there because that YAML variable is unlikely ever to be edited, so keeping a pedantically formatted string maintainable was not a concern.

I think the code you tried is just about right. What I think is missing is a way to request the correct pairings in the input of your rule all. I personally prefer to do this with pandas. It is listed on the homepage of the Python Software Foundation, so it's a robust choice.

The pandas setup is easy to maintain: it is a single tab- or space-separated file, which is easier for the end user than formatting nested YAML (which I think a YAML-based setup would require). This is how I do it in my system, and it scales to any number of pairs. Accessing the pandas object is a bit tricky, but I've provided the code for you. The first index in a call like 'sample[1][tumor]' selects from the (index, row) tuple that iterrows() yields: [0] is the row index, which I have yet to find a use for, and [1] is the row itself.

tree structure of workspace

(CentOS5-Compatible) [tboyarski@login3 Test]$ tree
.
|-- [tboyarsk       620 Aug  4 10:57]  Snakefile
|-- [tboyarsk        47 Aug  4 10:52]  config.yaml
|-- [tboyarsk       512 Aug  4 10:57]  output
|   |-- [tboyarsk         0 Aug  4 10:54]  ABC.bam
|   |-- [tboyarsk         0 Aug  4 10:53]  TimNorm.bam
|   |-- [tboyarsk         0 Aug  4 10:53]  TimTum.bam
|   `-- [tboyarsk         0 Aug  4 10:57]  XYZ.bam
`-- [tboyarsk        36 Aug  4 10:49]  sampleFILEpair.txt

sampleFILEpair.txt (Proof the sample names can be unrelated)

tumor normal
TimTum TimNorm
XYZ ABC

config.yaml

pathDIR: output
sampleFILE: sampleFILEpair.txt

Snakefile

from pandas import read_table

configfile: "config.yaml"

rule all:
    input:
        # iterrows() yields (index, row) tuples, so sample[1] is the row
        # and sample[1][tumor] is the value in its "tumor" column.
        expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam",
               pathDIR=config["pathDIR"],
               sample=read_table(config["sampleFILE"], sep=" ").iterrows())


rule gatk_RealignerTargetCreator:
    input:
        "{pathGRTC}/{normal}.bam",
        "{pathGRTC}/{tumor}.bam",
    output:
        "{pathGRTC}/{tumor}_{normal}.bam"
#    wildcard_constraints:
#        tumor = '[^_|-|\/][0-9a-zA-Z]*',
#        normal = '[^_|-|\/][0-9a-zA-Z]*'
    run:
        # Placeholder for the real GATK command: create the empty output file.
        shell("touch {output}")
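
The expand() call in rule all relies on Python format-string indexing: {sample[1][tumor]} takes element 1 of each tuple and then looks up the key tumor in it. The sketch below reproduces that mechanism with the standard library only. As an assumption made for illustration, csv.DictReader plus enumerate() stands in for pandas' iterrows(), str.format() stands in for expand(), and the file contents are inlined.

```python
import csv
import io

# Same content as sampleFILEpair.txt, inlined so the sketch is self-contained.
SAMPLE_FILE = "tumor normal\nTimTum TimNorm\nXYZ ABC\n"

# enumerate() over DictReader rows yields (index, row) tuples,
# the same shape as the tuples produced by DataFrame.iterrows().
rows = list(enumerate(csv.DictReader(io.StringIO(SAMPLE_FILE), delimiter=" ")))

pattern = "{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam"

# sample[0] is the row index; sample[1] is the row, looked up by column name.
targets = [pattern.format(pathDIR="output", sample=row) for row in rows]

if __name__ == "__main__":
    print(targets)
```

Each row of the sample file becomes one requested file name, which is why adding a pair only requires adding a line to sampleFILEpair.txt.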

In the past I have found merging wildcards like this to be a source of cyclic dependencies, so I always include wildcard_constraints when merging (which is essentially what we're doing here). They aren't actually necessary in this example: rule all contains no wildcards and requests the outputs of gatk_RealignerTargetCreator directly, so there is no room for ambiguity. If this rule connects to other rules that use wildcards, though, it can generate some funky DAGs.
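
To illustrate the kind of ambiguity that constraints remove, the sketch below lists every way a file name can be split into {tumor}_{normal}.bam. It is plain Python, not Snakemake code. It relies on the fact that Snakemake matches an unconstrained wildcard with the regex .+; the letters-and-digits constraint and the file name A_B_C.bam are hypothetical choices for this example, not the constraint from the commented-out lines in the Snakefile.

```python
import re

# Snakemake matches an unconstrained wildcard with the regex ".+".
UNCONSTRAINED = r".+"
# Hypothetical constraint: letters and digits only, so "_" can only be the separator.
CONSTRAINED = r"[0-9a-zA-Z]+"


def split_pairs(filename, wildcard_regex):
    """List every (tumor, normal) split of filename that fits {tumor}_{normal}.bam."""
    stem = filename[: -len(".bam")]
    splits = []
    for i, char in enumerate(stem):
        if char != "_":
            continue
        tumor, normal = stem[:i], stem[i + 1:]
        if re.fullmatch(wildcard_regex, tumor) and re.fullmatch(wildcard_regex, normal):
            splits.append((tumor, normal))
    return splits


if __name__ == "__main__":
    # Two valid readings without constraints, none with them.
    print(split_pairs("A_B_C.bam", UNCONSTRAINED))
    print(split_pairs("A_B_C.bam", CONSTRAINED))
    # Names without underscores split in exactly one way.
    print(split_pairs("TimTum_TimNorm.bam", CONSTRAINED))
```

When sample names may contain the separator character, a constraint like this (or a different separator) keeps each output name tied to a single tumor/normal pair.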



Source: https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command
