How to select all files from one sample?

牧云@^-^@ submitted on 2019-12-02 05:48:59

Question


I am having trouble making the input directive of the rule below select only the files that belong to a single {samples} value.

rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
            samples=samples['sample'],
            lanes=samples['lane'],
            flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tmpdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"

Currently it creates correctly named output files, but the input lists every file that matches the pattern across all samples, not just the files of the sample named in the output. So I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive, like this:

expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
            lanes=samples['lane'],
            flowcells=samples['flowcell']),

but this broke the previous rule somehow. So the solution is something like

input:
     "{sample}_*.bam"

But clearly this doesn't work. Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with $(echo ' {input}' etc...) in the shell directive?


Answer 1:


If you just want all the files in the directory that belong to one sample, you can use a lambda function as the input:

from glob import glob

rule MarkDup:
    input:
        # The "_" after the sample name keeps e.g. "S1" from also matching "S10_..." files
        lambda wcs: glob('Outputs/MergeBamAlignment/%s_*.bam' % wcs.samples)
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...

Just be aware that this approach can't do any checking for missing files, since it will always report that the files needed are the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).
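
If the sample sheet is already loaded (the question indexes it as samples['sample'], samples['lane'] and samples['flowcell']), another option is to build the list for one sample from that table instead of from the filesystem, so that Snakemake can still notice a missing upstream file. Below is a minimal sketch in plain Python. It assumes the sheet behaves like a dict of equal-length columns, the values are made up, and merged_bams_for and input_flags are hypothetical helper names:

```python
# Hypothetical sample sheet; in the question this is `samples`, with the
# columns 'sample', 'lane' and 'flowcell'. The values here are made up.
samples = {
    "sample":   ["A", "A", "B"],
    "lane":     ["L001", "L002", "L001"],
    "flowcell": ["FC1", "FC1", "FC2"],
}

def merged_bams_for(sample, table=samples):
    """Return the merged BAM paths whose 'sample' column equals `sample`."""
    rows = zip(table["sample"], table["lane"], table["flowcell"])
    return [
        f"Outputs/MergeBamAlignment/{s}_{lane}_{fc}.merged.bam"
        for s, lane, fc in rows
        if s == sample
    ]

def input_flags(paths):
    """Build the repeated --INPUT arguments that the sed call in the shell directive produces."""
    return " ".join(f"--INPUT {p}" for p in paths)

print(merged_bams_for("A"))
print(input_flags(merged_bams_for("A")))
```

In the rule, this would be used as input: lambda wcs: merged_bams_for(wcs.samples). Because {input} is still rendered as a space-separated list of paths, the $(echo ' {input}' | sed ...) construct in the shell directive should keep working unchanged.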




Answer 2:


If I understand correctly, zip needs to be applied only to {lanes} and {flowcells}, not to {samples}. In that case, two nested expand calls can achieve that.

input:
    expand(expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", 
        zip, lanes=samples['lane'], flowcells=samples['flowcell']), 
            samples=samples['sample'])
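
To see what the nested call produces, its two steps can be emulated in plain Python, with str.format standing in for expand. This is only an illustration under that assumption. The sample sheet is made up and nested_expand is a hypothetical name. The inner step zips lanes with flowcells and leaves {samples} in place (the doubled braces escape it). The outer step then fills in every sample:

```python
# Made-up sample sheet with the same column names as in the question.
samples = {
    "sample":   ["A", "B"],
    "lane":     ["L001", "L002"],
    "flowcell": ["FC1", "FC2"],
}

pattern = "Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam"

def nested_expand(table):
    # Inner step: pair lanes with flowcells; '{{samples}}' becomes '{samples}'.
    inner = [
        pattern.format(lanes=lane, flowcells=fc)
        for lane, fc in zip(table["lane"], table["flowcell"])
    ]
    # Outer step: fill each partially formatted pattern with every sample.
    return [p.format(samples=s) for p in inner for s in table["sample"]]

for path in nested_expand(samples):
    print(path)
```

Note that every sample is paired with every (lane, flowcell) pair, so the resulting list covers all samples and can name files that do not exist if the samples were sequenced on different lanes or flowcells.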

PS: output.tmp file uses {sample} instead of {samples}. Typo?



Source: https://stackoverflow.com/questions/54711807/how-to-select-all-files-from-one-sample
