Using multiple filenames as wildcards in Snakemake

Submitted by 回眸只為那壹抹淺笑 on 2019-12-11 06:29:13

Question


I am trying to write a Snakemake rule that runs bedtools closest on one file against each of a set of files in another directory.

What I have is, under /home/bedfiles directory, 20 bed files:

1A.bed , 2B_83.bed , 3f_33.bed ...

What I want is, under /home/bedfiles directory, 20 modified bed files:

1A_modified,  2B_83_modified , 3f_33_modified ...

So the bash command would be :

filelist='/home/bedfiles/*.bed'
for mfile in $filelist;
do
    # ${mfile%.bed} strips the .bed suffix, so 1A.bed becomes 1A_modified
    bedtools closest -a /home/other/merged.txt -b "${mfile}" > "${mfile%.bed}_modified"
done

This command would create the files with the _modified suffix in the /home/bedfiles directory.

I want to implement this with Snakemake, but I keep getting a syntax error that I don't know how to fix. My attempt:

Step 1: Get the first part of each bed file name in the directory

FIRSTPART = [f.split(".")[0] for f in os.listdir("/home/bedfiles") if f.endswith('.bed')]

Step 2: Define the output names and folder

MODIFIED = expand("/home/bedfiles/{first}_modified", first=FIRSTPART)

Step 3: Put this in rule all:

rule all:
   input: MODIFIED

Step 4: Write a rule that runs 'bedtools closest'

rule closest:

    input:
        input1 = "/home/other/merged.txt" , \
        input2 = expand("/home/bedfiles/{first}.bed", first=FIRSTPART) 

    output:
        expand("/home/bedfiles/{first}_modified", first=FIRSTPART)  

    shell:
        """ bedtools closest -a {input.input1} -b {input.input2} > {output} """

It throws this error at the input line of rule all:

invalid syntax

Do you know how to get past this error, or another way to implement this?

PS: Writing out the names of the files one by one is not an option.


Answer 1:


Remove the calls to expand from the input and output definitions in closest. You're currently passing a list of 20 filenames as input.input2 and a list of 20 filenames as output.

That is, your rule closest is currently trying to run once and create 20 files, whereas it should run 20 times and create a single file each time.
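
The difference can be sketched in plain Python, without Snakemake. The three prefixes below stand in for the 20 real ones, and expand_first is a hypothetical stand-in for what expand does with a single wildcard; it is not Snakemake's implementation.

```python
# Hypothetical stand-ins for the 20 real prefixes.
FIRSTPART = ["1A", "2B_83", "3f_33"]


def expand_first(pattern, first):
    """Illustrative stand-in for expand() with one wildcard named 'first'."""
    return [pattern.format(first=value) for value in first]


# With expand() inside the rule: one job that lists every file at once.
single_job = {
    "input2": expand_first("/home/bedfiles/{first}.bed", FIRSTPART),
    "output": expand_first("/home/bedfiles/{first}_modified", FIRSTPART),
}

# With a wildcard pattern in the rule: one job per prefix, one file each.
per_file_jobs = [
    {
        "input2": f"/home/bedfiles/{prefix}.bed",
        "output": f"/home/bedfiles/{prefix}_modified",
    }
    for prefix in FIRSTPART
]

print(len(single_job["output"]))  # how many outputs the single job must make
print(len(per_file_jobs))         # how many separate jobs there are
```

In the first form the shell command would receive every path at once in place of {input.input2} and {output}, which is not what a single bedtools closest call with one redirect expects.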

In closest you want input.input2 to be a single file and output to be a single file each time that rule is run:

import os

FIRSTPART = [f.split(".")[0] for f in os.listdir("/home/bedfiles") if f.endswith('.bed')]

print("These are the input files:")
print([f + ".bed" for f in FIRSTPART])

MODIFIED = expand("/home/bedfiles/{first}_modified", first=FIRSTPART)
print("These will be created")
print(MODIFIED)

rule all:
   input: MODIFIED

rule closest:
    message: """
        Converts /home/other/merged.txt and /some/dir/xyz.bed
        into /some/dir/xyz_modified
        """

    input:
        input1 = "/home/other/merged.txt",
        input2 = "{prefix}.bed" 

    output:    "{prefix}_modified"  

    shell:
        """ 
        bedtools closest -a {input.input1} -b {input.input2} > {output}
        """

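One caveat with f.split(".")[0]: it keeps only the text before the first dot, so a name such as sample.v2.bed would be shortened to sample. Below is a small sketch of a stricter alternative using os.path.splitext, which strips only the final extension. The helper name bed_prefixes and the example file names are hypothetical.

```python
import os


def bed_prefixes(filenames):
    """Return each .bed file name with only its final extension removed."""
    prefixes = []
    for name in filenames:
        root, ext = os.path.splitext(name)
        if ext == ".bed":
            prefixes.append(root)
    return prefixes


# Hypothetical directory listing; in the Snakefile this would be
# os.listdir("/home/bedfiles").
listing = ["1A.bed", "2B_83.bed", "sample.v2.bed", "notes.txt"]

print(bed_prefixes(listing))
print([f.split(".")[0] for f in listing if f.endswith(".bed")])
```

For the file names in the question both approaches give the same prefixes; they only differ when a name contains more than one dot. Snakemake also provides glob_wildcards, which can collect wildcard values directly from a path pattern.
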
Here's an experiment:

Change into a temporary directory and run the following there:

mkdir bedfiles
touch bedfiles/{a,b,c,d}.bed

Then create a file called Snakefile in that directory containing the following code:

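import os
import re

input_dir = "bedfiles"

# Collect only the .bed files so that a rerun does not pick up the outputs.
input_files = [
    os.path.join(input_dir, f)
    for f in os.listdir(input_dir)
    if f.endswith(".bed")
]
print(input_files)

# bedfiles/a.bed -> bedfiles/a_modified
output_files = [re.sub(r"\.bed$", "_modified", f) for f in input_files]
print(output_files)

rule all:
    input: output_files

# Stand-in for the real command: copies each .bed file to its target name.
rule mover:
    input: "{prefix}.bed"
    output: "{prefix}_modified"
    shell:
        """ cp {input} {output} """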

Then run it using snakemake at the command line. Snakemake is goal-oriented; it works out how to make your desired outputs based on the existing files.
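
The goal-oriented lookup can be illustrated in plain Python. This is only a sketch of the idea, not how Snakemake is implemented: the helper resolve is hypothetical, and the regular expression mimics a {prefix} wildcard that matches any non-empty string.

```python
import re

# The mover rule, reduced to its two patterns.
OUTPUT_PATTERN = "{prefix}_modified"
INPUT_PATTERN = "{prefix}.bed"


def resolve(target):
    """Work backwards from a requested output to the input it needs.

    Returns the input path, or None if the rule cannot produce the target.
    """
    # Build a regex from the output pattern: the literal text around
    # {prefix} is escaped, and {prefix} itself becomes a named group.
    before, after = OUTPUT_PATTERN.split("{prefix}")
    regex = re.escape(before) + "(?P<prefix>.+)" + re.escape(after)
    match = re.fullmatch(regex, target)
    if match is None:
        return None
    return INPUT_PATTERN.format(prefix=match.group("prefix"))


# rule all asks for these targets; each one is resolved on its own.
for target in ["bedfiles/a_modified", "bedfiles/b_modified"]:
    print(target, "<-", resolve(target))
```

Each target named in rule all is matched against the output pattern separately, which is why the rule with wildcards ends up running once per file.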




Answer 2:


Easy one: the invalid syntax error refers to a missing comma after input1 = "/home/other/merged.txt". Hope it helps. Marc
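
Named entries under input: are written much like keyword arguments in a Python call, so the separator matters in the same way. Here is a minimal plain-Python sketch of that analogy. The dict(...) call is only a stand-in for the rule's input section, the helper is_valid_call is hypothetical, and the exact error text depends on the Python version.

```python
def is_valid_call(source):
    """Return True if the source text compiles as Python, False on a syntax error."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


with_comma = 'dict(input1="/home/other/merged.txt", input2="{prefix}.bed")'
without_comma = 'dict(input1="/home/other/merged.txt" input2="{prefix}.bed")'

print(is_valid_call(with_comma))
print(is_valid_call(without_comma))
```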



Source: https://stackoverflow.com/questions/48443572/using-multiple-filenames-as-wildcards-in-snakemake
