ducttape sometimes-skip task: cross-product error

Submitted by 大憨熊 on 2019-12-11 03:00:32

Question


I'm trying a variant of sometimes-skip tasks for ducttape, based on the tutorial here: http://nschneid.github.io/ducttape-crash-course/tutorial5.html

([ducttape][1] is a Bash/Scala based workflow management tool.)

I'm trying to do a cross-product to execute task1 on "clean" data and "dirty" data. The idea is to traverse the same path, but without preprocessing in some cases. To do this, I need to do a cross-product of tasks.

task cleanup < in=(Dirty: a=data/a b=data/b) > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    data=(Data: dirty=(Dirty: a=data/a b=data/b) clean=(Clean: a=$out@cleanup b=$out@cleanup))
}

task task1 < in=$data > out 
{ 
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (Dirty: *) * (Data: *) * (Clean: *)
}

Here is the execution plan. I would expect 6 tasks, but ducttape plans 8, two of which are duplicates.

$ ducttape skip.tape
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 7 previous workflow versions
Finding hyperpaths contained in plan...
Found 8 vertices implied by realization plan FinalTasks
Union of all planned vertices has size 8
Checking for completed tasks from versions 1 through 7...
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Baseline.baseline (Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Dirty.b (Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Baseline.baseline (Data.dirty+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Dirty.b (Data.dirty+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean+Dirty.b (Clean.b+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean+Dirty.b (Clean.a+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean (Clean.a+Data.clean+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean (Clean.b+Data.clean+Dirty.a)
Are you sure you want to run these 8 tasks? [y/n] 

Removing the symlinks from the output below, my duplicates are here:

$ head task1/*/out
==> Baseline.baseline/out <==
1

==> Clean.b+Data.clean/out <==
1-clean
==> Data.clean/out <==
1-clean

==> Clean.b+Data.clean+Dirty.b/out <==
2-clean
==> Data.clean+Dirty.b/out <==
2-clean

==> Dirty.b/out <==
2

Could someone with experience with ducttape assist me in finding my cross-product problem?

  [1]: https://github.com/jhclark/ducttape

Answer 1:


So why do we have 4 realizations involving the branch point Clean at task1 instead of just two?

The answer is that in ducttape, branch points are always propagated through all transitive dependencies of a task. So the branch point "Dirty" from the task "cleanup" is propagated through clean=(Clean: a=$out@cleanup b=$out@cleanup). At this point the variable "clean" contains the cross product of the original "Dirty" and the newly introduced "Clean" branch point.
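To make the counting concrete, here is a small Python sketch that enumerates the realizations by hand. It is only a simplified model of the propagation rule described above, not ducttape's actual planning algorithm, and the variable names (`cleanup`, `task1_dirty`, `task1_clean`) are made up for illustration.

```python
from itertools import product

# Branch points and their branches, as declared in the original workflow.
DIRTY = ["a", "b"]   # (Dirty: a=data/a b=data/b)
CLEAN = ["a", "b"]   # (Clean: a=$out@cleanup b=$out@cleanup)

# cleanup reads in=(Dirty: ...), so it has one realization per Dirty branch.
cleanup = [{"Dirty": d} for d in DIRTY]

# Data.dirty reads the raw files directly, so only Dirty varies.
task1_dirty = [{"Data": "dirty", "Dirty": d} for d in DIRTY]

# Data.clean reads (Clean: a=$out@cleanup b=$out@cleanup). Each Clean branch
# depends on cleanup, so cleanup's Dirty branch point is inherited and
# crossed with Clean (assumed model of the propagation rule).
task1_clean = [
    {"Data": "clean", "Clean": c, **upstream}
    for c, upstream in product(CLEAN, cleanup)
]

task1 = task1_dirty + task1_clean
print(len(cleanup), len(task1), len(cleanup) + len(task1))  # 2 6 8

# Both Clean branches point at the same $out@cleanup, so what task1 reads
# depends only on (Data, Dirty): Clean.a and Clean.b duplicate each other.
distinct_inputs = {(r["Data"], r["Dirty"]) for r in task1}
print(len(distinct_inputs))  # 4
```

The 2 + 6 = 8 total matches the 8 vertices in the work plan from the question, and the 4 distinct inputs correspond to the four different file contents seen in the `head task1/*/out` listing.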

The minimal change to make is to change

clean=(Clean: a=$out@cleanup b=$out@cleanup)

to

clean=$out@cleanup

This would give you the desired number of realizations, but it's a bit confusing to use the branch point name "Dirty" just to control which input data set you're using. With only this minimal change, the two realizations of the task "cleanup" would be (Dirty: a b).

It may make your workflow even more grokkable to refactor it like this:

global {
    raw_data=(DataSet: a=data/a b=data/b)
}

task cleanup < in=$raw_data > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    ready_data=(DoCleanup: no=$raw_data yes=$out@cleanup)
}

task task1 < in=$ready_data > out 
{ 
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (DataSet: *) * (DoCleanup: *)
}
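
As a sanity check on the refactored workflow, the same kind of hand enumeration can be sketched in Python. Again, this is an illustrative model under the assumption that each task inherits the branch points of its inputs; it does not call ducttape, and the variable names are hypothetical.

```python
from itertools import product

DATASET = ["a", "b"]        # (DataSet: a=data/a b=data/b)
DO_CLEANUP = ["no", "yes"]  # (DoCleanup: no=$raw_data yes=$out@cleanup)

# cleanup depends only on $raw_data, so only DataSet varies.
cleanup = [{"DataSet": d} for d in DATASET]

# task1 reads $ready_data: DoCleanup is introduced there, and DataSet is
# inherited from $raw_data on both branches (directly or via cleanup).
task1 = [
    {"DataSet": d, "DoCleanup": c}
    for d, c in product(DATASET, DO_CLEANUP)
]

print(len(cleanup) + len(task1))  # 6

# Every task1 realization now reads a different input, so nothing is duplicated.
distinct_inputs = {(r["DataSet"], r["DoCleanup"]) for r in task1}
print(len(distinct_inputs) == len(task1))  # True
```

Under this model the plan contains the 6 tasks the question originally expected: 2 for cleanup and 4 for task1.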


Source: https://stackoverflow.com/questions/23698707/ducttape-sometimes-skip-task-cross-product-error
