Luigi: how to pass different arguments to leaf tasks?

荒凉一梦 提交于 2020-12-10 08:01:19

问题


This is my second attempt at understanding how to pass arguments to dependencies in Luigi. The first one was here.

The idea is: I have TaskC which depends on TaskB, which depends on TaskA, which depends on Task0. I want this whole sequence to be exactly the same always, except I want to be able to control what file Task0 reads from, lets call it path. Luigi's philosophy is normally that each task should only know about the Tasks it depends on, and their parameters. The problem with this is that TaskC, TaskB, and TaskA all would have to accept variable path for the sole purpose of then passing it to Task0.

So, the solution that Luigi provides for this is called Configuration Classes

Here's some example code:

from pathlib import Path
import luigi
from luigi import Task, TaskParameter, IntParameter, LocalTarget, Parameter

class config(luigi.Config):
    path = Parameter(default="defaultpath.txt")

class Task0(Task):
    path = Parameter(default=config.path)
    arg = IntParameter(default=0)
    def run(self):
        print(f"READING FROM {self.path}")
        Path(self.output().path).touch()
    def output(self): return LocalTarget(f"task0{self.arg}.txt")

class TaskA(Task):
    arg = IntParameter(default=0)
    def requires(self): return Task0(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskA{self.arg}.txt")

class TaskB(Task):
    arg = IntParameter(default=0)
    def requires(self): return TaskA(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskB{self.arg}.txt")

class TaskC(Task):
    arg = IntParameter(default=0)
    def requires(self): return TaskB(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskC{self.arg}.txt")

(Ignore all the output and run stuff. They're just there so the example runs successfully.)

The point of the above example is controlling the line print(f"READING FROM {self.path}") without having tasks A, B, C depend on path.

Indeed, with Configuration Classes I can control the Task0 argument. If Task0 is not passed a path parameter, it takes its default value, which is config().path.

My problem now is that this appears to me to work only at "build time", when the interpreter first loads the code, but not at run time (the details aren't clear to me).

So neither of these work:

A)

if __name__ == "__main__":
    for i in range(3):
        config.path = f"newpath_{i}"
        luigi.build([TaskC(arg=i)], log_level="INFO")

===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 4 ran successfully:
    - 1 Task0(path=defaultpath.txt, arg=2)
    - 1 TaskA(arg=2)
    - 1 TaskB(arg=2)
    - 1 TaskC(arg=2)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

I'm not sure why this doesn't work.

B)

if __name__ == "__main__":
    for i in range(3):
        luigi.build([TaskC(arg=i), config(path=f"newpath_{i}")], log_level="INFO")

===== Luigi Execution Summary =====

Scheduled 5 tasks of which:
* 5 ran successfully:
    - 1 Task0(path=defaultpath.txt, arg=2)
    - 1 TaskA(arg=2)
    - 1 TaskB(arg=2)
    - 1 TaskC(arg=2)
    - 1 config(path=newpath_2)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

This actually makes sense. There's two config classes, and I only managed to change the path of one of them.

Help?

EDIT: Of course, having path reference a global variable works, but then it's not a Parameter in the usual Luigi sense.

EDIT2: I tried point 1) of the answer below:

config has the same definition

class config(luigi.Config):
    path = Parameter(default="defaultpath.txt")

I fixed mistake pointed out, i.e. Task0 is now:

class Task0(Task):
    path = Parameter(default=config().path)
    arg = IntParameter(default=0)
    def run(self):
        print(f"READING FROM {self.path}")
        Path(self.output().path).touch()
    def output(self): return LocalTarget(f"task0{self.arg}.txt")

and finally I did:

if __name__ == "__main__":
    for i in range(3):
        config.path = Parameter(f"file_{i}")
        luigi.build([TaskC(arg=i)], log_level="WARNING")

This doesn't work, Task0 still gets path="defaultpath.txt".


回答1:


So what you're trying to do is create tasks with params without passing these params to the parent class. That is completely understandable, and I have been annoyed at times in trying to handle this.

Firstly, you are using the config class incorrectly. When using a Config class, as noted in https://luigi.readthedocs.io/en/stable/configuration.html#configuration-classes, you need to instantiate the object. So, instead of:

class Task0(Task):
    path = Parameter(default=config.path)
    ...

you would use:

class Task0(Task):
    path = Parameter(default=config().path)
    ...

While this now ensures you are using a value and not a Parameter object, it still does not solve your problem. When creating the class Task0, config().path would be evaluated, therefore it's not assigning the reference of config().path to path, but instead the value when called (which will always be defaultpath.txt). When using the class in the correct manner, luigi will construct a Task object with only luigi.Parameter attributes as the attribute names on the new instance as seen here: https://github.com/spotify/luigi/blob/master/luigi/task.py#L436

So, I see two possible paths forward.

1.) The first is to set the config path at runtime like you had, except set it to be a Parameter object like this:

config.path = luigi.Parameter(f"newpath_{i}")

However, this would take a lot of work to get your tasks using config.path working as now they need to take in their parameters differently (can't be evaluated for defaults when the class is created).

2.) The much easier way is to simply specify the arguments for your classes in the config file. If you look at https://github.com/spotify/luigi/blob/master/luigi/task.py#L825, you'll see that the Config class in Luigi, is actually just a Task class, so you can anything with it you could do with a class and vice-versa. Therefore, you could just have this in your config file:

[Task0]
path = newpath_1
...

3.) But, since you seem to be wanting to run multiple tasks with the different arguments for each, I would just recommend passing in args through the parents as Luigi encourages you to do. Then you could run everything with:

luigi.build([TaskC(arg=i) for i in range(3)])

4.) Finally, if you really need to get rid of passing dependencies, you can create a ParamaterizedTaskParameter that extends luigi.ObjectParameter and uses the pickle of a task instance as the object.

Of the above solutions, I highly suggest either 2 or 3. 1 would be difficult to program around, and 4 would create some very ugly parameters and is a bit more advanced.

Edit: Solutions 1 and 2 are more of hacks than anything, and it is just recommended that you bundle parameters in DictParameter.



来源:https://stackoverflow.com/questions/64958830/luigi-how-to-pass-different-arguments-to-leaf-tasks

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!