Repeated task execution using the distributed Dask scheduler


Question


I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute().

When the number of tasks is, say, 20 (far more than the number of workers) and each task takes at least 15 seconds, the scheduler starts rerunning some of the tasks (or running them in parallel more than once).

This is a problem because the tasks modify a SQL database, and if they run again they raise an exception (due to DB uniqueness constraints). I'm not setting pure=True anywhere (and I believe the default is False). Other than that, the Dask graph is trivial (no dependencies between the tasks).
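For reference, a minimal sketch of the submission pattern described above, with a placeholder task body and illustrative worker/task counts (not the original code):

```python
import time

import dask
from dask import delayed
from dask.distributed import Client


@delayed
def slow_task(record_id):
    # Stand-in for a task that takes ~15 s and writes one row to a SQL DB.
    time.sleep(15)
    return record_id


if __name__ == "__main__":
    client = Client(n_workers=5)                 # local scheduler with 5 workers
    tasks = [slow_task(i) for i in range(20)]    # 20 tasks >> 5 workers
    results = dask.compute(*tasks)               # submit the whole list at once
    print(results)
```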

Still not sure if this is a feature or a bug in Dask. I have a gut feeling that this might be related to work stealing...


Answer 1:


Correct: if a task is allocated to one worker and another worker becomes free, it may choose to steal excess tasks from its peers. There is a chance that it will steal a task that has just started to run, in which case the task will run twice.

The clean way to handle this problem is to ensure that your tasks are idempotent, so that they produce the same result even if run twice. This might mean handling your database error within your task.
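For example, here is a minimal sketch of an idempotent database-writing task, assuming SQLite and a hypothetical results table with a unique id column (your schema, driver, and SQL dialect will differ):

```python
import sqlite3

from dask import delayed


@delayed
def idempotent_insert(db_path, record_id, payload):
    # Write one row; running this task twice for the same record_id is harmless.
    conn = sqlite3.connect(db_path)
    try:
        # INSERT OR IGNORE turns a repeated insert into a no-op instead of an error.
        conn.execute(
            "INSERT OR IGNORE INTO results (id, payload) VALUES (?, ?)",
            (record_id, payload),
        )
        conn.commit()
    except sqlite3.IntegrityError:
        # Alternative approach: catch the uniqueness violation and treat it as success.
        pass
    finally:
        conn.close()
    return record_id
```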

This is one of those policies that are great for data-intensive computing workloads but terrible for data engineering workloads. It's tricky to design a system that satisfies both needs simultaneously.



Source: https://stackoverflow.com/questions/41965253/repeated-task-execution-using-the-distributed-dask-scheduler
