What do KilledWorker exceptions mean in Dask?

Submitted by 柔情痞子 on 2019-12-08 17:34:32

Question


My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?


Answer 1:


This error is generated when the Dask scheduler no longer trusts your task, because it was present too often when workers died unexpectedly. It is designed to protect the cluster against tasks that kill workers, for example by segfaults or memory errors.

Whenever a worker dies unexpectedly, the scheduler notes which tasks were running on that worker when it died. It retries those tasks on other workers but also marks them as suspicious. If the same task is present on several workers when they die, the scheduler eventually gives up retrying it and instead marks it as failed with the exception KilledWorker.
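
The bookkeeping described above can be sketched as a toy model. This is not Dask's actual scheduler code: ToyScheduler, worker_died, and the locally defined KilledWorker class are hypothetical names made up for illustration, and the rule "fail once the count exceeds the limit" is an assumption about how the threshold is applied.

```python
from collections import defaultdict


class KilledWorker(Exception):
    """Stand-in for the real exception; defined here only for this toy model."""


class ToyScheduler:
    """Hypothetical, simplified model of the suspicious-task counter."""

    def __init__(self, allowed_failures=3):
        self.allowed_failures = allowed_failures
        self.suspicious = defaultdict(int)  # task -> number of worker deaths it was present for
        self.failed = {}                    # task -> KilledWorker instance

    def worker_died(self, worker, running_tasks):
        """Record a worker death and return the tasks that should be retried."""
        retry = []
        for task in running_tasks:
            if task in self.failed:
                continue  # already given up on this task
            self.suspicious[task] += 1
            if self.suspicious[task] > self.allowed_failures:
                # Assumption: the task is failed once the count exceeds the limit.
                self.failed[task] = KilledWorker(task, worker)
            else:
                retry.append(task)
        return retry


sched = ToyScheduler(allowed_failures=3)
for i in range(4):
    # "bad-task" is on every worker that dies; each innocent task is unlucky only once.
    retry = sched.worker_died(f"worker-{i}", ["bad-task", f"innocent-{i}"])

print(sorted(sched.failed))  # tasks the toy scheduler has given up on
print(retry)                 # tasks still eligible for retry after the last death
```

In this sketch a task that merely happened to share a worker with the culprit is retried as usual, while the task that was present at every death is the one that ends up marked as failed.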

Often this means that your task has some other issue. Perhaps it causes a segmentation fault or allocates too much memory. Perhaps it uses a library that is not threadsafe. Or perhaps it is just very unlucky. Regardless, you should inspect your worker logs to determine why your workers are failing. This is likely a bigger issue than your task failing.

You can control this behavior by modifying the following entry in your ~/.config/dask/distributed.yaml file.

distributed:
  scheduler:
    allowed-failures: 3     # number of retries before a task is considered bad
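
As an alternative to editing the YAML file, the same setting can probably be overridden from Python with dask.config.set. The sketch below assumes the key is addressed as distributed.scheduler.allowed-failures, and the value 5 is an arbitrary example rather than a recommendation. The override only matters if it is in effect in the process that starts the scheduler.

```python
import dask

# Assumption: 5 is an arbitrary illustrative value, not a recommended setting.
NEW_LIMIT = 5

# Temporarily override the setting. Any scheduler created inside this block
# (for example via a local cluster started in this process) would see the new value.
with dask.config.set({"distributed.scheduler.allowed-failures": NEW_LIMIT}):
    inside = dask.config.get("distributed.scheduler.allowed-failures")
    print(inside)
```

Raising the limit only hides the symptom. If workers keep dying, the worker logs are still the place to look for the underlying cause.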


Source: https://stackoverflow.com/questions/46691675/what-do-killedworker-exceptions-mean-in-dask
