How to prevent dask client from dying on worker exception?

Submitted by 可紊 on 2021-01-29 08:12:12

Question


I don't understand the resiliency model in dask.distributed.

Problem

An exception raised by a worker kills the embarrassingly parallel dask operation. All workers and the client die if any worker encounters an exception.

Expected Behavior

The documentation here: http://distributed.dask.org/en/latest/resilience.html#user-code-failures suggests that exceptions should be contained to workers and that subsequent tasks would go on without interruption:

"When a function raises an error that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception...This does not affect the smooth operation of the scheduler or worker in any way."

I was following the embarrassingly parallel use case here: http://docs.dask.org/en/latest/use-cases.html

Reproducible example

import numpy as np
np.random.seed(0)

from dask import compute, delayed
from dask.distributed import Client, LocalCluster

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    elif x == 20:
        print("I've made it to 20")
    else:
        return x


if __name__ == "__main__":

    # Create cluster
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    values = [delayed(raise_exception)(x) for x in range(0, 100)]
    results = compute(*values, scheduler='distributed')

Task 20 never runs. The exception on task 10 causes the scheduler and workers to die. What am I not understanding about the programming model? Why does this count as gathering? I just want to run each task and capture any exceptions for later inspection, not raise them on the client.

Use Case

Parallel image processing on a University SLURM cluster. My function has a side-effect that saves processed images to file. The processes are independent and never gathered by the scheduler. The exception causes all nodes to die on the cluster.

Cross-listed on issues, since I'm not sure if this is a bug or a feature!

https://github.com/dask/distributed/issues/2436


Answer 1:


Answered in the repo: dask.delayed with compute is all-or-nothing, so a failure in any task raises on the client. Use dask's map from the concurrent.futures interface together with wait instead. This is by design, not a bug.

https://github.com/dask/distributed/issues/2436
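As a sketch of that suggestion: dask's Client exposes a concurrent.futures-style interface (submit, map, and a distributed.wait helper), where each task gets its own future whose exception is held rather than raised. The same pattern is shown below with the standard-library executor so it runs anywhere; with dask you would swap ThreadPoolExecutor for Client and concurrent.futures.wait for distributed.wait (an assumption about the substitution, not code from the answer).

```python
from concurrent.futures import ThreadPoolExecutor, wait

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    return x

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns one future per task; an exception in one task
    # does not cancel or kill the others.
    futures = [executor.submit(raise_exception, x) for x in range(100)]
    wait(futures)  # block until every task has finished or failed

# Inspect exceptions later instead of raising them on the client:
errors = {i: f.exception() for i, f in enumerate(futures) if f.exception()}
results = [f.result() for f in futures if f.exception() is None]
```

Here only task 10 fails; the other 99 tasks (including task 20) complete, and the error is available via `Future.exception()` for later inspection.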



Source: https://stackoverflow.com/questions/53892940/how-to-prevent-dask-client-from-dying-on-worker-exception
