Question
I'm not understanding the resiliency model in dask distributed.
Problem
An exception raised by any worker kills an embarrassingly parallel dask operation. All workers and the client die if any single worker encounters an exception.
Expected Behavior
The documentation at http://distributed.dask.org/en/latest/resilience.html#user-code-failures suggests that exceptions should be contained to workers and that subsequent tasks will continue without interruption:
"When a function raises an error that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception...This does not affect the smooth operation of the scheduler or worker in any way."
I was following the embarrassingly parallel use case here: http://docs.dask.org/en/latest/use-cases.html
Reproducible example
import numpy as np
np.random.seed(0)
from dask import compute, delayed
from dask.distributed import Client, LocalCluster

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    elif x == 20:
        print("I've made it to 20")
    else:
        return x

if __name__ == "__main__":
    # Create cluster
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)
    values = [delayed(raise_exception)(x) for x in range(0, 100)]
    results = compute(*values, scheduler='distributed')
Task 20 is never reached. The exception on task 10 causes the scheduler and workers to die. What am I not understanding about the programming model? Why does this count as gathering? I just want to run each task and capture any exceptions for later inspection, not raise them on the client.
Use Case
Parallel image processing on a University SLURM cluster. My function has a side-effect that saves processed images to file. The processes are independent and never gathered by the scheduler. The exception causes all nodes to die on the cluster.
Cross-listed on issues, since I'm not sure if this is a bug or a feature!
https://github.com/dask/distributed/issues/2436
Answer 1:
Answered in the repo: dask delayed/compute is all-or-nothing. Use client.map from the concurrent.futures interface plus wait instead. This is by design, not a bug.
https://github.com/dask/distributed/issues/2436
Source: https://stackoverflow.com/questions/53892940/how-to-prevent-dask-client-from-dying-on-worker-exception