Question
I'm not understanding the resiliency model in dask distributed.
Problem
An exception raised by any worker kills an embarrassingly parallel dask operation. All workers and the client die if any single worker encounters an exception.
Expected Behavior
The documentation at http://distributed.dask.org/en/latest/resilience.html#user-code-failures suggests that exceptions should be contained to workers and that subsequent tasks will continue without interruption:
"When a function raises an error that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception...This does not affect the smooth operation of the scheduler or worker in any way."
I was following the embarrassingly parallel use case here: http://docs.dask.org/en/latest/use-cases.html
Reproducible example
import numpy as np
np.random.seed(0)
from dask import compute, delayed
from dask.distributed import Client, LocalCluster

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    elif x == 20:
        print("I've made it to 20")
    else:
        return x

if __name__ == "__main__":
    # Create cluster
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)
    values = [delayed(raise_exception)(x) for x in range(0, 100)]
    results = compute(*values, scheduler='distributed')
Task 20 is never reached. The exception on task 10 causes the scheduler and workers to die. What am I not understanding about the programming model? Why does this count as gathering? I just want to run each task and capture any exceptions for later inspection, not raise them on the client.
Use Case
Parallel image processing on a University SLURM cluster. My function has a side-effect that saves processed images to file. The processes are independent and never gathered by the scheduler. The exception causes all nodes to die on the cluster.
Cross-listed on issues, since I'm not sure if this is a bug or a feature!
https://github.com/dask/distributed/issues/2436
Answer 1:
Answered in the repo: dask delayed/compute is all-or-nothing. Use client.map from the concurrent.futures interface plus wait instead. This is by design, not a bug.
https://github.com/dask/distributed/issues/2436
Source: https://stackoverflow.com/questions/53892940/how-to-prevent-dask-client-from-dying-on-worker-exception