dask-distributed

Long running workers blocking GIL timeout errors

Submitted by 耗尽温柔 on 2021-02-18 18:55:47
Question: I'm using dask-distributed with a local setup (LocalCluster with 5 workers) on a dask.delayed workload. Most of the work is done by the vtk Python bindings. Since vtk is C++ based, I think that means the workers don't release the GIL during a long-running call. When I run the workload, my terminal prints out a bunch of errors like this:
Traceback (most recent call last): File "C:\Users\patri\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\comm\core.py", line 221, in
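
One possible mitigation, sketched below and not taken from the question or any answer: run each worker as its own single-threaded process, so a vtk call that holds the GIL only stalls that one process, and relax the communication timeouts so heartbeats tolerate long blocking calls. The timeout keys are the ones documented under dask's "distributed.comm.timeouts" configuration section; the values are examples.

```python
# Hedged sketch: isolate GIL-holding vtk work in separate worker processes and
# relax communication timeouts. Cluster size and timeout values are examples.
import dask
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Raise comm timeouts before starting the cluster (keys from dask's
    # "distributed.comm.timeouts" config section).
    dask.config.set({
        "distributed.comm.timeouts.connect": "60s",
        "distributed.comm.timeouts.tcp": "60s",
    })

    # One thread per worker process: a long-running C++ call that never
    # releases the GIL only blocks its own worker, not the whole cluster.
    cluster = LocalCluster(n_workers=5, threads_per_worker=1, processes=True)
    client = Client(cluster)
```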

Actors and dask-workers

Submitted by 眉间皱痕 on 2021-02-09 08:29:41
Question:
client = Client('127.0.0.1:8786', direct_to_workers=True)
future1 = client.submit(Counter, workers='ninja', actor=True)
counter1 = future1.result()
print(counter1)
All is well, but what if the client gets restarted? How do I get the actor back from the worker called ninja?
Answer 1: There is no user-facing way to do this as of 2019-03-06. I recommend raising a feature request issue.
Source: https://stackoverflow.com/questions/54918699/actors-and-dask-workers
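
For context, a minimal sketch of the actor pattern the question uses; the Counter class is an assumed stand-in, since the post does not show its definition.

```python
# Hedged sketch of submitting an actor to a named worker; Counter is assumed.
from dask.distributed import Client

class Counter:
    """Stateful actor that lives on a single worker."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client('127.0.0.1:8786', direct_to_workers=True)

# Pin the actor to the worker started with --name ninja.
future1 = client.submit(Counter, workers='ninja', actor=True)
counter1 = future1.result()              # an Actor proxy, not a plain value
print(counter1.increment().result())     # actor method calls return ActorFutures
```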

Why do my Dask Futures get stuck in 'pending' and never finish?

Submitted by 假装没事ソ on 2021-02-08 08:25:25
Question: I have some long-running code (~5-10 minutes of processing) that I'm trying to run as a Dask Future. It's a series of several discrete steps that I can either run as one function:
result: Future = client.submit(my_function, arg1, arg2)
Or I can split up into intermediate steps:
# compose the result from the same intermediate results but with Futures
intermediate1 = client.submit(my_function1, arg1)
intermediate2 = client.submit(my_function2, arg1, arg2)
intermediate3 = client.submit(my
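
A hedged sketch of the split-up variant (function names follow the question, the bodies are placeholders): Futures can be passed directly to client.submit to chain steps, and keeping references to the intermediate Futures matters, because released Futures let the scheduler forget those tasks and downstream work can then appear stuck in 'pending'.

```python
# Hedged sketch: composing dependent Futures; my_function* bodies are placeholders.
from dask.distributed import Client

def my_function1(a):
    return a + 1

def my_function2(a, b):
    return a * b

def my_function3(x, y):
    return x + y

client = Client()            # local cluster, for illustration only
arg1, arg2 = 2, 3            # placeholder inputs

intermediate1 = client.submit(my_function1, arg1)
intermediate2 = client.submit(my_function2, arg1, arg2)
# Futures can be passed straight to submit; Dask resolves them on the workers.
result = client.submit(my_function3, intermediate1, intermediate2)
print(result.result())       # blocks until the whole chain finishes
```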

How to assign tasks to specific worker within Dask.Distributed

Submitted by 半腔热情 on 2021-02-07 07:52:46
Question: I am interested in using Dask Distributed as a task executor. In Celery it is possible to assign a task to a specific worker. How can this be done with Dask Distributed?
Answer 1: There are two options:
Specify workers by name, host, or IP (but only positive declarations):
dask-worker scheduler_address:8786 --name worker_1
and then one of:
client.map(func, sequence, workers='worker_1')
client.map(func, sequence, workers=['192.168.1.100', '192.168.1.100:8989', 'alice', 'alice:8989'])
client.submit
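
A short sketch of the answer's two forms, assuming a worker was started with dask-worker scheduler_address:8786 --name worker_1; the scheduler address and worker addresses below are placeholders taken from the answer.

```python
# Hedged sketch: pinning tasks to specific workers by name or by address.
from dask.distributed import Client

def func(x):
    return x * 2

client = Client('scheduler_address:8786')   # placeholder scheduler address

# By worker name (the worker was started with --name worker_1)
futures = client.map(func, range(10), workers='worker_1')

# By host or host:port (placeholder addresses from the answer)
single = client.submit(func, 42, workers=['192.168.1.100:8989'])

print(client.gather(futures), single.result())
```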

Difference between dask.distributed LocalCluster with threads vs. processes

Submitted by 旧巷老猫 on 2021-02-07 06:46:10
Question: What is the difference between the following LocalCluster configurations for dask.distributed?
Client(n_workers=4, processes=False, threads_per_worker=1)
versus
Client(n_workers=1, processes=True, threads_per_worker=4)
They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads, as opposed to a single worker with multiple threads?
Edit: just a clarification, I'm aware of the difference
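
To make the comparison concrete, a small sketch that starts each configuration and reports how the four threads are laid out; the "nthreads" field in scheduler_info() is assumed from recent distributed versions.

```python
# Hedged sketch: both setups expose four threads to the task graph, but split
# across workers differently (four in-process workers vs. one 4-thread process).
from dask.distributed import Client

for kwargs in (
    dict(n_workers=4, processes=False, threads_per_worker=1),
    dict(n_workers=1, processes=True, threads_per_worker=4),
):
    with Client(**kwargs) as client:
        workers = client.scheduler_info()["workers"]
        print(kwargs, "->", len(workers), "worker(s) with",
              [w["nthreads"] for w in workers.values()], "thread(s) each")
```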

Dask dataframe split partitions based on a column or function

Submitted by 做~自己de王妃 on 2021-02-06 20:48:40
Question: I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel. Say I have some sales data like this:
customerKey  productKey  transactionKey  grossSales  netSales  unitVolume  volume  transactionDate
-----------  ----------  --------------  ----------  --------  ----------  ------  -------------------
20353        189         219548          0.921058    0.921058  1           1       2017-02-01 00:00:00
2596618      189         215015          0.709997    0.709997  1           1       2017-02-01 00:00:00
30339435     189         215184          0
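
One common approach, sketched below under assumptions (not necessarily the accepted answer): shuffle once with set_index so every customerKey lands in exactly one partition, after which per-customer work can run partition by partition with plain pandas. The toy data only mirrors the question's columns.

```python
# Hedged sketch: partition by customerKey once, then apply per-customer logic
# inside each partition with pandas.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({
    "customerKey": [20353, 2596618, 30339435],
    "productKey": [189, 189, 189],
    "grossSales": [0.921058, 0.709997, 0.0],
})

ddf = dd.from_pandas(pdf, npartitions=2)
ddf = ddf.set_index("customerKey")   # one shuffle; keys no longer span partitions

def per_customer(part: pd.DataFrame) -> pd.DataFrame:
    # each customerKey is wholly contained in this partition, so a plain
    # pandas groupby on the index is safe
    return part.groupby(level=0).agg(total=("grossSales", "sum"))

print(ddf.map_partitions(per_customer).compute())
```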

Parallelization on cluster dask

Submitted by 纵饮孤独 on 2021-01-29 10:28:50
Question: I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoint with respect to the key I want to group by; that is, if a set of keys is in file1.csv, none of these keys has an item in any other file. On one hand I can just run
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way
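
One alternative worth sketching (hedged, not necessarily the accepted answer): since keys never cross files, each file can be read and grouped independently with dask.delayed, which avoids the global shuffle that dd.read_csv plus groupby.apply would trigger. The function f below is a placeholder for the poster's per-group computation.

```python
# Hedged sketch: per-file groupby with dask.delayed; f is a placeholder.
from glob import glob

import dask
import pandas as pd

def f(group: pd.DataFrame) -> pd.DataFrame:
    return group          # stand-in for the real per-group computation

@dask.delayed
def process_file(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    return df.groupby("key", group_keys=False).apply(f)

parts = [process_file(p) for p in sorted(glob("folder/*.csv"))]
result = pd.concat(dask.compute(*parts, scheduler="processes"))
```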