dask-distributed

Long running workers blocking GIL timeout errors

Submitted by 耗尽温柔 on 2021-02-18 18:55:47
Question: I'm using dask-distributed with a local setup (LocalCluster with 5 workers) on a dask.delayed workload. Most of the work is done by the vtk Python bindings. Since vtk is C++ based, I think that means the workers don't release the GIL during a long-running call. When I run the workload, my terminal prints out a bunch of errors like this:
Traceback (most recent call last): File "C:\Users\patri\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\comm\core.py", line 221, in
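
One possible mitigation, sketched below and not taken from the question or any answer: run each worker as its own single-threaded process, so a vtk call that holds the GIL only stalls that one process, and relax the communication timeouts so heartbeats tolerate long blocking calls. The timeout keys are the ones documented under dask's "distributed.comm.timeouts" configuration section; the values are examples.

```python
# Hedged sketch: isolate GIL-holding vtk work in separate worker processes and
# relax communication timeouts. Cluster size and timeout values are examples.
import dask
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Raise comm timeouts before starting the cluster (keys from dask's
    # "distributed.comm.timeouts" config section).
    dask.config.set({
        "distributed.comm.timeouts.connect": "60s",
        "distributed.comm.timeouts.tcp": "60s",
    })

    # One thread per worker process: a long-running C++ call that never
    # releases the GIL only blocks its own worker, not the whole cluster.
    cluster = LocalCluster(n_workers=5, threads_per_worker=1, processes=True)
    client = Client(cluster)
```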

Actors and dask-workers

Submitted by 眉间皱痕 on 2021-02-09 08:29:41
Question:
client = Client('127.0.0.1:8786', direct_to_workers=True)
future1 = client.submit(Counter, workers='ninja', actor=True)
counter1 = future1.result()
print(counter1)
All is well, but what if the client gets restarted? How do I get the actor back from the worker called ninja?
Answer 1: There is no user-facing way to do this as of 2019-03-06. I recommend raising a feature request issue.
Source: https://stackoverflow.com/questions/54918699/actors-and-dask-workers
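
For context, a minimal sketch of the actor pattern the question uses; the Counter class is an assumed stand-in, since the post does not show its definition.

```python
# Hedged sketch of submitting an actor to a named worker; Counter is assumed.
from dask.distributed import Client

class Counter:
    """Stateful actor that lives on a single worker."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client('127.0.0.1:8786', direct_to_workers=True)

# Pin the actor to the worker started with --name ninja.
future1 = client.submit(Counter, workers='ninja', actor=True)
counter1 = future1.result()              # an Actor proxy, not a plain value
print(counter1.increment().result())     # actor method calls return ActorFutures
```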

Why do my Dask Futures get stuck in 'pending' and never finish?

Submitted by 假装没事ソ on 2021-02-08 08:25:25
Question: I have some long-running code (~5-10 minutes of processing) that I'm trying to run as a Dask Future. It's a series of several discrete steps that I can either run as one function:
result: Future = client.submit(my_function, arg1, arg2)
Or I can split up into intermediate steps:
# compose the result from the same intermediate results but with Futures
intermediate1 = client.submit(my_function1, arg1)
intermediate2 = client.submit(my_function2, arg1, arg2)
intermediate3 = client.submit(my
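
A hedged sketch of the split-up variant (function names follow the question, the bodies are placeholders): Futures can be passed directly to client.submit to chain steps, and keeping references to the intermediate Futures matters, because released Futures let the scheduler forget those tasks and downstream work can then appear stuck in 'pending'.

```python
# Hedged sketch: composing dependent Futures; my_function* bodies are placeholders.
from dask.distributed import Client

def my_function1(a):
    return a + 1

def my_function2(a, b):
    return a * b

def my_function3(x, y):
    return x + y

client = Client()            # local cluster, for illustration only
arg1, arg2 = 2, 3            # placeholder inputs

intermediate1 = client.submit(my_function1, arg1)
intermediate2 = client.submit(my_function2, arg1, arg2)
# Futures can be passed straight to submit; Dask resolves them on the workers.
result = client.submit(my_function3, intermediate1, intermediate2)
print(result.result())       # blocks until the whole chain finishes
```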

How to assign tasks to specific worker within Dask.Distributed

Submitted by 半腔热情 on 2021-02-07 07:52:46
Question: I am interested in using Dask Distributed as a task executor. In Celery it is possible to assign a task to a specific worker. How can this be done with Dask Distributed?
Answer 1: There are two options:
Specify workers by name, host, or IP (but only positive declarations):
dask-worker scheduler_address:8786 --name worker_1
and then one of:
client.map(func, sequence, workers='worker_1')
client.map(func, sequence, workers=['192.168.1.100', '192.168.1.100:8989', 'alice', 'alice:8989'])
client.submit
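
A short sketch of the answer's two forms, assuming a worker was started with dask-worker scheduler_address:8786 --name worker_1; the scheduler address and worker addresses below are placeholders taken from the answer.

```python
# Hedged sketch: pinning tasks to specific workers by name or by address.
from dask.distributed import Client

def func(x):
    return x * 2

client = Client('scheduler_address:8786')   # placeholder scheduler address

# By worker name (the worker was started with --name worker_1)
futures = client.map(func, range(10), workers='worker_1')

# By host or host:port (placeholder addresses from the answer)
single = client.submit(func, 42, workers=['192.168.1.100:8989'])

print(client.gather(futures), single.result())
```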

Difference between dask.distributed LocalCluster with threads vs. processes

Submitted by 旧巷老猫 on 2021-02-07 06:46:10
Question: What is the difference between the following LocalCluster configurations for dask.distributed?
Client(n_workers=4, processes=False, threads_per_worker=1)
versus
Client(n_workers=1, processes=True, threads_per_worker=4)
They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads, as opposed to a single worker with multiple threads?
Edit: just a clarification, I'm aware of the difference
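
To make the comparison concrete, a small sketch that starts each configuration and reports how the four threads are laid out; the "nthreads" field in scheduler_info() is assumed from recent distributed versions.

```python
# Hedged sketch: both setups expose four threads to the task graph, but split
# across workers differently (four in-process workers vs. one 4-thread process).
from dask.distributed import Client

for kwargs in (
    dict(n_workers=4, processes=False, threads_per_worker=1),
    dict(n_workers=1, processes=True, threads_per_worker=4),
):
    with Client(**kwargs) as client:
        workers = client.scheduler_info()["workers"]
        print(kwargs, "->", len(workers), "worker(s) with",
              [w["nthreads"] for w in workers.values()], "thread(s) each")
```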

Dask dataframe split partitions based on a column or function

Submitted by 做~自己de王妃 on 2021-02-06 20:48:40
Question: I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel. Say I have some sales data like this:
customerKey  productKey  transactionKey  grossSales  netSales  unitVolume  volume  transactionDate
-----------  ----------  --------------  ----------  --------  ----------  ------  -------------------
20353        189         219548          0.921058    0.921058  1           1       2017-02-01 00:00:00
2596618      189         215015          0.709997    0.709997  1           1       2017-02-01 00:00:00
30339435     189         215184          0
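
One common approach, sketched below under assumptions (not necessarily the accepted answer): shuffle once with set_index so every customerKey lands in exactly one partition, after which per-customer work can run partition by partition with plain pandas. The toy data only mirrors the question's columns.

```python
# Hedged sketch: partition by customerKey once, then apply per-customer logic
# inside each partition with pandas.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({
    "customerKey": [20353, 2596618, 30339435],
    "productKey": [189, 189, 189],
    "grossSales": [0.921058, 0.709997, 0.0],
})

ddf = dd.from_pandas(pdf, npartitions=2)
ddf = ddf.set_index("customerKey")   # one shuffle; keys no longer span partitions

def per_customer(part: pd.DataFrame) -> pd.DataFrame:
    # each customerKey is wholly contained in this partition, so a plain
    # pandas groupby on the index is safe
    return part.groupby(level=0).agg(total=("grossSales", "sum"))

print(ddf.map_partitions(per_customer).compute())
```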

Parallelization on cluster dask

Submitted by 纵饮孤独 on 2021-01-29 10:28:50
Question: I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoint with respect to the key I want to group by; that is, if a set of keys is in file1.csv, none of these keys has an item in any other file. On one hand I can just run
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way
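
One alternative worth sketching (hedged, not necessarily the accepted answer): since keys never cross files, each file can be read and grouped independently with dask.delayed, which avoids the global shuffle that dd.read_csv plus groupby.apply would trigger. The function f below is a placeholder for the poster's per-group computation.

```python
# Hedged sketch: per-file groupby with dask.delayed; f is a placeholder.
from glob import glob

import dask
import pandas as pd

def f(group: pd.DataFrame) -> pd.DataFrame:
    return group          # stand-in for the real per-group computation

@dask.delayed
def process_file(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    return df.groupby("key", group_keys=False).apply(f)

parts = [process_file(p) for p in sorted(glob("folder/*.csv"))]
result = pd.concat(dask.compute(*parts, scheduler="processes"))
```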