dask-delayed

Dask scheduler empty / graph not showing

Submitted by £可爱£侵袭症+ on 2020-12-15 06:40:00
Question: I have a setup as follows:

    # etl.py
    from dask.distributed import Client
    import dask
    from tasks import task1, task2, task3

    def runall(**kwargs):
        print("done")

    def etl():
        client = Client()
        tasks = {}
        tasks['task1'] = dask.delayed(task1)(*args)
        tasks['task2'] = dask.delayed(task2)(*args)
        tasks['task3'] = dask.delayed(task3)(*args)
        out = dask.delayed(runall)(**tasks)
        out.compute()

This logic was borrowed from luigi and works nicely with if statements to control what tasks to run. However, some of …
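A runnable variant of that pattern, for reference (a sketch only: task1, task2, task3 and their arguments are stand-ins, since the real tasks.py is not shown):

    import dask
    from dask.distributed import Client

    # stand-ins for the imports from tasks.py
    def task1():
        return "t1"

    def task2():
        return "t2"

    def task3():
        return "t3"

    def runall(**kwargs):
        print("done")

    def etl():
        client = Client()
        tasks = {}
        tasks['task1'] = dask.delayed(task1)()
        tasks['task2'] = dask.delayed(task2)()
        tasks['task3'] = dask.delayed(task3)()
        # runall depends on every entry in tasks, so computing it
        # forces the whole graph to execute
        out = dask.delayed(runall)(**tasks)
        out.compute()

    if __name__ == "__main__":
        etl()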

How do the batching instructions of Dask delayed best practices work?

Submitted by 白昼怎懂夜的黑 on 2020-12-15 06:16:59
Question: I guess I'm missing something (still a Dask noob), but I'm trying the batching suggestion to avoid creating too many Dask tasks, from here: https://docs.dask.org/en/latest/delayed-best-practices.html and can't make it work. This is what I tried:

    import dask

    def f(x):
        return x * x

    def batch(seq):
        sub_results = []
        for x in seq:
            sub_results.append(f(x))
        return sub_results

    batches = []
    for i in range(0, 1000000000, 1000000):
        result_batch = dask.delayed(batch, range(i, i + 1000000))
        batches.append(result_batch)
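As written, the range ends up as the second positional argument of dask.delayed (its name parameter) and never reaches batch. The best-practices pattern wraps the function first and then calls it like a normal function; a sketch with a smaller range so it finishes quickly (the sizes here are illustrative):

    import dask

    def f(x):
        return x * x

    def batch(seq):
        # one task processes a whole batch, keeping the task count low
        return [f(x) for x in seq]

    batches = []
    for i in range(0, 10000000, 1000000):
        result_batch = dask.delayed(batch)(range(i, i + 1000000))
        batches.append(result_batch)

    # ten tasks instead of ten million; results come back as lists per batch
    results = dask.compute(*batches)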

Can I use dask.delayed on a function wrapped with ctypes?

Submitted by 最后都变了- on 2020-07-07 11:45:45
Question: The goal is to use dask.delayed to parallelize some 'embarrassingly parallel' sections of my code. The code involves calling a Python function which wraps a C function using ctypes. To understand the errors I was getting, I wrote a very basic example.

The C function:

    double zippy_sum(double x, double y) { return x + y; }

The Python:

    from dask.distributed import Client
    client = Client(n_workers=4)
    client

    import os
    import dask
    import ctypes

    current_dir = os.getcwd()  # os.path.abspath(os.path …
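One pattern that sidesteps pickling problems with ctypes is to load the library inside the task, so no unpicklable handle is captured in the graph. A sketch, assuming the C function above has been compiled into zippy.so (the filename and path are assumptions):

    import ctypes
    import dask
    from dask.distributed import Client

    def zippy_sum_py(x, y):
        # load the shared library inside the function: ctypes handles
        # cannot be pickled, so they must not ride along in the graph
        lib = ctypes.CDLL("./zippy.so")
        lib.zippy_sum.argtypes = [ctypes.c_double, ctypes.c_double]
        lib.zippy_sum.restype = ctypes.c_double
        return lib.zippy_sum(x, y)

    if __name__ == "__main__":
        client = Client(n_workers=4)
        out = dask.delayed(zippy_sum_py)(1.0, 2.0)
        print(out.compute())  # 3.0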

How can I get result of Dask compute on a different machine than the one that submitted it?

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-15 03:22:05
Question: I am using Dask behind a Django server, and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here: https://github.com/MoonVision/django-dask-demo/blob/master/demo/daskmanager/daskmanager.py I want to be able to separate the saving of a task from the server that submitted it, for robustness and scalability. I would also like more detailed information on the processing status of the task; right now the future status …
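One way to decouple submission from retrieval is to keep results alive on the cluster as named datasets, which any client attached to the same scheduler can fetch. A minimal sketch, assuming a shared scheduler address:

    from dask.distributed import Client

    def work(x):
        return x + 1

    # machine A: submit, publish, then go away
    client_a = Client("scheduler-address:8786")   # assumed address
    future = client_a.submit(work, 41)
    client_a.publish_dataset(my_result=future)    # scheduler keeps it alive

    # machine B: attach to the same scheduler and collect
    client_b = Client("scheduler-address:8786")
    future_b = client_b.get_dataset("my_result")
    print(future_b.result())  # 42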

Dask For Loop In Parallel

Submitted by 主宰稳场 on 2020-01-11 08:47:07
Question: I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions, but none fit my condition, which is extremely basic. First, is this the correct way to run a for loop in parallel?

    %%time
    from dask import delayed

    list_names = ['a', 'b', 'c', 'd']
    keep_return = []

    @delayed
    def loop_dummy(target):
        for i in range(1000000000):
            pass
        print('passed value is:' + target)
        return 1

    for i in list_names:
        c = loop_dummy(i)
        keep_return.append(c)

    total = delayed(sum)(keep_return)
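The shape of the loop matches the documented pattern; the key point is that nothing executes until compute() is called on the final node. A short sketch of the full round trip (the billion-iteration busy loop is dropped so it runs instantly):

    from dask import delayed

    @delayed
    def loop_dummy(target):
        return 1  # stand-in for the real per-item work

    keep_return = [loop_dummy(name) for name in ['a', 'b', 'c', 'd']]
    total = delayed(sum)(keep_return)
    print(total.compute())  # 4: the four calls run in parallel, then sum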

How does dask.delayed handle mutable inputs?

Submitted by 怎甘沉沦 on 2020-01-05 04:24:09
Question: If I have a mutable object, let's say for example a dict, how does Dask handle passing it as an input to delayed functions? Specifically, what if I make updates to the dict between delayed calls? I tried the following example, which seems to suggest that some copying is going on, but can you elaborate on what exactly Dask is doing?

    In [3]: from dask import delayed
    In [4]: x = {}
    In [5]: foo = delayed(print)
    In [6]: foo(x)
    Out[6]: Delayed('print-73930550-94a6-43f9-80ab-072bc88c2b88')
    In [7]: foo(x)
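One small experiment (a sketch, not an authoritative statement of Dask internals): with the default local schedulers the graph holds a reference to the dict rather than a copy, so a mutation made before compute() is visible when the task runs; copying enters the picture with a distributed Client, which serializes inputs when shipping tasks to workers.

    from dask import delayed

    def snapshot(d):
        return dict(d)  # copy inside the task so the returned value is stable

    x = {}
    task = delayed(snapshot)(x)   # the graph references x, it does not copy it
    x['key'] = 'added after the delayed call'

    # with the local threaded scheduler the mutation is visible at run time
    print(task.compute())  # {'key': 'added after the delayed call'}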

Using Dask compute causes execution to hang

Submitted by 大兔子大兔子 on 2019-12-12 04:57:15
Question: This is a follow-up question to a potential answer to one of my previous questions on using Dask compute to access one element in a large array. Why does using Dask compute cause the execution to hang below? Here's the working code snippet:

    # Suppose you created a scheduler at the IP address 111.111.11.11:8786
    from dask.distributed import Client
    import dask.array as da

    # client1
    client1 = Client("111.111.11.11:8786")
    x = da.ones(10000000, chunks=(100000,))  # 1e7 size array cut into 1e5 …
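For reference, a sketch of the two-client pattern this question builds on, assuming both clients can reach the scheduler address from the snippet: persist and publish on one client, then fetch from another.

    from dask.distributed import Client
    import dask.array as da

    # client1: build, persist and publish so the chunks live on the workers
    client1 = Client("111.111.11.11:8786")
    x = da.ones(10000000, chunks=(100000,))
    x = client1.persist(x)
    client1.publish_dataset(array1=x)

    # client2: attach to the same scheduler and pull a single element
    client2 = Client("111.111.11.11:8786")
    y = client2.get_dataset("array1")
    print(y[0].compute())  # 1.0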

Can we create a Dask cluster with both multiple CPU machines and multiple GPU machines?

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 08:04:02
Question: Can we create a Dask cluster with some CPU and some GPU machines together? If yes, how can I control that a certain task must run only on a CPU machine, and some other type of task only on a GPU machine, while a task with no constraint picks whichever machine is free? Does Dask support this type of cluster, and what is the command that pins a task to a specific CPU/GPU machine?

Answer 1: You can specify that a Dask worker has certain abstract resources:

    dask-worker scheduler:8786 -…
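A sketch of the abstract-resources mechanism the answer refers to (the worker commands, scheduler address, and resource names are illustrative):

    # start workers declaring what they have, e.g. from the shell:
    #   dask-worker scheduler:8786 --resources "GPU=2"
    #   dask-worker scheduler:8786 --resources "CPU=4"

    from dask.distributed import Client

    client = Client("scheduler:8786")  # assumed scheduler address

    def train(data):
        return data  # stand-in for GPU work

    # this task may only run on workers advertising a GPU resource
    future = client.submit(train, "data", resources={"GPU": 1})

    # a submit() without resources= runs on whichever worker is free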

Access a single element in large published array with Dask

Submitted by 一个人想着一个人 on 2019-12-11 06:37:53
Question: Is there a faster way to retrieve only a single element of a large published array with Dask, without retrieving the entire array? In the example below, client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').

    import distributed
    client = distributed.Client()
    data = [1] * 10000000

    payload = {'array1': data}
    client.publish_dataset(**payload)

    one_element = client.get_dataset('array1')[0]

Answer 1: Note that anything you publish goes to the scheduler, not to the workers, so …
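A sketch of a cheaper pattern, under the assumption that the data can live as a chunked dask array on the workers: publish a persisted dask.array, so indexing pulls only the chunk holding the element rather than the whole dataset through the scheduler.

    import dask.array as da
    from dask.distributed import Client

    client = Client()  # assumes a running cluster

    x = da.ones(10000000, chunks=100000)   # chunked, so pieces are independent
    x = client.persist(x)                  # the data now lives on the workers
    client.publish_dataset(array1=x)

    # later, possibly from another client on the same scheduler:
    one_element = client.get_dataset("array1")[0].compute()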

Creating a dask bag from a generator

Submitted by 孤者浪人 on 2019-12-11 00:23:42
Question: I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory.

    delayed_array = [delayed(generator) for generator in list_of_generators]
    my_bag = db.from_delayed(delayed_array)

N.B. list_of_generators is exactly that: the generators haven't been consumed (yet).

My problem is that when creating delayed_array the generators are consumed and RAM is exhausted. Is there a way to get these long lists into the …
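One common workaround (a sketch, assuming each generator can be recreated from a function and its arguments): delay the function that produces the data instead of the generator object itself, so nothing is materialized until the task runs on a worker.

    import dask.bag as db
    from dask import delayed

    def make_part(n):
        # build the partition inside the task; the driver never holds it
        return list(range(n))  # stand-in for the real generator's output

    parts = [delayed(make_part)(1000000) for _ in range(8)]
    my_bag = db.from_delayed(parts)   # each partition materializes lazily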