dask-distributed

Override dask scheduler to concurrently load data on multiple workers

Submitted by 我怕爱的太早我们不能终老 on 2019-12-24 01:11:49
Question: I want to run graphs/futures on my distributed cluster, all of which have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:

    from dask.distributed import Client

    client = Client(scheduler_ip)
    load_data_future = client.submit(load_data_func, 'path/to/data/')
    train_task_futures = [client.submit(train_func, load_data_future, params)
                          for params in train_param_set]

Running this as above, the scheduler gets one worker to read the…
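One approach worth sketching (not taken from the original answer; load_data_func, train_func, and train_param_set are the question's own placeholders): load the data once, then replicate the resulting future to every worker, so the training tasks can be scheduled anywhere instead of funnelling through the single worker holding the data.

    from dask.distributed import Client, wait

    client = Client(scheduler_ip)

    # Load once on some worker and wait for it to finish...
    load_data_future = client.submit(load_data_func, 'path/to/data/')
    wait(load_data_future)

    # ...then copy the result onto all workers so dependent tasks
    # can run anywhere without re-fetching the data.
    client.replicate([load_data_future])

    train_task_futures = [client.submit(train_func, load_data_future, params)
                          for params in train_param_set]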

Dask - How to concatenate Series into a DataFrame with apply?

Submitted by 二次信任 on 2019-12-23 01:16:07
Question: How do I return multiple values from a function applied on a Dask Series? I am trying to return a Series from each iteration of dask.Series.apply, so that the final result is a dask.DataFrame. The following code tells me that the meta is wrong; the all-pandas version, however, works. What's wrong here? Update: I think I am not specifying the meta/schema correctly. How do I do it correctly? It now works when I drop the meta argument, but it raises a warning. I would like to use dask…
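A minimal sketch of one way to specify the meta (the column names 'x' and 'y' are illustrative, not from the original post): within each partition, pandas' Series.apply already expands a returned Series into DataFrame columns, so map_partitions with a DataFrame-shaped meta works.

    import dask.dataframe as dd
    import pandas as pd

    def f(v):
        # Illustrative: return two derived values per element
        return pd.Series({'x': v * 2, 'y': v * 3})

    s = dd.from_pandas(pd.Series([1, 2, 3, 4]), npartitions=2)

    # meta describes the resulting DataFrame's columns and dtypes
    meta = pd.DataFrame({'x': pd.Series(dtype='i8'),
                         'y': pd.Series(dtype='i8')})
    ddf = s.map_partitions(lambda part: part.apply(f), meta=meta)
    print(ddf.compute())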

Best practices in setting number of dask workers

Submitted by 折月煮酒 on 2019-12-21 07:13:43
Question: I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster. The terms I have come across are: thread, process, processor, node, worker, and scheduler. My question is how to set the number of each, and whether there is a strict or recommended relationship between any of them. For example:

- 1 worker per node, with n processes for the n cores on the node?
- Are threads and processes the same concept?
- In dask-mpi I have to set nthreads, but they show up as…
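For reference, a sketch of the standard knobs (not the accepted answer): in dask.distributed a worker is a process, and each worker runs some number of threads; with the dask-worker CLI of that era these were the --nprocs and --nthreads flags, and LocalCluster exposes the same controls in Python:

    from dask.distributed import Client, LocalCluster

    # E.g., on an 8-core node: 2 worker processes with 4 threads each.
    # Prefer more processes for GIL-bound pure-Python work, and more
    # threads for numpy/pandas code that releases the GIL.
    cluster = LocalCluster(n_workers=2, threads_per_worker=4, processes=True)
    client = Client(cluster)
    print(client)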

Workaround for Item assignment not supported in dask

Submitted by 谁都会走 on 2019-12-20 07:27:11
Question: I am trying to convert my code base from numpy arrays to dask, because my numpy arrays are exceeding the memory limit. But I have come to learn that mutable arrays are not yet implemented in dask, so I am getting NotImplementedError: Item assignment with <class 'tuple'> not supported. Is there any workaround for my code below?

    for i, mask in enumerate(masks):
        bounds = find_boundaries(mask, mode='inner')
        X2, Y2 = np.nonzero(bounds)
        X2 = da.from_array(X2, 'auto')
        Y2 = da.from…
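A general workaround (a sketch, assuming the failing line assigns into a dask array at masked positions): express the assignment functionally with da.where instead of writing in place.

    import dask.array as da

    arr = da.zeros((1000, 1000), chunks=(250, 250))

    # In-place `arr[bounds] = 1` raises NotImplementedError in dask;
    # selecting between values with a boolean mask is the lazy equivalent.
    bounds = da.random.random((1000, 1000), chunks=(250, 250)) > 0.99
    arr = da.where(bounds, 1, arr)

    print(arr.sum().compute())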

How to map a dask Series with a large dict

Submitted by 自闭症网瘾萝莉.ら on 2019-12-12 13:18:20
Question: I'm trying to figure out the best way to map a dask Series with a large mapping. The straightforward series.map(large_mapping) issues UserWarning: Large object of size <X> MB detected in task graph and suggests using client.scatter and client.submit, but the latter doesn't solve the problem and is in fact much slower. Trying broadcast=True in client.scatter doesn't help either.

    import argparse
    import distributed
    import dask.dataframe as dd
    import numpy as np
    import pandas as pd

    def compute(s…
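One pattern worth trying (a sketch, not the accepted answer): scatter the dict once and pass the resulting Future into map_partitions, so the mapping travels to workers by reference rather than being embedded in the task graph. All names below are illustrative.

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()  # or Client(scheduler_address)

    large_mapping = {i: i * 2 for i in range(1_000_000)}  # stand-in dict
    s = dd.from_pandas(pd.Series(range(1000), name='x'), npartitions=4)

    # Ship the dict to the cluster once; workers fetch it by reference.
    mapping_future = client.scatter(large_mapping, broadcast=True)

    mapped = s.map_partitions(lambda part, m: part.map(m),
                              mapping_future, meta=('x', 'i8'))
    print(mapped.compute())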

Using Dask compute causes execution to hang

Submitted by 大兔子大兔子 on 2019-12-12 04:57:15
Question: This is a follow-up question to a potential answer to one of my previous questions, on using Dask compute to access one element in a large array. Why does using Dask compute cause the execution to hang below? Here's the working code snippet:

    # Suppose you created a scheduler at the IP address 111.111.11.11:8786
    from dask.distributed import Client
    import dask.array as da

    # client1
    client1 = Client("111.111.11.11:8786")
    x = da.ones(10000000, chunks=(100000,))  # 1e7-size array cut into 1e5…
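For context, a sketch of the two-client publish/retrieve pattern the question appears to build on (the scheduler address is the question's placeholder):

    from dask.distributed import Client
    import dask.array as da

    client1 = Client("111.111.11.11:8786")
    x = da.ones(10_000_000, chunks=(100_000,))
    client1.publish_dataset(x=x)

    # A second client, possibly in another process, retrieves it lazily:
    client2 = Client("111.111.11.11:8786")
    y = client2.get_dataset('x')
    print(y[0].compute())  # computes only the chunk holding element 0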

Send SIGTERM to the running task, dask distributed

Submitted by 别等时光非礼了梦想. on 2019-12-12 01:13:13
Question: When I submit a small TensorFlow training job as a single task, it launches additional threads. When I press Ctrl+C and raise KeyboardInterrupt, my task is closed but the underlying threads are not cleaned up and training continues. Initially I thought this was a TensorFlow problem (not cleaning up its threads), but after testing I understand the problem comes from the Dask side, which probably doesn't propagate the SIGTERM signal on to the task function. My question: how can I set Dask to…
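As far as I know there is no built-in forwarding of signals into task code; a common workaround (a sketch; train_step is a hypothetical stand-in for the real training-loop body, and the scheduler address is a placeholder) is cooperative cancellation via a distributed.Variable that the task polls between steps:

    from dask.distributed import Client, Variable

    client = Client("111.111.11.11:8786")  # hypothetical scheduler address
    stop = Variable('stop', client=client)
    stop.set(False)

    def train_step(step):
        pass  # stand-in for one step of real training work

    def train(num_steps):
        stop_flag = Variable('stop')
        for step in range(num_steps):
            if stop_flag.get():
                break            # clean up threads/sessions here
            train_step(step)
        return step

    future = client.submit(train, 1000)
    try:
        future.result()
    except KeyboardInterrupt:
        stop.set(True)           # ask the task to stop rather than killing it
        print(future.result())   # wait for the graceful shutdown to finish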

Can we create a Dask cluster with both multiple CPU machines and multiple GPU machines?

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 08:04:02
Question: Can we create a dask cluster with some CPU and some GPU machines together? If yes, how do we specify that a certain task must run only on a CPU machine, or that some other type of task should run only on a GPU machine, while an unconstrained task picks whichever machine is free? Does dask support this type of cluster? What is the command that pins a task to a specific CPU/GPU machine? Answer 1: You can specify that a Dask worker has certain abstract resources: dask-worker scheduler:8786 -…
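The answer is truncated above; for completeness, here is a sketch of Dask's abstract worker resources (gpu_task is a hypothetical function):

    # Start GPU workers with an abstract resource tag, e.g.:
    #   dask-worker scheduler:8786 --resources "GPU=1"
    from dask.distributed import Client

    client = Client('scheduler:8786')

    def gpu_task(x):
        return x  # hypothetical GPU-bound work

    # Scheduled only on workers that advertise a GPU resource; tasks
    # submitted without `resources` may run on any free worker.
    future = client.submit(gpu_task, 42, resources={'GPU': 1})
    print(future.result())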

Access a single element in large published array with Dask

Submitted by 一个人想着一个人 on 2019-12-11 06:37:53
Question: Is there a faster way to retrieve only a single element from a large published array with Dask, without retrieving the entire array? In the example below, client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').

    import distributed

    client = distributed.Client()
    data = [1] * 10000000
    payload = {'array1': data}
    client.publish_dataset(**payload)

    one_element = client.get_dataset('array1')[0]

Answer 1: Note that anything you publish goes to the scheduler, not to the workers, so…
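One approach (a sketch, not necessarily what the truncated answer goes on to recommend): publish a chunked dask array rather than a concrete list. Indexing the published collection then stays lazy, and compute() pulls only the chunk containing the requested element.

    import distributed
    import dask.array as da

    client = distributed.Client()

    x = da.ones(10_000_000, chunks=(100_000,))
    client.publish_dataset(array1=x)

    # Lazy indexing: only the 100_000-element chunk holding index 0
    # is computed and transferred.
    one_element = client.get_dataset('array1')[0].compute()
    print(one_element)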