dask-distributed

Parameter search using dask

折月煮酒 submitted on 2019-12-11 05:29:46
Question: How do I optimally search a parameter space using Dask? (no cross validation) Here is the code (no Dask here):

    def build(ntries, param, niter, func, score, train, test):
        res = []
        for i in range(ntries):
            cparam = param.rvs(size=niter, random_state=i)
            res.append(func(cparam, train, test, score))
        return res

    def score(test, correct):
        return np.linalg.norm(test - correct)

    def compute_optimal(res):
        from operator import itemgetter
        _sorted = sorted(res, key=itemgetter(1))
        return _sorted

    def func(c, train, test, score):
        dt = 1
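
One way to parallelize this search with dask.distributed is to submit one task per random seed; a minimal sketch (the `try_one` and `build_dask` wrappers and the use of `client.submit`/`client.gather` are my illustration, assuming `param` is a scipy frozen distribution as above):

    from distributed import Client

    def try_one(i, param, niter, func, score, train, test):
        # One independent trial: draw a candidate batch and evaluate it.
        cparam = param.rvs(size=niter, random_state=i)
        return func(cparam, train, test, score)

    def build_dask(client, ntries, param, niter, func, score, train, test):
        # One task per seed; the scheduler spreads them across workers.
        futures = [client.submit(try_one, i, param, niter, func, score, train, test)
                   for i in range(ntries)]
        return client.gather(futures)

    # Usage sketch:
    # client = Client()
    # res = build_dask(client, ntries, param, niter, func, score, train, test)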

Initializing state on dask-distributed workers

孤人 submitted on 2019-12-11 04:06:50
Question: I am trying to do something like

    resource = MyResource()

    def fn(x):
        something = dosomething(x, resource)
        return something

    client = Client()
    results = client.map(fn, data)

The issue is that resource is not serializable and is expensive to construct. Therefore I would like to construct it once on each worker and have it available for fn to use. How do I do this? Or is there some other way to make resource available on all workers?

Answer 1: You can always construct a lazy resource, something like
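
A minimal sketch of that lazy-resource pattern (my illustration; it assumes `MyResource` and `dosomething` from the question, and uses a module-level cache so each worker process builds the resource at most once):

    from distributed import Client

    _resource = None  # per-process cache; each worker builds its own

    def get_resource():
        # Construct MyResource the first time a task on this worker
        # needs it; subsequent tasks reuse the cached instance.
        global _resource
        if _resource is None:
            _resource = MyResource()
        return _resource

    def fn(x):
        resource = get_resource()
        return dosomething(x, resource)

    client = Client()
    results = client.map(fn, data)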

Dask Distributed client takes too long to initialize in Jupyter Lab

妖精的绣舞 submitted on 2019-12-11 02:46:21
Question: Trying to initialize a client with a local cluster in Jupyter Lab, but it hangs. This behaviour happens with Python 3.5 and Jupyter Lab 0.35.

    import dask.dataframe as dd
    from dask import delayed
    from distributed import Client
    from distributed import LocalCluster
    import pandas as pd
    import numpy as np
    import json

    cluster = LocalCluster()
    client = Client(cluster)
    client

Versions of the tools:

    Python 3.5.2 (default, Nov 23 2017, 16:37:01)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright",

Generating batches of images in dask

心已入冬 submitted on 2019-12-10 22:48:09
Question: I just started with Dask because it offers great parallel processing power. I have around 40000 images on my disk which I am going to use for building a classifier using some DL library, say Keras or TF. I collected this meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:

       img_path    labels
    0  data/1.JPG       1
    1  data/2.JPG       1
    2  data/3.JPG       5
    ...

Now here is my simple task: use Dask to read images and corresponding labels in a lazy fashion. Do some processing on
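
A minimal sketch of the lazy-reading step (my illustration; it assumes the dataframe above is named `df` and uses `imageio` for decoding, which the question does not specify):

    import dask
    import imageio
    import numpy as np

    @dask.delayed
    def load_image(path):
        # Nothing touches the disk until the delayed graph is computed.
        return np.asarray(imageio.imread(path))

    lazy_images = [load_image(p) for p in df['img_path']]
    labels = df['labels'].values

    # Materialize one batch at a time, e.g. to feed Keras/TF:
    batch = dask.compute(*lazy_images[:32])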

What is the “right” way to close a Dask LocalCluster?

蓝咒 submitted on 2019-12-10 03:17:17
Question: I am trying to use dask.distributed on my laptop using a LocalCluster, but I have still not found a way to let my application close without raising some warnings or triggering some strange interactions with matplotlib (I am using the tkAgg backend). For example, if I close both the client and the cluster in this order, then tk cannot remove the image from memory in an appropriate way, and I get the following error:

    Traceback (most recent call last):
      File "/opt/Python-3.6.0/lib/python3.6
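
For reference, a common shutdown pattern (a sketch of general practice, not a confirmed fix for the tkAgg interaction): close the client before the cluster, or let context managers handle the ordering:

    from distributed import Client, LocalCluster

    # Explicit ordering: client first, then cluster.
    cluster = LocalCluster()
    client = Client(cluster)
    try:
        pass  # ... do work ...
    finally:
        client.close()
        cluster.close()

    # Equivalently, both objects support the context-manager protocol:
    with LocalCluster() as cluster, Client(cluster) as client:
        pass  # ... do work ...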

ValueError: Not all divisions are known, can't align partitions error on dask dataframe

自作多情 submitted on 2019-12-08 21:45:48
Question: I have a pandas dataframe with the following columns:

    user_id
    user_agent_id
    requests

All columns contain integers. I want to perform some operations on them and run them using a dask dataframe. This is what I do:

    user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
        .groupby(['user_id', 'user_agent_id']) \
        .size().to_frame(name='appearances') \
        .reset_index()  # I am not sure I can run this on a dask dataframe

    user_profile_ddf = df.from_pandas(user_profile,
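
For context, a minimal sketch of how divisions become known or unknown (my example data; the column names follow the question):

    import pandas as pd
    import dask.dataframe as dd

    user_profile = pd.DataFrame({
        'user_id': [1, 1, 2],
        'user_agent_id': [10, 11, 10],
        'requests': [5, 3, 7],
    })

    # A sorted RangeIndex lets from_pandas compute divisions up front.
    ddf = dd.from_pandas(user_profile, npartitions=2)
    print(ddf.known_divisions)  # True

    # Operations that reshuffle rows (reset_index is one example) can
    # leave divisions unknown, which triggers the alignment error when
    # two such dataframes are combined.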

Does Dask communicate with HDFS to optimize for data locality?

◇◆丶佛笑我妖孽 submitted on 2019-12-08 05:06:49
Question: In the Dask distributed documentation, they have the following information: "For example Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv'), Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve

File Not Found Error in Dask program run on cluster

与世无争的帅哥 submitted on 2019-12-08 04:51:12
Question: I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1. I've put a CSV file on M1; the rest of the machines are workers. When I run the program with read_csv in Dask, it gives me "Error, file not found".

Answer 1: When one of your workers tries to load the CSV, it will not be able to find it, because it is not present on that local disc. This should not be a surprise. You can get around this in a number of ways:

- copy the file to every worker; this is obviously wasteful in terms of disc space, but the easiest to achieve
- place the file on a networked filesystem (NFS mount, gluster,
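
Another option, not among those in the truncated answer above (my sketch; the scheduler address and file name are hypothetical): read the file on the machine that has it and scatter the result to the workers:

    import pandas as pd
    from distributed import Client

    client = Client('tcp://M1:8786')   # hypothetical scheduler address

    # Read on the client side (M1, where the file lives) ...
    df = pd.read_csv('data.csv')       # hypothetical path

    # ... then ship the frame to the workers as a future.
    [df_future] = client.scatter([df])

    n_rows = client.submit(len, df_future).result()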

How do we choose --nthreads and --nprocs per worker in Dask distributed?

早过忘川 submitted on 2019-12-05 04:56:18
How do we choose --nthreads and --nprocs per worker in Dask distributed? I have 3 workers: two with 4 cores and one thread per core, and one with 8 cores (according to the output of the 'lscpu' Linux command on each worker).

It depends on your workload. By default Dask creates a single process with as many threads as you have logical cores on your machine (as determined by multiprocessing.cpu_count()):

    dask-worker ... --nprocs 1 --nthreads 8  # assuming you have eight cores
    dask-worker ...                          # this is actually the default setting

Using few processes and many threads per process is good if
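
The same trade-off can be sketched in Python when starting a local cluster (my numbers, chosen for illustration; LocalCluster exposes the equivalent knobs as n_workers and threads_per_worker):

    from distributed import Client, LocalCluster

    # One process, many threads: suits numpy/pandas-style code
    # that releases the GIL.
    cluster = LocalCluster(n_workers=1, threads_per_worker=8)

    # Several processes, fewer threads each: better for pure-Python,
    # GIL-bound code, at the cost of more memory and communication.
    # cluster = LocalCluster(n_workers=4, threads_per_worker=2)

    client = Client(cluster)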