dask

thread.lock during custom parameter search class using Dask distributed

Submitted by 落花浮王杯 on 2019-12-11 08:06:03

Question: I wrote my own parameter-search implementation, mostly because I don't need the cross-validation that scikit-learn's GridSearch and RandomizedSearch provide. I use Dask to get optimal distributed performance. Here is what I have:

from scipy.stats import uniform

class Params(object):
    def __init__(self, fixed, loc=0.0, scale=1.0):
        self.fixed = fixed
        self.sched = uniform(loc=loc, scale=scale)

    def _getsched(self, i, size):
        return self.sched.rvs(size=size, random_state=i)

    def param(self, i, size=None):
        tmp
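The snippet is cut off before the failing call, but the title points at a serialization error (thread.lock objects are not picklable). A minimal sketch of one way to distribute such a search, assuming a hypothetical evaluate() scoring function, is to draw the samples locally and map plain numbers over the cluster, so the Params object never has to be pickled:

from dask.distributed import Client
from scipy.stats import uniform

def evaluate(value, train, test):
    # Hypothetical stand-in for the user's model fit / scoring step.
    return (value - 0.5) ** 2

client = Client()

# Sample the parameters on the client; only plain floats travel to the workers.
samples = uniform(loc=0.0, scale=1.0).rvs(size=100, random_state=0)
futures = client.map(evaluate, list(samples), train=None, test=None)
results = client.gather(futures)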

Can we create a Dask cluster with both multiple CPU machines and multiple GPU machines?

Submitted by 不打扰是莪最后的温柔 on 2019-12-11 08:04:02

Question: Can we create a Dask cluster with some CPU machines and some GPU machines together? If yes, how do we make sure that a certain task runs only on a CPU machine, that some other kind of task runs only on a GPU machine, and that a task with no requirement runs on whichever machine is free? Does Dask support this type of cluster? What is the command that pins a task to a specific CPU/GPU machine? Answer 1: You can specify that a Dask worker has certain abstract resources dask-worker scheduler:8786 -
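The answer is truncated above; for reference, here is a sketch of how Dask's abstract worker resources are typically used (the scheduler address, the resource name, and the train_on_gpu/preprocess functions are illustrative, not from the original answer):

# On each GPU machine, start the worker advertising a GPU resource:
#   dask-worker scheduler-address:8786 --resources "GPU=1"
# On CPU-only machines, start a plain worker with no resources.

from dask.distributed import Client

client = Client("scheduler-address:8786")

# Only workers that advertise a GPU resource may run this task.
gpu_future = client.submit(train_on_gpu, data, resources={"GPU": 1})

# Without a resources= constraint, the task runs on whichever worker is free.
cpu_future = client.submit(preprocess, data)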

Reshape a dask array (obtained from a dask dataframe column)

Submitted by 半腔热情 on 2019-12-11 07:56:48

Question: I am new to Dask and am trying to figure out how to reshape a dask array that I've obtained from a single column of a dask dataframe, and I am running into errors. Does anyone know the fix (without having to force a compute)? Thanks! Example:

import pandas as pd
import numpy as np
from dask import dataframe as dd, array as da

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)

# This does not work - error ValueError: cannot convert float
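The error message is cut off above, but it is most likely the usual symptom of reshaping a dask array with unknown (NaN) chunk sizes, which is what a dataframe column produces. A common workaround, assuming a small extra pass to count partition lengths is acceptable, looks like this:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)

# lengths=True computes the partition sizes, so the resulting dask array
# has known chunks and reshape() no longer fails on NaN shapes.
arr = ddf['x'].to_dask_array(lengths=True)
reshaped = arr.reshape(-1, 1)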

Dask: Jobs on multiple nodes with one worker, run on one node only

Submitted by 笑着哭i on 2019-12-11 07:26:48

Question: I am trying to process some files using a Python function and would like to parallelize the task on a PBS cluster using Dask. On the cluster I can only launch one job, but I have access to 10 nodes with 24 cores each. So my dask PBSCluster looks like:

import dask
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=240,
                     memory="1GB",
                     project='X',
                     queue='normal',
                     local_directory='$TMPDIR',
                     walltime='12:00:00',
                     resource_spec='select=10:ncpus=24:mem=1GB',
                     )
cluster.scale(1)  # one worker
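The snippet ends before any answer. One commonly suggested workaround for the one-job limit (my assumption, not taken from the truncated text) is to let the single multi-node PBS job launch the scheduler and workers itself, for example with the separate dask-mpi package, and then connect to it from Python through a scheduler file:

# Inside the single PBS job script (shell, not Python), something like:
#   mpirun -np 240 dask-mpi --scheduler-file scheduler.json
# starts one scheduler process and turns the remaining MPI ranks into
# workers spread across the 10 reserved nodes.

from dask.distributed import Client

# The path is hypothetical; it only has to match the file given to dask-mpi.
client = Client(scheduler_file="scheduler.json")
print(client)  # should report workers from all 10 nodes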

Add a unique identifier in a new column until a condition met on another column

Submitted by 妖精的绣舞 on 2019-12-11 06:38:33

Question: I have a dask dataframe with npartitions=8; here is a snapshot of the data:

id1  id2  Page_nbr  record_type
St1  Sc1  3         START
Sc1  St1  5         ADD
Sc1  St1  9         OTHER
Sc2  St2  34        START
Sc2  St2  45        DURATION
Sc2  St2  65        END
Sc3  Sc3  4         START

I want to add a column after record_type holding a unique group_id based on record_type: all rows up to the next record_type=START share the same unique group_id. The output should look like this:

id1  id2  Page_nbr  record_type  group_id
St1  Sc1  3         START        1
Sc1  St1  5         ADD          1
Sc1
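Since every START row opens a new group, a running count of START rows gives the group_id. A sketch (my own, not a quoted answer), assuming the rows are already in the order shown and that order is preserved across partitions:

import pandas as pd
import dask.dataframe as dd

# Reproduction of the snapshot above.
pdf = pd.DataFrame({
    "id1": ["St1", "Sc1", "Sc1", "Sc2", "Sc2", "Sc2", "Sc3"],
    "id2": ["Sc1", "St1", "St1", "St2", "St2", "St2", "Sc3"],
    "Page_nbr": [3, 5, 9, 34, 45, 65, 4],
    "record_type": ["START", "ADD", "OTHER", "START", "DURATION", "END", "START"],
})
ddf = dd.from_pandas(pdf, npartitions=8)

# Cumulative sum of the START indicator; dask's cumsum works across partitions.
ddf["group_id"] = (ddf["record_type"] == "START").astype(int).cumsum()

print(ddf.compute())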

Access a single element in large published array with Dask

Submitted by 一个人想着一个人 on 2019-12-11 06:37:53

Question: Is there a faster way to retrieve only a single element of a large published array with Dask, without retrieving the entire array? In the example below, client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').

import distributed

client = distributed.Client()
data = [1] * 10000000
payload = {'array1': data}
client.publish_dataset(**payload)

one_element = client.get_dataset('array1')[0]

Answer 1: Note that anything you publish goes to the scheduler, not to the workers, so
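The answer is cut off above. Since a published plain list sits on the scheduler, a sketch of the usual alternative, publishing a chunked dask array whose pieces are persisted on the workers so that indexing only fetches one chunk, could look like this:

import numpy as np
import distributed
import dask.array as da

client = distributed.Client()

# Publish a chunked dask array instead of a plain Python list: the chunks
# live on the workers and only the small task graph is published.
data = da.from_array(np.ones(10_000_000), chunks=1_000_000).persist()
client.publish_dataset(array1=data)

# Retrieving one element now only pulls the single chunk containing it.
one_element = client.get_dataset('array1')[0].compute()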

Parameter search using dask

Submitted by 折月煮酒 on 2019-12-11 05:29:46

Question: How do I optimally search a parameter space using Dask (no cross-validation)? Here is the code (no Dask here):

def build(ntries, param, niter, func, score, train, test):
    res = []
    for i in range(ntries):
        cparam = param.rvs(size=niter, random_state=i)
        res.append(func(cparam, train, test, score))
    return res

def score(test, correct):
    return np.linalg.norm(test - correct)

def compute_optimal(res):
    from operator import itemgetter
    _sorted = sorted(res, key=itemgetter(1))
    return _sorted

def func(c, train, test, score):
    dt = 1
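A minimal sketch of distributing the outer loop (my own illustration, not a quoted answer): turn each try into a dask.delayed task so the ntries evaluations run in parallel, then compute them together.

import dask
from dask.distributed import Client

client = Client()  # or point at an existing scheduler

def build_dask(ntries, param, niter, func, score, train, test):
    # Hypothetical drop-in replacement for build(): same arguments,
    # but each evaluation becomes a lazy task executed on the cluster.
    tasks = []
    for i in range(ntries):
        cparam = param.rvs(size=niter, random_state=i)
        tasks.append(dask.delayed(func)(cparam, train, test, score))
    return list(dask.compute(*tasks))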

Avoid simultaneously reading multiple files for a dask array

Submitted by 房东的猫 on 2019-12-11 04:43:19

Question: From a library, I get a function that reads a file and returns a numpy array. I want to build a Dask array with multiple blocks from multiple files, where each block is the result of calling the function on one file. When I ask Dask to compute, will Dask ask those functions to read multiple files from the hard disk at the same time? If so, how can I avoid that? My computer doesn't have a parallel file system. Example:

import numpy as np
import dask.array as da
import dask

# Make test data
n
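One way to keep the computation parallel while serializing disk access (a sketch under my own assumptions, not the original answer) is to wrap the file read in a distributed Lock, so only one task touches the disk at a time; np.load stands in for the library's reader, and the file names and shapes are hypothetical:

import numpy as np
import dask
import dask.array as da
from dask.distributed import Client, Lock

client = Client()

def read_file(path):
    # The lock serializes disk access across all workers; everything that
    # happens after the read can still run in parallel.
    with Lock("disk"):
        return np.load(path)

paths = ["block0.npy", "block1.npy"]
blocks = [
    da.from_delayed(dask.delayed(read_file)(p), shape=(1000,), dtype=float)
    for p in paths
]
arr = da.concatenate(blocks)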

Initializing state on dask-distributed workers

Submitted by 孤人 on 2019-12-11 04:06:50

Question: I am trying to do something like

resource = MyResource()

def fn(x):
    something = dosemthing(x, resource)
    return something

client = Client()
results = client.map(fn, data)

The issue is that resource is not serializable and is expensive to construct, so I would like to construct it once on each worker and have it available to fn. How do I do this? Or is there some other way to make resource available on all workers? Answer 1: You can always construct a lazy resource, something like
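The answer stops right after "something like"; one sketch of such a lazy resource, assuming MyResource, dosemthing and data are the objects from the question and that this code lives in a module importable on the workers, might be:

from dask.distributed import Client

_resource = None

def get_resource():
    # Build the expensive object lazily, once per worker process, and cache
    # it in a module-level global, so the unserializable MyResource instance
    # itself never has to be pickled and shipped.
    global _resource
    if _resource is None:
        _resource = MyResource()
    return _resource

def fn(x):
    return dosemthing(x, get_resource())

client = Client()
results = client.map(fn, data)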

Update of dask's dataframe

Submitted by 送分小仙女□ on 2019-12-11 03:12:56

Question: I'm new to Dask, so could you help me please? I have a csv file like this:

id,popularity,hashtag,seen
0,100,#footbal,0
1,200,#2017,0
2,300,#1,0

and somehow I managed to get a dask dataframe hashtags_to_update:

id  seen
0   118
2   136

I'd like to merge the data from hashtags_to_update with the data from the csv file to get:

id,popularity,hashtag,seen
0,100,#footbal,118
1,200,#2017,0
2,300,#1,136

For now I'm doing the following:

hashtags_df = dd.read_csv('path/to/csv/file').set_index('id')
hashtags_df["seen"]
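The snippet ends mid-expression; one way to express the update (my sketch, not the accepted answer), assuming hashtags_to_update has an 'id' column as shown, is a left join on the index followed by a fallback to the old value:

import dask.dataframe as dd

hashtags_df = dd.read_csv('path/to/csv/file').set_index('id')
updates = hashtags_to_update.set_index('id')        # single column: 'seen'

merged = hashtags_df.merge(updates, how='left',
                           left_index=True, right_index=True,
                           suffixes=('', '_new'))

# Use the updated count where one exists, otherwise keep the original value.
merged['seen'] = merged['seen_new'].where(merged['seen_new'].notnull(),
                                          merged['seen']).astype('int64')
result = merged.drop('seen_new', axis=1)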