dask

Dask broadcast not available during compute graph

萝らか妹 submitted on 2019-12-24 10:49:57
Question: I am experimenting with Dask and want to ship a lookup pandas.DataFrame to all worker nodes. Unfortunately, it fails with:

TypeError: ("'Future' object is not subscriptable", 'occurred at index 0')

When I use lookup.result()['foo'].iloc[2] instead of lookup['baz'].iloc[2], it works fine, but for larger instances of the input dataframe it seems to be stuck at from_pandas again and again. Also, it seems strange that the future needs to be blocked on manually (over and over again, for each row in…
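One pattern that usually avoids this (a minimal sketch; the frame and names here are illustrative, not from the original post): scatter the lookup table once with broadcast=True, then pass the resulting future as a task argument, so each task receives the materialized DataFrame rather than a Future and no .result() calls are needed:

```python
import pandas as pd
from dask.distributed import Client

client = Client()  # local cluster, for illustration

lookup = pd.DataFrame({'foo': [1, 2, 3], 'baz': [4, 5, 6]})
# ship one copy of the lookup table to every worker up front
lookup_future = client.scatter(lookup, broadcast=True)

def use_lookup(lkp):
    # `lkp` arrives here as a real DataFrame, not a Future
    return lkp['baz'].iloc[2]

print(client.submit(use_lookup, lookup_future).result())  # -> 6
```

The same future can be passed as an extra argument to map_partitions-style calls; the point is that futures handed to tasks as arguments are resolved on the workers, which removes the per-row blocking.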

How to use Dask to run python code on the GPU?

▼魔方 西西 submitted on 2019-12-24 07:17:24
Question: I have some code that uses Numba's cuda.jit so that it runs on the GPU, and I would like to layer Dask on top of it if possible. Example code:

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from numba import cuda, njit
import numpy as np
from dask.distributed import Client, LocalCluster

@cuda.jit()
def addingNumbersCUDA(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]
```
…
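A minimal sketch of one way to combine the two, assuming a CUDA-capable GPU and continuing the snippet above (the grid sizes and single-GPU cluster layout are assumptions, not from the original post): wrap the kernel launch in an ordinary host function and submit that function to the cluster:

```python
import numpy as np
from dask.distributed import Client, LocalCluster
# addingNumbersCUDA is the @cuda.jit kernel defined above

def launch_kernel(a, b):
    # host-side wrapper: Numba copies the numpy inputs to the device,
    # runs the kernel over an explicit 1-D grid, and copies the result back
    out = np.zeros_like(a)
    threads_per_block = 128
    blocks = (a.shape[0] + threads_per_block - 1) // threads_per_block
    addingNumbersCUDA[blocks, threads_per_block](a, b, out)
    return out

if __name__ == '__main__':
    # one single-threaded worker per GPU is the usual layout
    cluster = LocalCluster(n_workers=1, threads_per_worker=1)
    client = Client(cluster)
    a = np.random.rand(10_000, 16)
    b = np.random.rand(10_000, 16)
    result = client.submit(launch_kernel, a, b).result()
```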

randomly mask/set nan x% of data points in huge xarray.DataArray

前提是你 submitted on 2019-12-24 04:01:15
Question: I have a huge (~2 billion data points) xarray.DataArray. I would like to randomly delete (either mask or replace by np.nan) a given percentage of the data, where the probability of being chosen for deletion/masking is the same for every data point across all coordinates. I can convert the array to a numpy.array, but I would prefer to keep it in dask chunks for speed. My data looks like this:

```
>>> data
<xarray.DataArray 'stack-820860ba63bd07adc355885d96354267' (variable: 8, time: 228, …
```
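One lazy approach, sketched here with made-up dimension names and chunk sizes: build a dask-backed boolean array of the same shape and use DataArray.where, which replaces the selected points with NaN without ever materializing the data:

```python
import dask.array as da
import xarray as xr

# toy stand-in for the real ~2-billion-point array
data = xr.DataArray(
    da.random.random((8, 228, 1000), chunks=(4, 57, 500)),
    dims=('variable', 'time', 'x'),
)

p = 0.05  # fraction of points to replace by NaN
keep = xr.DataArray(
    da.random.random(data.shape, chunks=data.chunks) >= p,
    dims=data.dims,
)
masked = data.where(keep)  # still lazy; evaluate with masked.compute()
```

Each point is dropped independently with probability p, so the same fraction is masked (in expectation) across all coordinates.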

Filtering grouped df in Dask

孤者浪人 submitted on 2019-12-24 03:27:57
Question: Related to this similar question for Pandas: filtering grouped df in pandas.

Action: eliminate groups based on an expression applied to a column other than the groupby column.

Problem: filter is not implemented for grouped dataframes in Dask.

Tried: groupby plus apply to eliminate certain groups, which returns an index error, presumably because the apply function is supposed to always return something?

```python
def filter_empty(df):
    if not df.label.values.all(4):
        return df

df_nonempty = df_norm.groupby('hash')…
```
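One workaround that maps well onto Dask (a sketch with invented data; the criterion "keep groups whose label column contains a non-zero value" stands in for the real expression): compute a per-group aggregate, merge it back onto the rows, and filter rows instead of calling filter on the groupby:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'hash': [1, 1, 2, 2], 'label': [0, 0, 0, 3]})
ddf = dd.from_pandas(pdf, npartitions=2)

# per-group criterion as a small aggregated frame
keep = (ddf.groupby('hash')['label'].max() != 0).to_frame('keep')

# broadcast the criterion back onto the rows, then filter
out = ddf.merge(keep, left_on='hash', right_index=True)
out = out[out['keep']].drop('keep', axis=1)
print(out.compute())  # only the rows of hash group 2 survive
```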

How to improve function using pd.groupby.transform in a dask environment

混江龙づ霸主 submitted on 2019-12-24 01:54:25
Question: We need to create groups based on a time sequence. We are working with Dask, but for this function we need to fall back to pandas, since transform is not yet implemented in Dask. Although the function works, is there any way to improve its performance? (We run our code on a local Client and sometimes on a yarn-client.) Below is our function and a minimal, complete and verifiable example:

```python
import pandas as pd
import numpy as np
import random
import dask
import dask.dataframe as dd
from …
```
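If the grouping key can be made the sorted index, one way to keep the work parallel is to run the pandas transform per partition via map_partitions; a sketch with invented columns, assuming no group spans two partitions (which the index-based divisions guarantee in this toy case):

```python
import pandas as pd
import dask.dataframe as dd

def add_group_mean(pdf):
    # plain pandas inside each partition, so transform is available
    pdf = pdf.copy()
    pdf['value_mean'] = pdf.groupby('user')['value'].transform('mean')
    return pdf

pdf = pd.DataFrame({'user': [1, 1, 2, 2], 'value': [1.0, 3.0, 2.0, 4.0]})
ddf = dd.from_pandas(pdf.set_index('user'), npartitions=2)

# each 'user' group sits entirely inside one partition
result = ddf.map_partitions(add_group_mean).compute()
```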

Override dask scheduler to concurrently load data on multiple workers

我怕爱的太早我们不能终老 submitted on 2019-12-24 01:11:49
Question: I want to run graphs/futures on my distributed cluster, all of which have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:

```python
from dask.distributed import Client
client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
```

Running this as above, the scheduler gets one worker to read the…
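One way to get the data onto every worker before training starts (a sketch reusing the names from the snippet above, which are assumed to be defined elsewhere): wait for the load to finish, then replicate its result so the training tasks can run anywhere without all pulling from the single loading worker:

```python
from dask.distributed import Client, wait

client = Client(scheduler_ip)  # as in the question

load_data_future = client.submit(load_data_func, 'path/to/data/')
wait(load_data_future)
# copy the loaded data to every worker in the cluster
client.replicate([load_data_future])

train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
```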

How to find why a task fails in dask distributed?

半世苍凉 submitted on 2019-12-24 01:04:11
Question: I am developing a distributed computing system using dask.distributed. Tasks that I submit to it with the Executor.map function sometimes fail, while others that seem identical run successfully. Does the framework provide any means to diagnose problems?

Update: by failing I mean that the counter of failed tasks in the Bokeh web UI provided by the scheduler increases; the counter of finished tasks increases too. The function run by Executor.map returns None. It communicates with a database, …
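The failure details are available on the futures themselves; a sketch (the scheduler address, my_task and inputs are placeholders):

```python
import traceback
from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder address
futures = client.map(my_task, inputs)      # my_task/inputs as in your code

for fut in futures:
    if fut.status == 'error':
        print(fut.exception())               # the exception raised remotely
        traceback.print_tb(fut.traceback())  # the remote traceback
```

client.recreate_error_locally(future) additionally re-runs a failed task in the local process, which makes it possible to attach a debugger.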

Drop column using Dask dataframe

僤鯓⒐⒋嵵緔 submitted on 2019-12-23 20:13:36
Question: This should work:

```python
raw_data.drop('some_great_column', axis=1).compute()
```

But the column is not dropped. In pandas I use:

```python
raw_data.drop(['some_great_column'], axis=1, inplace=True)
```

But inplace does not exist in Dask. Any ideas?

Answer 1: You can separate this into two operations:

```python
# dask operation
raw_data = raw_data.drop('some_great_column', axis=1)
# conversion to pandas
df = raw_data.compute()
```

Then export the pandas dataframe to a CSV file:

```python
df.to_csv(r'out.csv', index=False)
```

Source: https://stackoverflow…
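A side note on the answer above (version-dependent, so treat as a hint rather than gospel): Dask dataframes are immutable, which is why drop must be reassigned rather than used with inplace; recent Dask versions should also accept the pandas-style columns= keyword:

```python
raw_data = raw_data.drop(columns=['some_great_column'])
```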

Appending new column to dask dataframe

天大地大妈咪最大 submitted on 2019-12-23 17:39:25
Question: This is a follow-up question to Shuffling data in dask. I have an existing dask dataframe df where I wish to do the following:

```python
df['rand_index'] = np.random.permutation(len(df))
```

However, this gives the error Column assignment doesn't support type ndarray. I tried df.assign(rand_index=np.random.permutation(len(df))), which gives the same error. Here is a minimal (not) working sample:

```python
import pandas as pd
import dask.dataframe as dd
import numpy as np

df = dd.from_pandas(pd.DataFrame({'A…
```
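One workaround (a sketch; the toy frame is invented): chunk the permutation so its blocks line up with the dataframe's partitions, wrap it as a dask collection, and assign that instead of the raw ndarray:

```python
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

pdf = pd.DataFrame({'A': list('abcdefgh')})
df = dd.from_pandas(pdf, npartitions=2)

# one chunk per dataframe partition so the pieces line up
part_lens = tuple(df.map_partitions(len).compute())
perm = da.from_array(np.random.permutation(len(pdf)), chunks=(part_lens,))

df['rand_index'] = dd.from_dask_array(perm)
print(df.compute())
```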

Including keyword arguments (kwargs) in custom Dask graphs

↘锁芯ラ submitted on 2019-12-23 16:18:37
Question: I am building a custom graph for one operation with Dask. I am familiar with how to pass arguments to a function in a Dask graph and have read up on the docs, but I still seem to be missing something. One of the functions used in the Dask graph takes keyword arguments, and I am confused as to how the keyword arguments can be passed to it. Some of these keyword arguments represent Dask objects, so they must be in the graph explicitly (i.e. functools.partial won't work). I can see the following…
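One pattern that covers this (a sketch; apply lives in dask.utils in current releases, under dask.compatibility in very old ones): call the function through apply and build the kwargs dict inside the graph with a (dict, [[key, value], ...]) task, so that values which are graph keys get resolved like any other task argument:

```python
import dask
from dask.utils import apply

def combine(x, y, scale=1, offset=0):
    return (x + y) * scale + offset

dsk = {
    'x': 1,
    'y': 2,
    's': 10,
    # kwargs are a task themselves, so 's' is looked up in the graph
    'out': (apply, combine, ['x', 'y'],
            (dict, [['scale', 's'], ['offset', 5]])),
}
print(dask.get(dsk, 'out'))  # (1 + 2) * 10 + 5 == 35
```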