dask

How to efficiently submit tasks with large arguments in Dask distributed?

Submitted by ≯℡__Kan透↙ on 2019-12-18 07:30:20
Question: I want to submit functions with Dask that have large (gigabyte scale) arguments. What is the best way to do this? I want to run this function many times with different (small) parameters.

Example (bad)

This uses the concurrent.futures interface. We could use the dask.delayed interface just as easily.

    x = np.random.random(size=100000000)  # 800MB array
    params = list(range(100))             # 100 small parameters

    def f(x, param):
        pass

    from dask.distributed import Client
    c = Client()
    futures = [c.submit(f, x
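
One pattern often recommended for this situation, sketched here rather than quoted from any accepted answer, is to scatter the large array to the cluster once and pass the resulting future to submit, so each task ships only a lightweight reference instead of re-serializing the array:

    import numpy as np
    from dask.distributed import Client

    client = Client()

    x = np.random.random(size=10000000)   # ~80 MB here; the question's array is ~800 MB
    params = list(range(100))

    def f(x, param):
        pass

    # Send the big array to the workers once; submit() then passes around
    # only the future, not the data itself.
    x_future = client.scatter(x, broadcast=True)
    futures = [client.submit(f, x_future, p) for p in params]
    results = client.gather(futures)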

dask dataframe read parquet schema difference

Submitted by 元气小坏坏 on 2019-12-18 07:09:28
Question: I do the following:

    import dask.dataframe as dd
    from dask.distributed import Client
    client = Client()
    raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True,
                              parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

The dataset comes from a presentation Matthew Rocklin gave and was used as a dask dataframe demo. Then I try to write it to parquet using pyarrow:

    raw_data_df.to_parquet(path='dataset/parquet/2015.parquet/')  # only pyarrow is installed

Trying to
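
Schema differences with CSV-derived dask dataframes often come from per-partition dtype inference; the sketch below shows one possible way to pin dtypes before writing, with the problem column name ('store_and_fwd_flag') chosen purely for illustration:

    import dask.dataframe as dd

    # Force consistent dtypes up front so every partition writes the same
    # parquet schema; the dtype override below is an assumed example.
    raw_data_df = dd.read_csv(
        'dataset/nyctaxi/nyctaxi/*.csv',
        assume_missing=True,
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
        dtype={'store_and_fwd_flag': 'object'},
    )
    raw_data_df.to_parquet('dataset/parquet/2015.parquet/', engine='pyarrow')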

python-xarray: open_mfdataset concat along two dimensions

Submitted by こ雲淡風輕ζ on 2019-12-18 04:13:07
Question: I have files which are made of 10 ensembles and 35 time files. One of these files looks like:

    >>> xr.open_dataset('ens1/CCSM4_ens1_07ic_19820701-19820731_NPac_Jul.nc')
    <xarray.Dataset>
    Dimensions:    (ensemble: 1, latitude: 66, longitude: 191, time: 31)
    Coordinates:
      * ensemble   (ensemble) int32 1
      * latitude   (latitude) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 ...
      * longitude  (longitude) float32 100.0 101.0 102.0 103.0 104.0 105.0 106.0 ...
      * time       (time) datetime64[ns] 1982-07-01 1982-07-02
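
One possible approach, assuming ten ensemble directories laid out like the file above and xarray >= 0.12.2, is to pass open_mfdataset a nested list of file lists so it concatenates along time within each inner list and along ensemble across lists; the glob pattern is an assumption about the file naming:

    import xarray as xr
    from glob import glob

    # Outer list = ensembles, inner lists = time files for one ensemble;
    # concat_dim is ordered from outermost to innermost nesting level.
    files = [sorted(glob('ens%d/CCSM4_ens%d_*_NPac_*.nc' % (e, e)))
             for e in range(1, 11)]
    ds = xr.open_mfdataset(files, combine='nested', concat_dim=['ensemble', 'time'])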

What is the role of npartitions in a Dask dataframe?

Submitted by 做~自己de王妃 on 2019-12-14 03:40:47
Question: I see the parameter npartitions in many functions, but I don't understand what it is good for / used for.

http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

head(...): Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

repartition(...): Number of partitions of output, must be less than npartitions of input.
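
A small illustration of what npartitions controls: it is simply the number of underlying pandas pieces the dask dataframe is split into, and many operations run once per piece (the toy dataframe below is only for demonstration):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'x': range(10)})
    df = dd.from_pandas(pdf, npartitions=4)

    print(df.npartitions)                    # 4 pandas pieces under the hood
    print(df.map_partitions(len).compute())  # rows held by each piece
    df2 = df.repartition(npartitions=2)      # fewer, larger pieces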

Repeated task execution using the distributed Dask scheduler

Submitted by 痴心易碎 on 2019-12-14 01:22:26
Question: I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute(). When the number of tasks is, say, 20 (a number much greater than the number of workers) and each task takes at least 15 seconds, the scheduler starts rerunning some of the tasks (or executes them in parallel more than once). This is a problem because the tasks modify a SQL db, and if they run again they end up raising an Exception (due to DB uniqueness constraints). I'm
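
One mitigation sometimes suggested for long-running tasks with side effects, offered here only as a hedged sketch and not a guaranteed fix, is to disable work stealing (so a task is not speculatively moved between workers) and to mark the tasks as impure:

    import dask
    from dask.distributed import Client

    # Set the config before the scheduler is created so it takes effect.
    dask.config.set({'distributed.scheduler.work-stealing': False})
    client = Client()

    def write_row(i):
        pass   # the real task would insert into the SQL db here

    # pure=False tells Dask these calls must never be treated as cacheable.
    tasks = [dask.delayed(write_row, pure=False)(i) for i in range(20)]
    dask.compute(*tasks)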

Getting year and week from a datetime series in a dask dataframe?

Submitted by 旧街凉风 on 2019-12-14 00:34:57
Question: If I have a Pandas dataframe with a column that is a datetime type, I can get the year as follows:

    df['year'] = df['date'].dt.year

With a dask dataframe, that does not work. If I compute first, like this:

    df['year'] = df['date'].compute().dt.year

I get:

    ValueError: Not all divisions are known, can't align partitions. Please use set_index or set_partition to set the index.

But if I do:

    df['date'].head().dt.year

it works fine! So how do I get the year (or week) of a datetime series in a dask
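
In practice the .dt accessor works lazily on a dask series, so no compute() is needed before extracting the year; a minimal sketch with a toy dataframe:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'date': pd.date_range('2015-01-01', periods=10, freq='D')})
    df = dd.from_pandas(pdf, npartitions=2)

    df['year'] = df['date'].dt.year   # lazy, no compute() required
    # week-of-year extraction is analogous via the same .dt accessor
    print(df.head())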

column with dates into datetime index in dask

Submitted by 六月ゝ 毕业季﹏ on 2019-12-13 17:33:18
Question: I have a dask dataframe for which I want to convert a column with dates into a datetime index:

    pd.DatetimeIndex(df_dask_dataframe['name_col'])

However, I get a not-implemented error. Is there a workaround?

Answer 1: I think you need dask.dataframe.DataFrame.set_index if the dtype of the column is datetime64:

    df_dask_dataframe = df_dask_dataframe.set_index('name_col')

Source: https://stackoverflow.com/questions/44992508/column-with-dates-into-datetime-index-in-dask
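
A small sketch building on the answer above, assuming the column may hold date strings rather than datetime64 values, so it is converted with dd.to_datetime before being set as the index:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'name_col': ['2017-07-01', '2017-07-02', '2017-07-03'],
                        'x': [1, 2, 3]})
    df = dd.from_pandas(pdf, npartitions=1)

    df['name_col'] = dd.to_datetime(df['name_col'])  # ensure datetime64 dtype
    df = df.set_index('name_col')                    # column becomes the index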

Understanding the process of loading multiple file contents into Dask Array and how it scales

Submitted by 浪尽此生 on 2019-12-13 17:12:22
Question: Using the example at http://dask.pydata.org/en/latest/array-creation.html:

    filenames = sorted(glob('2015-*-*.hdf5'))
    dsets = [h5py.File(fn)['/data'] for fn in filenames]
    arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
    x = da.concatenate(arrays, axis=0)  # Concatenate arrays along first axis

I'm having trouble understanding the next line and whether it's a dask array of "dask arrays" or a "normal" np array which points to as many dask arrays as there were datasets in all the
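
For what it's worth, da.concatenate returns a single lazy dask array whose chunks point back at the individual sources; the sketch below uses in-memory numpy arrays in place of the h5py datasets purely to illustrate that:

    import numpy as np
    import dask.array as da

    # Stand-ins for the three HDF5 datasets; nothing is loaded until compute().
    dsets = [np.random.random((2000, 1000)) for _ in range(3)]
    arrays = [da.from_array(d, chunks=(1000, 1000)) for d in dsets]
    x = da.concatenate(arrays, axis=0)

    print(type(x))    # dask.array.core.Array, not a numpy array
    print(x.shape)    # (6000, 1000) -- one logical array
    print(x.chunks)   # chunk layout spanning all three sources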

Dask DataFrame: Resample over groupby object with multiple rows

Submitted by 痴心易碎 on 2019-12-13 12:11:59
Question: I have the following dask dataframe created from Castra:

    import dask.dataframe as dd
    df = dd.from_castra('data.castra', columns=['user_id', 'ts', 'text'])

Yielding:

                          user_id / ts / text
    ts
    2015-08-08 01:10:00   9235   2015-08-08 01:10:00   a
    2015-08-08 02:20:00   2353   2015-08-08 02:20:00   b
    2015-08-08 02:20:00   9235   2015-08-08 02:20:00   c
    2015-08-08 04:10:00   9235   2015-08-08 04:10:00   d
    2015-08-08 08:10:00   2353   2015-08-08 08:10:00   e

What I'm trying to do is:
- Group by user_id and ts
- Resample it over a 3-hour
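
One workaround, sketched here under the assumption that missing (empty) 3-hour bins are acceptable, is to build an explicit 3-hour bucket with dt.floor and group on it, which mimics groupby + resample without needing resample on a dask groupby object:

    import pandas as pd
    import dask.dataframe as dd

    # Toy data mirroring the excerpt above.
    pdf = pd.DataFrame({
        'user_id': [9235, 2353, 9235, 9235, 2353],
        'ts': pd.to_datetime(['2015-08-08 01:10', '2015-08-08 02:20',
                              '2015-08-08 02:20', '2015-08-08 04:10',
                              '2015-08-08 08:10']),
        'text': list('abcde'),
    })
    df = dd.from_pandas(pdf, npartitions=2)

    df['ts_3h'] = df['ts'].dt.floor('3h')                       # explicit 3-hour bucket
    counts = df.groupby(['user_id', 'ts_3h'])['text'].count().compute()
    print(counts)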

Memory error with dask array

Submitted by ⅰ亾dé卋堺 on 2019-12-13 07:03:46
Question: I am implementing a Neural Network whose input and output matrices are very large, so I am using dask arrays for storing them. X is an input matrix of 32000 x 7500 and y is an output matrix of the same dimension. Below is the neural network code with 1 hidden layer:

    class Neural_Network(object):
        def __init__(self, i, j, k):
            # define hyperparameters
            self.inputLayerSize = i
            self.outputLayerSize = j
            self.hiddenLayerSize = k
            # weights
            self.W1 = da.random.normal(0.5, 0.5, size=(self.inputLayerSize, self
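
A minimal sketch, with sizes and chunking assumed rather than taken from the full code, of keeping both the data and the weights as chunked dask arrays so a forward pass stays lazy and is evaluated blockwise instead of all at once:

    import dask.array as da

    n, d, h = 32000, 7500, 100
    X  = da.random.random((n, d), chunks=(1000, d))          # row-wise chunks
    W1 = da.random.normal(0.5, 0.5, size=(d, h), chunks=(d, h))

    z2 = X.dot(W1)                  # (n, h), still lazy
    a2 = 1.0 / (1.0 + da.exp(-z2))  # sigmoid activation, also lazy
    print(a2[:5].compute().shape)   # only this small slice is materialized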