dask

How to efficiently submit tasks with large arguments in Dask distributed?

Submitted by ≯℡__Kan透↙ on 2019-12-18 07:30:20
Question: I want to submit functions with Dask that have large (gigabyte scale) arguments. What is the best way to do this? I want to run this function many times with different (small) parameters.

Example (bad)

This uses the concurrent.futures interface. We could use the dask.delayed interface just as easily.

    x = np.random.random(size=100000000)  # 800MB array
    params = list(range(100))             # 100 small parameters

    def f(x, param):
        pass

    from dask.distributed import Client
    c = Client()
    futures = [c.submit(f, x
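
One pattern often recommended for this situation, sketched here rather than quoted from any accepted answer, is to scatter the large array to the cluster once and pass the resulting future to submit, so each task ships only a lightweight reference instead of re-serializing the array:

    import numpy as np
    from dask.distributed import Client

    client = Client()

    x = np.random.random(size=10000000)   # ~80 MB here; the question's array is ~800 MB
    params = list(range(100))

    def f(x, param):
        pass

    # Send the big array to the workers once; submit() then passes around
    # only the future, not the data itself.
    x_future = client.scatter(x, broadcast=True)
    futures = [client.submit(f, x_future, p) for p in params]
    results = client.gather(futures)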

dask dataframe read parquet schema difference

Submitted by 元气小坏坏 on 2019-12-18 07:09:28
Question: I do the following:

    import dask.dataframe as dd
    from dask.distributed import Client
    client = Client()
    raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True,
                              parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

The dataset comes from a presentation Matthew Rocklin gave and was used as a dask dataframe demo. Then I try to write it to parquet using pyarrow:

    raw_data_df.to_parquet(path='dataset/parquet/2015.parquet/')  # only pyarrow is installed

Trying to
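
Schema differences with CSV-derived dask dataframes often come from per-partition dtype inference; the sketch below shows one possible way to pin dtypes before writing, with the problem column name ('store_and_fwd_flag') chosen purely for illustration:

    import dask.dataframe as dd

    # Force consistent dtypes up front so every partition writes the same
    # parquet schema; the dtype override below is an assumed example.
    raw_data_df = dd.read_csv(
        'dataset/nyctaxi/nyctaxi/*.csv',
        assume_missing=True,
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
        dtype={'store_and_fwd_flag': 'object'},
    )
    raw_data_df.to_parquet('dataset/parquet/2015.parquet/', engine='pyarrow')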

python-xarray: open_mfdataset concat along two dimensions

Submitted by こ雲淡風輕ζ on 2019-12-18 04:13:07
Question: I have files which are made of 10 ensembles and 35 time files. One of these files looks like:

    >>> xr.open_dataset('ens1/CCSM4_ens1_07ic_19820701-19820731_NPac_Jul.nc')
    <xarray.Dataset>
    Dimensions:    (ensemble: 1, latitude: 66, longitude: 191, time: 31)
    Coordinates:
      * ensemble   (ensemble) int32 1
      * latitude   (latitude) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 ...
      * longitude  (longitude) float32 100.0 101.0 102.0 103.0 104.0 105.0 106.0 ...
      * time       (time) datetime64[ns] 1982-07-01 1982-07-02
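
One possible approach, assuming ten ensemble directories laid out like the file above and xarray >= 0.12.2, is to pass open_mfdataset a nested list of file lists so it concatenates along time within each inner list and along ensemble across lists; the glob pattern is an assumption about the file naming:

    import xarray as xr
    from glob import glob

    # Outer list = ensembles, inner lists = time files for one ensemble;
    # concat_dim is ordered from outermost to innermost nesting level.
    files = [sorted(glob('ens%d/CCSM4_ens%d_*_NPac_*.nc' % (e, e)))
             for e in range(1, 11)]
    ds = xr.open_mfdataset(files, combine='nested', concat_dim=['ensemble', 'time'])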

What is the role of npartitions in a Dask dataframe?

Submitted by 做~自己de王妃 on 2019-12-14 03:40:47
Question: I see the parameter npartitions in many functions, but I don't understand what it is good for / used for.

http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

head(...): Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

repartition(...): Number of partitions of output, must be less than npartitions of input.
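
A small illustration of what npartitions controls: it is simply the number of underlying pandas pieces the dask dataframe is split into, and many operations run once per piece (the toy dataframe below is only for demonstration):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'x': range(10)})
    df = dd.from_pandas(pdf, npartitions=4)

    print(df.npartitions)                    # 4 pandas pieces under the hood
    print(df.map_partitions(len).compute())  # rows held by each piece
    df2 = df.repartition(npartitions=2)      # fewer, larger pieces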

Repeated task execution using the distributed Dask scheduler

Submitted by 痴心易碎 on 2019-12-14 01:22:26
Question: I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute(). When the number of tasks is, say, 20 (a number much greater than the number of workers) and each task takes at least 15 seconds, the scheduler starts rerunning some of the tasks (or executes them in parallel more than once). This is a problem because the tasks modify a SQL db, and if they run again they end up raising an Exception (due to DB uniqueness constraints). I'm
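
One mitigation sometimes suggested for long-running tasks with side effects, offered here only as a hedged sketch and not a guaranteed fix, is to disable work stealing (so a task is not speculatively moved between workers) and to mark the tasks as impure:

    import dask
    from dask.distributed import Client

    # Set the config before the scheduler is created so it takes effect.
    dask.config.set({'distributed.scheduler.work-stealing': False})
    client = Client()

    def write_row(i):
        pass   # the real task would insert into the SQL db here

    # pure=False tells Dask these calls must never be treated as cacheable.
    tasks = [dask.delayed(write_row, pure=False)(i) for i in range(20)]
    dask.compute(*tasks)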

Getting year and week from a datetime series in a dask dataframe?

Submitted by 旧街凉风 on 2019-12-14 00:34:57
Question: If I have a Pandas dataframe with a column that is a datetime type, I can get the year as follows:

    df['year'] = df['date'].dt.year

With a dask dataframe, that does not work. If I compute first, like this:

    df['year'] = df['date'].compute().dt.year

I get:

    ValueError: Not all divisions are known, can't align partitions. Please use set_index or set_partition to set the index.

But if I do:

    df['date'].head().dt.year

it works fine! So how do I get the year (or week) of a datetime series in a dask
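
In practice the .dt accessor works lazily on a dask series, so no compute() is needed before extracting the year; a minimal sketch with a toy dataframe:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'date': pd.date_range('2015-01-01', periods=10, freq='D')})
    df = dd.from_pandas(pdf, npartitions=2)

    df['year'] = df['date'].dt.year   # lazy, no compute() required
    # week-of-year extraction is analogous via the same .dt accessor
    print(df.head())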

column with dates into datetime index in dask

Submitted by 六月ゝ 毕业季﹏ on 2019-12-13 17:33:18
Question: I have a dask dataframe for which I want to convert a column with dates into a datetime index:

    pd.DatetimeIndex(df_dask_dataframe['name_col'])

However, I get a not-implemented error. Is there a workaround?

Answer 1: I think you need dask.dataframe.DataFrame.set_index if the dtype of the column is datetime64:

    df_dask_dataframe = df_dask_dataframe.set_index('name_col')

Source: https://stackoverflow.com/questions/44992508/column-with-dates-into-datetime-index-in-dask
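
A small sketch building on the answer above, assuming the column may hold date strings rather than datetime64 values, so it is converted with dd.to_datetime before being set as the index:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'name_col': ['2017-07-01', '2017-07-02', '2017-07-03'],
                        'x': [1, 2, 3]})
    df = dd.from_pandas(pdf, npartitions=1)

    df['name_col'] = dd.to_datetime(df['name_col'])  # ensure datetime64 dtype
    df = df.set_index('name_col')                    # column becomes the index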

Understanding the process of loading multiple file contents into Dask Array and how it scales

Submitted by 浪尽此生 on 2019-12-13 17:12:22
Question: Using the example at http://dask.pydata.org/en/latest/array-creation.html:

    filenames = sorted(glob('2015-*-*.hdf5'))
    dsets = [h5py.File(fn)['/data'] for fn in filenames]
    arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
    x = da.concatenate(arrays, axis=0)  # Concatenate arrays along first axis

I'm having trouble understanding the next line and whether it's a dask array of "dask arrays" or a "normal" np array which points to as many dask arrays as there were datasets in all the
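
For what it's worth, da.concatenate returns a single lazy dask array whose chunks point back at the individual sources; the sketch below uses in-memory numpy arrays in place of the h5py datasets purely to illustrate that:

    import numpy as np
    import dask.array as da

    # Stand-ins for the three HDF5 datasets; nothing is loaded until compute().
    dsets = [np.random.random((2000, 1000)) for _ in range(3)]
    arrays = [da.from_array(d, chunks=(1000, 1000)) for d in dsets]
    x = da.concatenate(arrays, axis=0)

    print(type(x))    # dask.array.core.Array, not a numpy array
    print(x.shape)    # (6000, 1000) -- one logical array
    print(x.chunks)   # chunk layout spanning all three sources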

Dask DataFrame: Resample over groupby object with multiple rows

Submitted by 痴心易碎 on 2019-12-13 12:11:59
Question: I have the following dask dataframe created from Castra:

    import dask.dataframe as dd
    df = dd.from_castra('data.castra', columns=['user_id', 'ts', 'text'])

Yielding:

                          user_id / ts / text
    ts
    2015-08-08 01:10:00   9235   2015-08-08 01:10:00   a
    2015-08-08 02:20:00   2353   2015-08-08 02:20:00   b
    2015-08-08 02:20:00   9235   2015-08-08 02:20:00   c
    2015-08-08 04:10:00   9235   2015-08-08 04:10:00   d
    2015-08-08 08:10:00   2353   2015-08-08 08:10:00   e

What I'm trying to do is:
- Group by user_id and ts
- Resample it over a 3-hour
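
One workaround, sketched here under the assumption that missing (empty) 3-hour bins are acceptable, is to build an explicit 3-hour bucket with dt.floor and group on it, which mimics groupby + resample without needing resample on a dask groupby object:

    import pandas as pd
    import dask.dataframe as dd

    # Toy data mirroring the excerpt above.
    pdf = pd.DataFrame({
        'user_id': [9235, 2353, 9235, 9235, 2353],
        'ts': pd.to_datetime(['2015-08-08 01:10', '2015-08-08 02:20',
                              '2015-08-08 02:20', '2015-08-08 04:10',
                              '2015-08-08 08:10']),
        'text': list('abcde'),
    })
    df = dd.from_pandas(pdf, npartitions=2)

    df['ts_3h'] = df['ts'].dt.floor('3h')                       # explicit 3-hour bucket
    counts = df.groupby(['user_id', 'ts_3h'])['text'].count().compute()
    print(counts)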

Memory error with dask array

Submitted by ⅰ亾dé卋堺 on 2019-12-13 07:03:46
Question: I am implementing a Neural Network whose input and output matrices are very large, so I am using dask arrays for storing them. X is an input matrix of 32000 x 7500 and y is an output matrix of the same dimension. Below is the neural network code with 1 hidden layer:

    class Neural_Network(object):
        def __init__(self, i, j, k):
            # define hyperparameters
            self.inputLayerSize = i
            self.outputLayerSize = j
            self.hiddenLayerSize = k
            # weights
            self.W1 = da.random.normal(0.5, 0.5, size=(self.inputLayerSize, self
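
A minimal sketch, with sizes and chunking assumed rather than taken from the full code, of keeping both the data and the weights as chunked dask arrays so a forward pass stays lazy and is evaluated blockwise instead of all at once:

    import dask.array as da

    n, d, h = 32000, 7500, 100
    X  = da.random.random((n, d), chunks=(1000, d))          # row-wise chunks
    W1 = da.random.normal(0.5, 0.5, size=(d, h), chunks=(d, h))

    z2 = X.dot(W1)                  # (n, h), still lazy
    a2 = 1.0 / (1.0 + da.exp(-z2))  # sigmoid activation, also lazy
    print(a2[:5].compute().shape)   # only this small slice is materialized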