Shuffling data in dask

Submitted by 做~自己de王妃 on 2019-12-08 17:53:18

Question


This is a follow-on question to Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to a ML algorithm.

The answer in that question was to do the following:

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()

However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series, so datapoints would be highly correlated within each partition.

What I would ideally like is something along the lines of:

rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]

which works on pandas but not dask. Any thoughts?
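For what it's worth, pandas can express this batch-sampling idea directly with `sample`; here's a minimal sketch in plain pandas (note that dask's own `DataFrame.sample` only accepts a fraction via `frac`, not a row count `n`, so this exact call doesn't carry over to dask):

```python
import numpy as np
import pandas as pd

# Draw a batch of 10 rows uniformly at random, without replacement.
df = pd.DataFrame({'x': np.arange(100)})
batch = df.sample(n=10, random_state=0)
print(len(batch))  # 10
```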

Edit 1: Potential Solution

I tried doing

len_df = len(df)
train_len = int(len_df * 0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]

However, if I try doing train_df.loc[:5, :].compute(), this returns a 124,451-row dataframe, so I'm clearly using dask wrong.
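The surprise comes from label-based indexing. A plain-pandas illustration (not dask, but the same `.loc` semantics apply): `.loc` slices by label, not position, so on a permuted integer index `loc[:5]` returns every row from the start up to wherever the label 5 happens to sit.

```python
import numpy as np
import pandas as pd

# A frame whose integer index is shuffled, like one built from a
# random permutation of row labels.
df = pd.DataFrame({'x': range(10)}, index=np.random.permutation(10))
print(len(df.loc[:5]))   # anywhere from 1 to 10, depending on the shuffle
print(len(df.iloc[:5]))  # always 5: iloc is positional
```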


Answer 1:


I recommend adding a column of random data to your dataframe and then using that to set the index:

df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
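The answer leaves add_random_column_to_pandas_dataframe unspecified; one plausible sketch (the column name '_shuffle_key' is just a placeholder):

```python
import numpy as np
import pandas as pd

def add_random_column_to_pandas_dataframe(part, col='_shuffle_key'):
    # Assign each row a uniform random key. Calling set_index on this
    # column afterwards sorts rows by the key across partitions,
    # which amounts to a global shuffle.
    part = part.copy()
    part[col] = np.random.random(len(part))
    return part
```

After `df.set_index('_shuffle_key')`, rows end up in random order; you can `reset_index(drop=True)` afterwards if the key is no longer needed.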



Answer 2:


If you're trying to separate your dataframe into training and testing subsets, that is exactly what sklearn.model_selection.train_test_split does, and it works with pandas.DataFrame (see its documentation for an example).
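For reference, the core of train_test_split can be sketched in a few lines of numpy (a simplified stand-in with a hypothetical name, not sklearn's actual implementation; no stratification or input validation):

```python
import numpy as np

def simple_train_test_split(data, test_size=0.2, seed=None):
    # Shuffle row indices, then carve off the first test_size
    # fraction as the test set and keep the rest for training.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_test = int(round(len(data) * test_size))
    return data[idx[n_test:]], data[idx[:n_test]]
```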

For your case of using it with dask, you may be interested in the dklearn library, which seems to implement this function.

To do that, we can use the train_test_split function, which mirrors the scikit-learn function of the same name. We'll hold back 20% of the rows:

from dklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

More information here.

Note: I did not perform any tests with dklearn; this is just something I came across, but I hope it helps.


EDIT: what about dask.DataFrame.random_split?

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
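Under the hood the idea is simple; here is a rough pandas sketch of what random_split does (a simplification for illustration, not dask's actual code):

```python
import numpy as np
import pandas as pd

def sketch_random_split(df, fracs, random_state=None):
    # Draw one uniform number per row, then bucket rows by the
    # cumulative fractions; rows keep their original order within
    # each output, so this splits but does not shuffle.
    rng = np.random.default_rng(random_state)
    draws = rng.random(len(df))
    edges = np.cumsum(fracs)
    lo = 0.0
    parts = []
    for hi in edges:
        parts.append(df[(draws >= lo) & (draws < hi)])
        lo = hi
    return parts
```

Because each row is assigned independently, the split sizes are only approximately proportional to the requested fractions.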

Its use in ML applications is illustrated here.



Source: https://stackoverflow.com/questions/46842155/shuffling-data-in-dask
