问题
This is a follow on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to a ML algorithm.
The answer in that question was to do the following:
for part in df.repartition(npartitions=100).to_delayed():
batch = part.compute()
However, even if I was to shuffle the contents of batch I'm a bit worried that it might not be ideal. The data is a time series set so datapoints would be highly correlated within each partition.
What I would ideally like is something along the lines of:
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]
which would work on pandas but not dask. Any thoughts?
Edit 1: Potential Solution
I tried doing
train_len = int(len_df*0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]
However, if I try doing train_df.loc[:5,:].compute() this return a 124451 row dataframe. So clearly using dask wrong.
回答1:
I recommend adding a column of random data to your dataframe and then using that to set the index:
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
回答2:
If you're trying to separate your dataframe into training and testing subsets, it is what does sklearn.model_selection.train_test_split and it works with pandas.DataFrame. (Go there for an example)
And for your case of using it with dask, you may be interested by the library dklearn, that seems to implements this function.
To do that, we can use the train_test_split function, which mirrors the scikit-learn function of the same name. We'll hold back 20% of the rows:
from dklearn.cross_validation import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
More information here.
Note: I did not perform any test with dklearn, this is just a thing I came across, but I hope it can help.
EDIT: what about dask.DataFrame.random_split?
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5])80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
Use for ML applications is illustrated here
来源:https://stackoverflow.com/questions/46842155/shuffling-data-in-dask