Sampling n=2000 from a Dask DataFrame of len 18000 generates error: Cannot take a larger sample than population when 'replace=False'

Submitted by 橙三吉。 on 2020-05-14 18:15:08

Question


I have a Dask dataframe created from a CSV file. len(daskdf) returns 18000, but when I run ddSample = daskdf.sample(2000) I get the error

ValueError: Cannot take a larger sample than population when 'replace=False'

Can I sample without replacement if the dataframe is larger than the sample size?


Answer 1:


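The sample method only supports the frac= keyword argument, so the 2000 in daskdf.sample(2000) is apparently treated as a fraction of the rows rather than a row count. See the API documentation.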

The error that you're getting is from Pandas, not Dask.

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1]})
In [3]: df.sample(frac=2000, replace=False)
ValueError: Cannot take a larger sample than population when 'replace=False'

Solution

As the Pandas error suggests, consider sampling with replacement:

In [4]: df.sample(frac=2, replace=True)
Out[4]: 
   x
0  1
0  1

In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=1)
In [7]: ddf.sample(frac=2, replace=True).compute()
Out[7]: 
   x
0  1
0  1
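
If the goal is to stay without replacement, as the question asks, one option that follows from the frac= restriction is to convert the desired row count into a fraction of the population. The sketch below is only an illustration under assumptions: sample_fraction and demo are hypothetical names, and the 18000-row frame is a stand-in for the CSV in the question, which is not shown.

```python
def sample_fraction(n_rows, population):
    """Return the fraction equivalent to drawing n_rows out of population."""
    if population <= 0:
        raise ValueError("population must be positive")
    if n_rows > population:
        raise ValueError("cannot draw more rows than exist without replacement")
    return n_rows / population


def demo():
    # Imported here so the helper above works without pandas/Dask installed.
    import pandas as pd
    import dask.dataframe as dd

    # Stand-in for the 18000-row dataframe from the question (assumption).
    pdf = pd.DataFrame({"x": range(18000)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    # 2000 / 18000 is below 1, so replace=False is allowed.
    frac = sample_fraction(2000, len(ddf))
    sampled = ddf.sample(frac=frac, replace=False, random_state=0).compute()

    # Dask samples each partition separately, so expect roughly 2000 rows.
    print(len(sampled))
```

Calling demo() runs the Dask part. Because the fraction is applied per partition, the result should be close to 2000 rows but is not guaranteed to be exactly that count for every partitioning.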


Source: https://stackoverflow.com/questions/39175963/sampling-n-2000-from-a-dask-dataframe-of-len-18000-generates-error-cannot-take
