dask DataFrame equivalent of pandas DataFrame sort_values

Anonymous (unverified), submitted 2019-12-03 08:59:04

Question:

What would be the equivalent of pandas' sort_values for a dask DataFrame? I am trying to scale some pandas code that runs into memory issues by using a dask DataFrame instead.

Would the equivalent be:

ddf.set_index([col1, col2], sorted=True)

?

Answer 1:

Sorting in parallel is hard. You have two options in dask.dataframe:

set_index

As of now, you can call set_index with a single column index:

    In [1]: import pandas as pd

    In [2]: import dask.dataframe as dd

    In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})

    In [4]: ddf = dd.from_pandas(df, npartitions=2)

    In [5]: ddf.set_index('x').compute()
    Out[5]:
       y
    x
    1  c
    2  b
    3  a

Unfortunately, dask.dataframe does not (as of November 2016) support multi-column indexes:

    In [6]: ddf.set_index(['x', 'y']).compute()
    NotImplementedError: Dask dataframe does not yet support multi-indexes.
    You tried to index with this index: ['x', 'y']
    Indexes must be single columns only.
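To illustrate why a sorted index is expensive on partitioned data, here is a minimal pure-Python sketch of range partitioning: pick division boundaries, route each row to the partition that owns its key range, then sort each partition on its own. This is a simplified illustration and not dask's actual implementation; the function name range_partition_sort and the way divisions are chosen are assumptions made for the example.

```python
from bisect import bisect_right


def range_partition_sort(rows, key, npartitions):
    """Sort rows by `key` with a simplified range-partitioning scheme.

    Hypothetical helper for illustration only; not part of dask.
    """
    if not rows:
        return []
    # Choose division boundaries from the keys. For simplicity this sorts
    # all keys; a real system would estimate boundaries from a sample.
    keys = sorted(row[key] for row in rows)
    step = max(1, len(keys) // npartitions)
    divisions = keys[step::step][: npartitions - 1]

    # Shuffle step: send every row to the partition covering its key range.
    partitions = [[] for _ in range(len(divisions) + 1)]
    for row in rows:
        partitions[bisect_right(divisions, row[key])].append(row)

    # Each partition can now be sorted independently of the others.
    for part in partitions:
        part.sort(key=lambda r: r[key])
    return partitions


rows = [{'x': 3, 'y': 'a'}, {'x': 2, 'y': 'b'}, {'x': 1, 'y': 'c'}]
partitions = range_partition_sort(rows, 'x', npartitions=2)
flattened = [row for part in partitions for row in part]
```

The shuffle step is the costly part in a distributed setting, because rows have to move between partitions before any local sort can happen.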

nlargest

Given how you phrased your question, I suspect that this doesn't apply to you, but cases that use sorting can often get by with the much cheaper solution nlargest.

    In [7]: ddf.x.nlargest(2).compute()
    Out[7]:
    0    3
    1    2
    Name: x, dtype: int64

    In [8]: ddf.nlargest(2, 'x').compute()
    Out[8]:
       x  y
    0  3  a
    1  2  b
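The reason a top-n query can be much cheaper than a full sort is that each partition only needs to report its own n best candidates, and no rows have to be shuffled between partitions. The sketch below shows that idea with the standard library's heapq; it is an illustration under that assumption, not dask's internal code, and partitioned_nlargest is a hypothetical name.

```python
import heapq


def partitioned_nlargest(partitions, n):
    """Return the n largest values across several partitions.

    Hypothetical helper for illustration only; not part of dask.
    """
    # Step 1: every partition keeps only its own n largest values.
    candidates = [heapq.nlargest(n, part) for part in partitions]
    # Step 2: merge the small candidate lists and take the overall n largest.
    return heapq.nlargest(n, (value for part in candidates for value in part))


# The 'x' column from the example above, split across two partitions.
partitions = [[3, 2], [1]]
top_two = partitioned_nlargest(partitions, 2)
```

Only n values per partition reach the final merge, so the work grows with the number of partitions rather than with the size of the whole dataset.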

