What would be the equivalent of sort_values in pandas for a dask DataFrame? I am trying to scale some pandas code that has memory issues by using a dask DataFrame instead.
Would the equivalent be:
ddf.set_index([col1, col2], sorted=True)?
Sorting in parallel is hard. You have two options in dask.dataframe: set_index and nlargest.
As of now, you can call set_index with a single-column index:
    In [1]: import pandas as pd
    In [2]: import dask.dataframe as dd
    In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})
    In [4]: ddf = dd.from_pandas(df, npartitions=2)
    In [5]: ddf.set_index('x').compute()
    Out[5]:
       y
    x
    1  c
    2  b
    3  a

Unfortunately, dask.dataframe does not (as of November 2016) support multi-column indexes:

    In [6]: ddf.set_index(['x', 'y']).compute()
    NotImplementedError: Dask dataframe does not yet support multi-indexes.
    You tried to index with this index: ['x', 'y']
    Indexes must be single columns only.

Given how you phrased your question, I suspect that this doesn't apply to you, but cases that use sorting can often get by with the much cheaper solution, nlargest:

    In [7]: ddf.x.nlargest(2).compute()
    Out[7]:
    0    3
    1    2
    Name: x, dtype: int64

    In [8]: ddf.nlargest(2, 'x').compute()
    Out[8]:
       x  y
    0  3  a
    1  2  b