I want to select rows from a dask dataframe based on a list of indices. How can I do that?
Example: Let\'s say, I have the following dask dataframe. <
Edit: dask now supports loc on lists:
ddf_selected = ddf.loc[indices_i_want_to_select]
The following should still work, but is not necessary anymore:
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)
#list of indices I want to select
l = ['i1', 4, 5]
#generate new dask dataframe containing only the specified indices
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
Using dask version '1.2.0' results with an error due to the mixed index type.
in any case there is an option to use loc.
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions = 2,)
# #list of indices I want to select
l = ['i1', '4', '5']
# #generate new dask dataframe containing only the specified indices
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()