Often, when working with a large dask.DataFrame, it would be useful to grab only a few rows on which to test all subsequent operations.
Currently, according to Slicing a Dask Dataframe, this is unsupported.
- I was hoping to use head to achieve the same (since that command is supported), but it returns a regular pandas DataFrame.
- I also tried df[:1000], which executes, but produces output different from what you'd expect from pandas.
Is there any way to grab the first 1000 rows from a dask.DataFrame?
If your dataframe has a sensibly partitioned index, then I recommend using .loc:
small = big.loc['2000':'2005']
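With a datetime index and known divisions, only the relevant partitions get read. A minimal sketch using dask's built-in demo timeseries (the date range and frequency here are illustrative assumptions):

import dask
# timeseries() builds a demo dask DataFrame with a datetime index
# partitioned by day, so .loc can prune partitions using the divisions
big = dask.datasets.timeseries(start='2000-01-01', end='2000-12-31', freq='1h')
small = big.loc['2000-01-02':'2000-01-05']  # still a lazy dask DataFrame
print(small.compute().shape)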
If you want to maintain the same number of partitions, you might consider sample:
small = big.sample(frac=0.01)
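This keeps the result lazy and leaves the partition count unchanged; note that dask's sample works by fraction rather than by row count. A short usage sketch (random_state is optional):

small = big.sample(frac=0.01, random_state=42)  # ~1% of rows, sampled per partition
print(small.npartitions == big.npartitions)     # True: partition count is preserved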
If you just want a single partition, you might try get_partition:
small = big.get_partition(0)
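The result is still a lazy, single-partition dask DataFrame, so you'd call compute() to materialize it. A small usage sketch:

first_part = big.get_partition(0)  # lazy dask DataFrame holding only partition 0
pdf = first_part.compute()         # materialize that partition as a pandas DataFrame
print(len(pdf))                    # row count depends on how big was partitioned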
You can also always use to_delayed and from_delayed to build your own custom solution: http://dask.pydata.org/en/latest/dataframe-create.html#dask-delayed
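For instance, to keep the first 1000 rows as a dask DataFrame rather than a pandas one, something along these lines should work (a sketch, assuming the first partition holds at least 1000 rows; the lambda and meta handling are illustrative):

import dask.dataframe as dd
from dask import delayed

parts = big.to_delayed()                          # one Delayed object per partition
first_1000 = delayed(lambda df: df.head(1000))(parts[0])
# meta (an empty pandas DataFrame with the right schema) tells dask the output's structure
small = dd.from_delayed([first_1000], meta=big.head(0))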
More generally, Dask.dataframe doesn't keep row counts per partition, so the specific question of "give me 1000 rows" ends up being surprisingly hard to answer. It's a lot easier to answer questions like "give me all the data in January" or "give me the first partition".
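If you do need row counts, you can always compute them explicitly; one common idiom is map_partitions with len (note this triggers a real computation over the data):

counts = big.map_partitions(len).compute()  # one len() per partition, as a pandas Series
print(counts.tolist())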
You may repartition your initial DataFrame into an arbitrary number of partitions. If you want slices of roughly 1000 rows:
npart = round(len(df)/1000)
parted_df = df.repartition(npartitions=npart)
Then just select the partition you want:
first_1000_rows = parted_df.partitions[0]
Note that unless the number of rows in your initial DataFrame is a multiple of 1000, you won't get exactly 1000 rows per partition.
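Putting it together, a minimal end-to-end sketch (the max(1, ...) guard against very small frames is an added assumption, and len(df) triggers a full count, so this step isn't free):

npart = max(1, round(len(df) / 1000))      # len() computes the total row count
parted_df = df.repartition(npartitions=npart)
first_1000_rows = parted_df.partitions[0]  # still lazy; roughly 1000 rows
print(len(first_1000_rows))                # materializes just this partition's length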
Source: https://stackoverflow.com/questions/49139371/slicing-out-a-few-rows-from-a-dask-dataframe