Question
Often, when working with a large dask.DataFrame, it would be useful to grab only a few rows on which to test all subsequent operations. Currently, according to Slicing a Dask Dataframe, this is unsupported.
- I was hoping to use head to achieve the same (since that command is supported), but that returns a regular pandas DataFrame.
- I also tried df[:1000], which executes, but generates an output different from what you'd expect from pandas.
Is there any way to grab the first 1000 rows from a dask.DataFrame?
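For illustration, a minimal sketch of the behavior described above (the frame and column names are invented for the example):
import dask.dataframe as dd
import pandas as pd

# Build a small dask.DataFrame from pandas; 4 partitions of 1000 rows each.
pdf = pd.DataFrame({"x": range(4000)})
df = dd.from_pandas(pdf, npartitions=4)

first = df.head(1000)   # computes immediately...
print(type(first))      # ...and returns a regular pandas.DataFrame, as described above
# head() also accepts compute=False, which returns a lazy dask object instead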
Answer 1:
If your dataframe has a sensibly partitioned index, then I recommend using .loc:
small = big.loc['2000':'2005']
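As a minimal sketch, assuming a DataFrame with a sorted DatetimeIndex (the data here is invented for the example):
import dask.dataframe as dd
import pandas as pd

# Quarterly timestamps from 2000 through 2007, used as a sorted index.
pdf = pd.DataFrame(
    {"value": range(32)},
    index=pd.date_range("2000-01-01", periods=32, freq="QS"),
)
big = dd.from_pandas(pdf, npartitions=4)

small = big.loc["2000":"2005"]  # label-based slice; cheap because the divisions are known
print(small.compute().shape)    # (24, 1): six years of quarterly rows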
If you want to maintain the same number of partitions, you might consider sample:
small = big.sample(frac=0.01)
If you just want a single partition, you might try get_partition:
small = big.get_partition(0)
You can also always use to_delayed and from_delayed to build your own custom solution: http://dask.pydata.org/en/latest/dataframe-create.html#dask-delayed
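A minimal sketch of that route, keeping only the first partition (the setup frame is invented for the example):
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(4000)})
big = dd.from_pandas(pdf, npartitions=4)

parts = big.to_delayed()                            # one dask.delayed object per partition
small = dd.from_delayed(parts[:1], meta=big._meta)  # rebuild a dask.DataFrame from the first one
                                                    # (_meta is dask's empty metadata frame)
print(len(small))                                   # 1000 here, since each partition holds 1000 rows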
More generally, Dask.dataframe doesn't keep row-counts per partition, so the specific question of "give me 1000 rows" ends up being surprisingly hard to answer. It's a lot easier to answer questions like "give me all the data in January" or "give me the first partition".
Answer 2:
You may repartition your initial DataFrame into an arbitrary number of partitions. If you want slices of 1000 rows:
npart = round(len(df)/1000)
parted_df = df.repartition(npartitions=npart)
Then just select the partition you want:
first_1000_rows = parted_df.partitions[0]
Note that unless the number of rows in your initial DataFrame is a multiple of 1000, you won't get exactly 1000 rows.
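Put together, a runnable sketch of this approach (the input frame is invented, and exact partition sizes depend on how repartition splits the divisions):
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(10000)})
df = dd.from_pandas(pdf, npartitions=3)

npart = round(len(df) / 1000)                  # len() computes the total row count
parted_df = df.repartition(npartitions=npart)  # 10 partitions of roughly 1000 rows each
first_1000_rows = parted_df.partitions[0]      # still lazy; .compute() yields pandas
print(len(first_1000_rows))                    # close to 1000, but not guaranteed exact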
Source: https://stackoverflow.com/questions/49139371/slicing-out-a-few-rows-from-a-dask-dataframe