Slicing out a few rows from a `dask.DataFrame`

て烟熏妆下的殇ゞ submitted on 2019-12-06 10:51:14

If your dataframe has a sensibly partitioned index, then I recommend using .loc:

small = big.loc['2000':'2005']

If you want to maintain the same number of partitions, you might consider sample:

small = big.sample(frac=0.01)

If you just want a single partition, you might try get_partition:

small = big.get_partition(0)

You can also always use to_delayed and from_delayed to build your own custom solution: http://dask.pydata.org/en/latest/dataframe-create.html#dask-delayed

More generally, dask.dataframe doesn't keep row counts per partition, so the specific question of "give me 1000 rows" ends up being surprisingly hard to answer. It's a lot easier to answer questions like "give me all the data in January" or "give me the first partition".

You may repartition your initial DataFrame into an arbitrary number of partitions. If you want slices of about 1000 rows:

# len(df) computes the total row count; max() avoids asking for 0 partitions
npart = max(1, round(len(df) / 1000))
parted_df = df.repartition(npartitions=npart)

Then just select the partition you want:

first_1000_rows = parted_df.partitions[0]

Note that the partitions hold roughly 1000 rows, not exactly 1000: the number of rows in your initial DataFrame is rarely a multiple of 1000, and repartition does not guarantee evenly sized partitions.
