How to use pandas.cut() (or equivalent) in dask efficiently?

Submitted by 我的未来我决定 on 2019-12-06 23:10:06

Question


Is there an equivalent to pandas.cut() in Dask?

I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX and positionY and bin it into energy classes.

So far I have been able to do this with pandas, but I would like to run it in parallel, so I am trying to use dask.

The groupby method works very well, but unfortunately I run into difficulties when trying to bin the data in energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (essentially turning it into non-parallel code). Is there an equivalent to pandas.cut() in dask, or is there another (elegant) way to achieve the same functionality?

import dask.dataframe
import pandas

# Create a dask dataframe from the array
dd = dask.dataframe.from_array(mainArray, chunksize=100000, columns=('posX', 'posY', 'time', 'energy'))

# Set the bins to bin along energy
bins = range(0, 10000, 500)

# Create the cut in energy (using non-parallel pandas code...)
energyBinner = pandas.cut(dd['energy'], bins)

# Group the data according to posX, posY and energy
grouped = dd.compute().groupby([energyBinner, 'posX', 'posY'])

# Apply the count() method to the data:
numberOfEvents = grouped['time'].count()

Thanks a lot!


Answer 1:


You should be able to do dd['energy'].map_partitions(pd.cut, bins).



Source: https://stackoverflow.com/questions/42442043/how-to-use-pandas-cut-or-equivalent-in-dask-efficiently
