ValueError: Not all divisions are known, can't align partitions error on dask dataframe

Posted by 自作多情 on 2019-12-08 21:45:48

Question


I have a pandas dataframe with the following columns:

user_id user_agent_id requests

All columns contain integers. I want to perform some operations on them and run those operations with a dask dataframe. This is what I do:

user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
    .groupby(['user_id', 'user_agent_id']) \
    .size().to_frame(name='appearances') \
    .reset_index() # I am not sure I can run this on dask dataframe

# df is presumably dask.dataframe here (e.g. import dask.dataframe as df)
user_profile_ddf = df.from_pandas(user_profile, npartitions=4)
user_profile_ddf['percent'] = user_profile_ddf.groupby('user_id')['appearances'] \
    .apply(lambda x: x / x.sum(), meta=float)  # percentage of appearances within each user group

But I get the following error:

raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

Am I doing something wrong? In pure pandas it works fine, but it gets slow for many rows (although they fit in memory), so I want to parallelize the computation.
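
For reference, the pure pandas version of the computation can be sketched as below. This is only an illustration on made-up sample data (the values are hypothetical; only the column names come from the question). It also swaps the lambda-based apply for groupby(...).transform('sum'), an alternative way of expressing the same per-user percentage that keeps the division vectorized.

```python
import pandas as pd

# Hypothetical sample data; only the column names are taken from the question.
cache_records_dataframe = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2],
    "user_agent_id": [10, 10, 11, 10, 10],
    "requests":      [5, 3, 2, 7, 1],
})

# Count how often each (user_id, user_agent_id) pair appears.
user_profile = (
    cache_records_dataframe[["user_id", "user_agent_id", "requests"]]
    .groupby(["user_id", "user_agent_id"])
    .size()
    .to_frame(name="appearances")
    .reset_index()
)

# transform("sum") broadcasts each user's total back onto that user's rows,
# so the result lines up with user_profile row by row.
totals = user_profile.groupby("user_id")["appearances"].transform("sum")
user_profile["percent"] = user_profile["appearances"] / totals

print(user_profile)
```

Within each user_id the percent values should add up to 1, which is a quick sanity check to run against any parallelized version of the same calculation.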


Answer 1:


Call reset_index() on the dask dataframe right after creating it:

user_profile_ddf = df.from_pandas(user_profile, npartitions=4).reset_index()


Source: https://stackoverflow.com/questions/45030651/valueerror-not-all-divisions-are-known-cant-align-partitions-error-on-dask-da
