Question
I have a pandas dataframe with the following columns:
user_id user_agent_id requests
All columns contain integers. I want to perform some operations on them and run them using a dask dataframe. This is what I do:
user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
.groupby(['user_id', 'user_agent_id']) \
.size().to_frame(name='appearances') \
.reset_index() # I am not sure I can run this on dask dataframe
import dask.dataframe as df  # assumption: `df` is an alias for dask.dataframe
user_profile_ddf = df.from_pandas(user_profile, npartitions=4)
user_profile_ddf['percent'] = user_profile_ddf.groupby('user_id')['appearances'] \
.apply(lambda x: x / x.sum(), meta=float) #Percentage of appearance for each user group
But I get the following error:
raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
Am I doing something wrong? In pure pandas it works fine, but it gets slow when there are many rows (even though they fit in memory), so I want to parallelize the computation.
Answer 1:
When creating the dask dataframe, add a reset_index() call:
user_profile_ddf = df.from_pandas(user_profile, npartitions=4).reset_index()
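Since the question mentions that the data fits in memory, it may also be worth trying a vectorized version in plain pandas before reaching for dask. The sketch below is not the dask fix above; it replaces the per-group apply with a lambda by a groupby transform, which is usually much faster. The sample rows are made up for illustration, and user_totals is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample data standing in for cache_records_dataframe.
cache_records_dataframe = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2],
    "user_agent_id": [10, 10, 20, 10, 30],
    "requests":      [5, 3, 7, 2, 4],
})

# Count how often each (user_id, user_agent_id) pair appears.
user_profile = (
    cache_records_dataframe
    .groupby(["user_id", "user_agent_id"])
    .size()
    .to_frame(name="appearances")
    .reset_index()
)

# Per-user total, broadcast back to every row of that user.
user_totals = user_profile.groupby("user_id")["appearances"].transform("sum")

# Share of each user agent within its user (a fraction between 0 and 1,
# the same quantity as x / x.sum() in the question).
user_profile["percent"] = user_profile["appearances"] / user_totals

print(user_profile)
```

Because transform("sum") returns a Series aligned with the original rows, the division is a single column operation and no index alignment between groups is needed.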
Source: https://stackoverflow.com/questions/45030651/valueerror-not-all-divisions-are-known-cant-align-partitions-error-on-dask-da