Can a dask dataframe with an unordered index cause silent errors?

Submitted by 谁说胖子不能爱 on 2019-12-23 15:35:05

Question


Methods around dask.DataFrame all seem to make sure that the index column is sorted. However, by using from_delayed, it is possible to construct a dask dataframe whose index column is not sorted:

import pandas as pd
import dask.dataframe as dd
from dask import delayed
pdf1 = delayed(pd.DataFrame(dict(A=[1, 2, 3], B=[1, 1, 1])).set_index('A'))
pdf2 = delayed(pd.DataFrame(dict(A=[1, 2, 3], B=[1, 1, 1])).set_index('A'))
ddf = dd.from_delayed([pdf1, pdf2])  # dask.DataFrame with unordered index (1, 2, 3, 1, 2, 3)

The combination [index is set, index is not sorted, divisions are unknown] is something that I have never seen among dataframes that dask created itself. So my questions are:

  • Is dask tested to work well with dataframes like this?
  • Might it even be that calculations on such dataframes give wrong results silently, e.g. because they assume the index to be sorted or are performed on an incomplete subset of data?
  • Or, more generally: if the index column is not sorted, does that only slow down access by index, or does it break functionality?

Answer 1:


Many dask.dataframe operations will refuse to operate or will operate with slower algorithms on dataframes without known divisions. See http://dask.pydata.org/en/latest/dataframe-design.html#partitions

For example, df.loc is fast if dask.dataframe knows that the index is sorted and knows the min/max of each partition. If this information is not known, df.loc has to search all of the partitions exhaustively.
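To make the difference concrete, here is a toy model in plain Python. It is not dask's implementation: the names partitions, divisions, lookup_with_divisions and lookup_exhaustive are hypothetical and only illustrate why known, sorted partition boundaries let a lookup touch one partition instead of all of them.

```python
from bisect import bisect_right

# Hypothetical toy model: each partition maps index value -> row.
sorted_partitions = [{1: "a", 2: "b", 3: "c"}, {4: "d", 5: "e", 6: "f"}]
# Assumed layout of the boundaries: the minimum index of each partition,
# followed by the maximum index of the last partition.
divisions = (1, 4, 6)


def lookup_with_divisions(partitions, divisions, key):
    """Use the boundaries to pick the one partition that can hold `key`."""
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    # bisect finds the boundary interval; clamp so the last boundary
    # (the overall maximum) maps to the last partition.
    i = min(bisect_right(divisions, key) - 1, len(partitions) - 1)
    return partitions[i][key], 1  # (row, partitions visited)


def lookup_exhaustive(partitions, key):
    """Without boundaries, every partition has to be inspected."""
    hits = [p[key] for p in partitions if key in p]
    return hits, len(partitions)  # (rows, partitions visited)


# Same shape as the dataframe in the question: index 1, 2, 3 in both partitions.
unsorted_partitions = [{1: "a", 2: "b", 3: "c"}, {1: "x", 2: "y", 3: "z"}]

fast_row, fast_visited = lookup_with_divisions(sorted_partitions, divisions, 5)
slow_rows, slow_visited = lookup_exhaustive(unsorted_partitions, 2)
```

In this sketch the exhaustive path is slower but still returns every matching row, which is the behaviour described above for unknown divisions.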

Generally speaking dask.dataframe is aware of the possibility that you bring up and should act accordingly. Some operations will be slower. Some operations will refuse to operate.



Source: https://stackoverflow.com/questions/41268080/can-a-dask-dataframe-with-a-unordered-index-cause-silent-errors
