On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

随声附和, submitted 2019-12-01 17:44:59

Dask does some checking on what you have told it to do before it runs it on the entire collection of partitions. That is where the first few print statements come from. It's part of the built-in error checking that keeps Dask from working through a long series of operations only to fail at the end.

@Grr's answer is correct. Dask.dataframe doesn't know what your function will produce, but it still has to give you a lazy dask.dataframe with the correct types, dtypes, etc., so it tries your function on a small amount of data.

You can avoid these checks by providing metadata about your intended output using the meta= keyword (more details in the DataFrame.apply docstring). If you provide this information then Dask.dataframe will not need to try your function to determine types.

Copying this section here:

Docstring

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
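As a minimal sketch of what "an empty pd.DataFrame that matches the dtypes and column names" can look like, the snippet below builds one with plain pandas, along with the equivalent {name: dtype} dict form the docstring mentions. The frame `example` and its columns are hypothetical stand-ins for whatever your function returns.

```python
import pandas as pd

# Hypothetical example of the output your function produces.
example = pd.DataFrame({"A": [1], "B": [2.5], "C": ["x"]})

# Slicing off every row keeps the column names and dtypes but no data.
empty_meta = example.iloc[:0]

# The same structure expressed as a {name: dtype} dict.
meta_dict = {name: dtype for name, dtype in example.dtypes.items()}

print(empty_meta.dtypes)
print(meta_dict)
```

Either object describes the same structure, so per the docstring either one should be usable as the meta= argument.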

Solution

So if you pass an example of the output as a dataframe (only its column names and dtypes matter, not its values), you'll be fine:

meta = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]}, 
                    columns=['A', 'B', 'C'])
ddf.apply(aggregate, axis=1, meta=meta)

Or, in this case, because your function doesn't change the columns or dtypes of the input, you can just reuse the input's meta, which is stored on the private _meta attribute:

ddf.apply(aggregate, axis=1, meta=ddf._meta)