I'm wanting to do a frequency count on a single column of a dask dataframe. The code works, but I get an warning complaining that meta is not defined. If I try to define meta I get an error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta but I'd like to know how to do that for future reference.
Dummy dataframe and the column frequencies
import pandas as pd from dask import dataframe as dd df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'], ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'], [12, 10, 15, 23, 18, 20, 26]], index=['Column A', 'Column B', 'Column C']).T dask_df = dd.from_pandas(df) In [39]: dask_df.head() Out[39]: Column A Column B Column C 0 Sam Sam 12 1 Alex David 10 2 David David 15 3 Sarah Alice 23 4 Alice Sam 18 (dask_df.groupby('Column B') .apply(lambda group: len(group)) ).compute() UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8')) for series result warnings.warn(msg) Out[60]: Column B Alice 2 David 2 Sam 3 dtype: int64 Trying to define meta produces AttributeError
(dask_df.groupby('Column B') .apply(lambda d: len(d), meta={'Column B': 'int'})).compute() same for this
(dask_df.groupby('Column B') .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute() same if I try having the dtype be int instead of "int" or for that matter 'f8' or np.float64 so it doesn't seem like it's the dtype that is causing the problem.
The documentation on meta seems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).
What is meta? and how am I supposed to define it?
Using python 3.6 dask 0.14.3 and pandas 0.20.2