How to map a column with dask


You can use the .map method, exactly as in pandas:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'x': [1, 2, 3]})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: df.x.map(lambda x: x + 1)
Out[5]: 
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]: 
0    2
1    3
2    4
Name: x, dtype: int64

Metadata

You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and 
column names of the output. This metadata is necessary for many 
algorithms in dask dataframe to work. For ease of use, some 
alternative inputs are also available. Instead of a DataFrame, 
a dict of {name: dtype} or iterable of (name, dtype) can be 
provided. Instead of a series, a tuple of (name, dtype) can be 
used. If not provided, dask will try to infer the metadata. 
This may lead to unexpected results, so providing meta is  
recommended. 

For more information, see dask.dataframe.utils.make_meta.
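As the docstring mentions, a dict of {name: dtype} can stand in for a DataFrame when the output has several columns. Here is a minimal sketch of passing meta that way to map_partitions (the add_column function and the 'y' column are only illustrative, not part of the answer above):

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
ddf = dd.from_pandas(df, npartitions=2)

def add_column(part):
    # part is one partition, i.e. a plain pandas DataFrame
    return part.assign(y=part.x + 1)

# meta as a dict of {column name: dtype} describing the output DataFrame
result = ddf.map_partitions(add_column, meta={'x': 'int64', 'y': 'int64'})
result.compute()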

So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:

>>> ddf.x.map(lambda x: x + 1, meta=('x', int))

or

>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))

This tells dask.dataframe what to expect from our function. If no meta is given, then dask.dataframe will try running your function on a little piece of data to infer it, and will raise an error asking for help if that fails.
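To make the difference concrete, here is a hedged sketch (using str purely as an example of a function whose output dtype dask has to guess): without meta, dask runs the function on a small dummy sample and may warn that meta was not provided; with meta, the output name and dtype are declared up front.

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
ddf = dd.from_pandas(df, npartitions=2)

# Without meta: dask infers the output by running str on a sample
inferred = ddf.x.map(str)

# With meta: the output name and dtype are stated explicitly
explicit = ddf.x.map(str, meta=('x', object))

print(explicit.compute())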
