simple dask map_partitions example


There is an example in the map_partitions docs that achieves exactly what you are trying to do:

ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
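For context, here is a minimal runnable sketch of that one-liner (the small example frame and the x/y column names are just an illustration mirroring the docs example, not from the question):

import pandas as pd
import dask.dataframe as dd

# A small frame with the x and y columns the docs example expects
pdf = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Nothing is computed yet; this only builds the task graph
ddf_z = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))

print(ddf_z.compute())  # now the multiplication actually runs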

When you call map_partitions (just like when you call .apply() on a pandas.DataFrame), the function that you try to map (or apply) will be given a dataframe as its first argument.

In the case of dask.dataframe.map_partitions this first argument will be a partition; in the case of pandas.DataFrame.apply it will be the whole dataframe.

This means that your function has to accept a dataframe (partition) as its first argument and, in your case, could look like this:

def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])

Note that the assignment of a new column in this case happens (i.e. gets scheduled to happen) BEFORE you call .compute().

In your example you assign the column AFTER you call .compute(), which kind of defeats the purpose of using dask. I.e. after you call .compute() the results of that operation are loaded into memory if there is enough space for them (if not, you just get a MemoryError).

So for your example to work, you could:

1) Use a function (with the column names as arguments):

def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])


ddf_out = ddf.map_partitions(test_f, 'col_1', 'col_2')

# Here is a good place to do something with the BIG ddf_out dataframe before calling .compute()

result = ddf_out.compute()  # Will load the whole dataframe into memory
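As a sketch of that "do something before .compute()" step (the filter and mean here are arbitrary illustrations, not from the original answer), reducing the data while it is still lazy means only a small result gets materialized:

# Keep working lazily: filter and aggregate before materializing,
# so only the reduced result ends up in memory
mean_result = ddf_out[ddf_out['result'] > 0]['result'].mean()
print(mean_result.compute())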

2) Use a lambda (with the column names hardcoded in the function):

ddf_out = ddf.map_partitions(lambda df: df.assign(result=df.col_1 * df.col_2))

# Here is a good place to do something with the BIG ddf_out dataframe before calling .compute()

result = ddf_out.compute()  # Will load the whole dataframe into memory
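One detail worth knowing for either variant: dask normally infers the output's structure by running your function on a tiny dummy sample, and you can skip that inference by passing meta to map_partitions. A sketch, assuming the test_f and ddf from above and int64 columns:

import pandas as pd

# meta is an empty frame declaring the output's columns and dtypes,
# so dask does not have to infer them from a dummy sample
meta = pd.DataFrame({'col_1': pd.Series(dtype='int64'),
                     'col_2': pd.Series(dtype='int64'),
                     'result': pd.Series(dtype='int64')})
ddf_out = ddf.map_partitions(test_f, 'col_1', 'col_2', meta=meta)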

Update:

To apply a function on a row-by-row basis, here is a quote from the post you linked:

map / apply

You can map a function row-wise across a series with map

df.mycolumn.map(func)

You can map a function row-wise across a dataframe with apply

df.apply(func, axis=1)

I.e. for the example function in your question, it might look like this:

def test_f(dds, col_1, col_2):
    return dds[col_1] * dds[col_2]

Since you will be applying it on a row-by-row basis, the function's first argument will be a series (i.e. each row of a dataframe is a series).

To apply this function, you might then call it like this:

dds_out = ddf.apply(
    test_f,
    args=('col_1', 'col_2'),
    axis=1,
    meta=('result', int)
).compute()

This will return a series named 'result'.
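If you want that series attached back to the dataframe as a column rather than computed on its own, here is a small sketch (the assign call is my own illustration, not from the original answer):

# Attach the row-wise result as a new column; still lazy until .compute()
ddf = ddf.assign(result=ddf.apply(test_f, args=('col_1', 'col_2'),
                                  axis=1, meta=('result', int)))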

I guess you could also call .apply on each partition with a function, but it does not look to be any more efficient than calling .apply on the dataframe directly. But maybe your tests will prove otherwise.
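For completeness, the map route from the quoted post, applied to a single column (a sketch; the doubling lambda is an arbitrary stand-in, and meta names the output's name and dtype):

# Row-wise map over one column; meta declares the result's name and dtype
doubled = ddf['col_1'].map(lambda x: x * 2, meta=('col_1', 'int64'))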

Your test_f takes two arguments, col_1 and col_2, but you pass a single argument, ddf.

Try something like

In [5]: dd.map_partitions(test_f, ddf['col_1'], ddf['col_2'])
Out[5]:
Dask Series Structure:
npartitions=8
0       int64
1250      ...
        ...
8750      ...
9999      ...
dtype: int64
Dask Name: test_f, 32 tasks
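To actually use that lazy series, a short sketch (assuming the two-argument test_f from this answer, e.g. def test_f(col_1, col_2): return col_1 * col_2):

# Partitions of the two input series are aligned automatically;
# assign the lazy result back as a column, then compute when ready
ddf['result'] = dd.map_partitions(test_f, ddf['col_1'], ddf['col_2'])
result = ddf.compute()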