Python Dask map_partitions

Submitted by 天涯浪子 on 2019-12-11 16:18:12

Question


Probably a continuation of this question, working from the dask docs examples for map_partitions.

import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

from random import randint

def myadd(df):
    new_value = df.x + randint(1,4)
    return new_value

res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res

In the above code, randint is only being called once, not once per row as I would expect. How come?

Output:

X  Y  Z
1  1  4
2  2  5
3  3  6
4  4  7
5  5  8


Answer 1:


If you performed the same operation (df.x + randint(1,4)) on the original pandas dataframe, you would get only one random number, added to every value in the column. Dask does exactly the same as pandas here, except that the function is called once for each partition; this is what map_partitions does.
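To see the pandas behaviour described above in isolation, here is a minimal sketch using the same toy dataframe as the question. The names offset, shifted and differences are made up for illustration, and the seed is only there to make the run repeatable.

```python
import random

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1., 2., 3., 4., 5.]})

random.seed(0)  # seeded only so the illustration is repeatable

offset = random.randint(1, 4)  # evaluated once, yields a single int
shifted = df.x + offset        # pandas broadcasts that one int to every row

# Every row moved by the same amount, so only one distinct difference remains.
differences = (shifted - df.x).unique()
print(differences)
```

Because randint returns a plain Python int before the addition happens, pandas never gets a chance to call it per row; the same reasoning applies to each partition that map_partitions hands to the function.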

If you wanted a new random number for every row, you should first think of how you would achieve this with pandas. I can immediately think of two:

df.x.map(lambda x: x + random.randint(1, 4))

or

df.x + np.random.randint(1, 5, size=len(df.x))  # NumPy's upper bound is exclusive, so 5 gives 1..4

If you replace your new_value = line with one of these, it will work as expected (the first needs import random, the second needs import numpy as np).



Source: https://stackoverflow.com/questions/51602248/python-dask-map-partitions
