Question
Probably a continuation of this question, working from the dask docs examples for map_partitions.
import pandas as pd
import dask.dataframe as dd
from random import randint

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

def myadd(df):
    new_value = df.x + randint(1,4)
    return new_value

res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res
In the above code, randint is only being called once, not once per row as I would expect. How come?
Output:
x  y  z
1  1  4
2  2  5
3  3  6
4  4  7
5  5  8
Answer 1:
If you performed the same operation (df.x + randint(1,4)) on the original pandas dataframe, you would get only one random number, added to every value in the column. Dask is doing exactly the same as in the pandas case, except that the function is called once for each partition; this is what map_partitions does.
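To see the pandas behaviour in isolation, here is a minimal sketch using the same dataframe as the question. The names offset, shifted and distinct_offsets are only illustrative and do not appear in the original post.

```python
import random

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})

# randint runs exactly once and returns a single Python int ...
offset = random.randint(1, 4)

# ... which pandas then broadcasts across the whole column.
shifted = df.x + offset

# Every row was shifted by the same amount, so only one distinct offset remains.
distinct_offsets = (shifted - df.x).unique()
print(distinct_offsets)
```

Writing df.x + randint(1,4) directly is equivalent: Python evaluates randint first, so pandas only ever sees one scalar.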
If you wanted a new random number for every row, you should first think of how you would achieve this with pandas. I can immediately think of two:
df.x.map(lambda x: x + random.randint(1, 4))
or
df.x + np.random.randint(1, 5, size=len(df.x))  # NumPy's upper bound is exclusive, so 5 gives 1..4 like random.randint(1, 4)
If you replace your new_value = line with one of these, it will work as expected.
Source: https://stackoverflow.com/questions/51602248/python-dask-map-partitions