What is map_partitions doing?

爱⌒轻易说出口 提交于 2019-12-01 05:32:27

问题


The dask API says, that map_partition can be used to "apply a Python function on each DataFrame partition." From this description and according to the usual behaviour of "map", I would expect the return value of map_partitions to be (something like) a list whose length equals the number of partitions. Each element of the list should be one of the return values of the function calls.

However, with respect to the following code, I am not sure, what the return value depends on:

#generate example dataframe
pdf = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(pdf, npartitions=3)

#define helper function for map. VAL is the return value
VAL = pd.Series({'A': 1})
#VAL = pd.DataFrame({'A': [1]}) #other return values used in this example
#VAL = None
#VAL = 1
def helper(x):
    print('function called\n')
    return VAL

#check result
out = ddf.map_partitions(helper).compute()
print(len(out))
  • VAL = pd.Series({'A': 1}) causes 4 function calls (probably one to infer the dtype and 3 for the partitions) and an output with len == 3 and the type pd.Series.
  • pd.DataFrame({'A': [1]}) results in the same numbers, however the resulting type is pd.DataFrame.
  • VAL = None causes an TypeError ... why? Couldn't a possible use of map_partitions be to do something rather than to return something?
  • VAL = 1 results in only 2 function calls. The result of map_partitions is the integer 1.

Therefore, I want to ask some questions:

  1. how is the return value of map_partitions determined?
  2. What influences the number of function calls besides the number of partitions / What criteria has a function to fulfil to be called once with each partition?
  3. What should be the return value of a function, that only "does" something, i.e. a procedure?
  4. How should a function be designed, that returns arbitrary objects?

回答1:


The Dask DataFrame.map_partitions function returns a new Dask Dataframe or Series, based on the output type of the mapped function. See the API documentation for a thorough explanation.

  1. How is the return value of map_partitions determined?

    See the API docs referred to above.

  2. What influences the number of function calls besides the number of partitions / What criteria has a function to fulfil to be called once with each partition?

    You're correct that we're calling it once immediately to guess the dtypes/columns of the output. You can avoid this by specifying a meta= keyword directly. Other than that the function is called once per partition.

  3. What should be the return value of a function, that only "does" something, i.e. a procedure?

    You could always return an empty dataframe. You might also want to consider transforming your dataframe into a sequence of dask.delayed objects, which are typically more often used for ad-hoc computations.

  4. How should a function be designed, that returns arbitrary objects?

    If your function doesn't return series/dataframes then I recommend converting your dataframe to a sequence of dask.delayed objects with the DataFrame.to_delayed method.



来源:https://stackoverflow.com/questions/39215617/what-is-map-partitions-doing

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!