Explain the aggregate functionality in Spark

误落风尘 2020-12-07 12:23

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python.

The example I have is as follows (using pyspark from
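(The example above is cut off. For context, here is a minimal sketch of a typical RDD.aggregate call of the kind the answers below discuss; the sum-and-count data is illustrative, not the asker's original code.)

    from pyspark import SparkContext

    sc = SparkContext("local", "aggregate-demo")

    # aggregate(zeroValue, seqOp, combOp):
    #   seqOp folds each element into a per-partition accumulator,
    #   combOp merges the per-partition accumulators.
    rdd = sc.parallelize([1, 2, 3, 4])
    seqOp = lambda acc, x: (acc[0] + x, acc[1] + 1)    # fold one element into (sum, count)
    combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two partial (sum, count) pairs
    print(rdd.aggregate((0, 0), seqOp, combOp))        # (10, 4)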

9 Answers
  •  情书的邮戳
    2020-12-07 12:38

    I don't have enough reputation points to comment on the previous answer by Maasg. Actually, the zero value should be 'neutral' towards the seqOp, meaning it should not interfere with the seqOp's result, like 0 towards addition, or 1 towards multiplication.
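    To illustrate, a quick sketch of neutral zero values (assuming an already-created SparkContext named sc; the data is made up):

        rdd = sc.parallelize([1, 2, 3, 4])

        # 0 is the identity for addition: folding it in any number of
        # times leaves the sum unchanged.
        print(rdd.aggregate(0, lambda acc, x: acc + x, lambda a, b: a + b))  # 10

        # 1 is the identity for multiplication, so it is the safe zero
        # value for a product.
        print(rdd.aggregate(1, lambda acc, x: acc * x, lambda a, b: a * b))  # 24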

    You should NEVER use non-neutral values, because the zero value may be applied an arbitrary number of times, and the number of applications is not tied solely to the number of partitions.

    I tried the same experiment as stated in the question: with 1 partition, the zero value was applied 3 times; with 2 partitions, 6 times; with 3 partitions, 9 times; and so on.
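    A sketch of that experiment (same sc assumption as above; the counts you see may differ by Spark version, which is exactly why non-neutral zero values are unsafe):

        data = [1, 2, 3, 4]  # sums to 10
        for n in (1, 2, 3):
            rdd = sc.parallelize(data, n)
            # With the non-neutral zero value 1 and addition, every extra
            # application of zeroValue inflates the sum by 1, so
            # (result - 10) counts how many times it was folded in.
            result = rdd.aggregate(1, lambda acc, x: acc + x, lambda a, b: a + b)
            print(n, "partitions ->", result, "(zero applied", result - 10, "times)")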
