I am looking for a better explanation of the aggregate functionality that is available via Spark in Python.
The example I have is as follows (using pyspark from
I don't have enough reputation points to comment on the previous answer by Maasg. Actually, the zero value should be neutral with respect to the seqOp, meaning it must not change the seqOp result, like 0 for addition or 1 for multiplication.
You should NEVER use a non-neutral value, because it might be applied an arbitrary number of times. This behavior is not tied only to the number of partitions.
I tried the same experiment as stated in the question. With 1 partition, the zero value was applied 3 times; with 2 partitions, 6 times; with 3 partitions, 9 times; and so on.
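To make the point about neutral zero values concrete, here is a minimal pure-Python sketch that imitates the shape of an aggregate call without needing a Spark cluster. It is a simplified model, not Spark's actual code: the function `simulate_aggregate` and its round-robin partitioning are hypothetical, and it assumes the zero value is used once per partition plus once in the final combine. The exact number of applications in real Spark may differ (as the counts above show), which is exactly why only a neutral value is safe.

```python
from functools import reduce


def simulate_aggregate(data, num_partitions, zero_value, seq_op, comb_op):
    """Hypothetical stand-in for an aggregate call over partitioned data."""
    # Round-robin split into partitions (an assumption, not Spark's scheme).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # Fold each partition with seq_op, starting from the zero value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Merge the per-partition results with comb_op, again starting from zero.
    return reduce(comb_op, partials, zero_value)


def add(a, b):
    return a + b


data = [1, 2, 3, 4]

# Neutral zero value (0 for addition): the result does not depend on partitioning.
neutral = [simulate_aggregate(data, n, 0, add, add) for n in (1, 2, 3)]

# Non-neutral zero value (1 for addition): the result drifts with the partition count.
skewed = [simulate_aggregate(data, n, 1, add, add) for n in (1, 2, 3)]

print(neutral)  # [10, 10, 10]
print(skewed)   # [12, 13, 14]
```

With 0 as the zero value, every partition count gives 10. With 1, each extra application of the zero value adds 1 to the sum, so the answer changes as the data is split differently.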