Explain the aggregate functionality in Spark

误落风尘 asked on 2020-12-07 12:23

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python.

The example I have is as follows (using pyspark from …

9 Answers
  •  既然无缘 (2020-12-07 12:54)

    Great explanations; they really helped me understand the underlying workings of the aggregate function. I have played with it for some time and found the following.

    • If you use (0,0) as the acc (initial accumulator), it does not change the output of the function.

    • If the initial accumulator is changed, the result is computed as below:

    [ sum of RDD elements + (acc initial value * number of RDD partitions) + acc initial value ]

    For the question here, I would suggest checking the number of partitions; it should be 8, as far as I understand, because every time the seq op is applied to a partition of the RDD it starts from the initial acc value, and the comb op then uses the acc initial value once more (see the sketch below).
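    Here is a minimal PySpark sketch of that behaviour; the seq_op/comb_op names and the local[2] context are my own assumptions, not part of the original question. The zero value is folded into each partition once by the seq op, and once more when the comb op merges the partition results.

        from pyspark import SparkContext

        # Assumed setup: a local context whose default parallelism is 2,
        # so parallelize() below creates 2 partitions.
        sc = SparkContext("local[2]", "aggregate-demo")

        rdd = sc.parallelize([1, 2, 3, 4])

        # seq_op folds the elements of one partition, starting from the zero value;
        # comb_op merges the per-partition results, again starting from the zero value.
        seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # (running sum, running count)
        comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two (sum, count) pairs

        print(rdd.aggregate((0, 0), seq_op, comb_op))  # (10, 4): a (0, 0) zero adds nothing
        print(rdd.aggregate((1, 0), seq_op, comb_op))  # (13, 4): 10 + 1 * 2 partitions + 1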

    For example, List(1,2,3,4) & acc (1,0):

    Get the number of partitions in Scala with rdd.partitions.size.

    If there are 2 partitions and 4 elements, then => [ 10 + 1 * 2 + 1 ] => (13,4)

    If there are 4 partitions and 4 elements, then => [ 10 + 1 * 4 + 1 ] => (15,4)
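    To check this concretely, here is another small sketch (again my own, not from the answer) that pins the partition count explicitly when creating the RDD; getNumPartitions() is the PySpark counterpart of Scala's rdd.partitions.size.

        from pyspark import SparkContext

        sc = SparkContext("local[*]", "aggregate-partitions-demo")

        # Same (sum, count) seq op / comb op as in the sketch above.
        seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
        comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

        rdd2 = sc.parallelize([1, 2, 3, 4], 2)   # force 2 partitions
        rdd4 = sc.parallelize([1, 2, 3, 4], 4)   # force 4 partitions

        print(rdd2.getNumPartitions())                  # 2
        print(rdd2.aggregate((1, 0), seq_op, comb_op))  # (13, 4) = (10 + 1*2 + 1, 4)
        print(rdd4.getNumPartitions())                  # 4
        print(rdd4.aggregate((1, 0), seq_op, comb_op))  # (15, 4) = (10 + 1*4 + 1, 4)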

    Hope this helps; you can check here for an explanation. Thanks.
