Explain the aggregate functionality in Spark

误落风尘 asked on 2020-12-07 12:23

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python.

The example I have is as follows (using pyspark from …

9 Answers
  •  既然无缘 (2020-12-07 12:54)

    Great explanations; they really helped me understand the underlying workings of the aggregate function. I have played with it for some time and found the following.

    • If you use (0,0) as the acc (initial accumulator), it does not change the output of the function.

    • If the initial accumulator is changed, the result is computed as below:

    [ sum of RDD elements + (acc initial value * number of RDD partitions) + acc initial value ]

    For the question here, I would suggest checking the number of partitions; it should be 8, as far as I understand, because every time the seq op is applied to a partition of the RDD it starts from the initial acc value, and the comb op then uses the acc initial value once more (see the sketch below).
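    Here is a minimal PySpark sketch of that behaviour; the seq_op/comb_op names and the local[2] context are my own assumptions, not part of the original question. The zero value is folded into each partition once by the seq op, and once more when the comb op merges the partition results.

        from pyspark import SparkContext

        # Assumed setup: a local context whose default parallelism is 2,
        # so parallelize() below creates 2 partitions.
        sc = SparkContext("local[2]", "aggregate-demo")

        rdd = sc.parallelize([1, 2, 3, 4])

        # seq_op folds the elements of one partition, starting from the zero value;
        # comb_op merges the per-partition results, again starting from the zero value.
        seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # (running sum, running count)
        comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two (sum, count) pairs

        print(rdd.aggregate((0, 0), seq_op, comb_op))  # (10, 4): a (0, 0) zero adds nothing
        print(rdd.aggregate((1, 0), seq_op, comb_op))  # (13, 4): 10 + 1 * 2 partitions + 1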

    For example, List(1,2,3,4) & acc (1,0):

    Get the number of partitions in Scala with rdd.partitions.size.

    If there are 2 partitions and 4 elements, then => [ 10 + 1 * 2 + 1 ] => (13,4)

    If there are 4 partitions and 4 elements, then => [ 10 + 1 * 4 + 1 ] => (15,4)
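    To check this concretely, here is another small sketch (again my own, not from the answer) that pins the partition count explicitly when creating the RDD; getNumPartitions() is the PySpark counterpart of Scala's rdd.partitions.size.

        from pyspark import SparkContext

        sc = SparkContext("local[*]", "aggregate-partitions-demo")

        # Same (sum, count) seq op / comb op as in the sketch above.
        seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
        comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

        rdd2 = sc.parallelize([1, 2, 3, 4], 2)   # force 2 partitions
        rdd4 = sc.parallelize([1, 2, 3, 4], 4)   # force 4 partitions

        print(rdd2.getNumPartitions())                  # 2
        print(rdd2.aggregate((1, 0), seq_op, comb_op))  # (13, 4) = (10 + 1*2 + 1, 4)
        print(rdd4.getNumPartitions())                  # 4
        print(rdd4.aggregate((1, 0), seq_op, comb_op))  # (15, 4) = (10 + 1*4 + 1, 4)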

    Hope this helps; you can check here for an explanation. Thanks.
