Why are aggregate and fold two different APIs in Spark?

Submitted by 允我心安 on 2019-12-04 04:43:47

fold can be implemented more efficiently because it doesn't depend on a fixed order of evaluation: each cluster node can fold its own chunk in parallel, and then a single small fold combines the per-node results at the end. With foldLeft, each element has to be folded in order, so nothing can be done in parallel.
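A minimal sketch of that idea in plain Scala (not Spark): the grouped split below stands in for cluster partitions. The equality with a sequential fold only holds because + is associative and 0 is its identity.

```scala
// Each "partition" folds independently, then one small fold at the
// end combines the per-partition results.
val data = (1 to 100).toList
val partitions = data.grouped(25).toList            // simulate 4 cluster nodes

val perPartition = partitions.map(_.fold(0)(_ + _)) // would run in parallel
val total = perPartition.fold(0)(_ + _)             // one small overall fold

// total == data.foldLeft(0)(_ + _), i.e. 5050
```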

(It's also nice to have a simpler API for the common case. The standard library provides reduce as well as foldLeft for the same reason.)

Specifically in Spark, the computation is distributed and performed in parallel, so foldLeft can't be implemented as it is in the standard library. Instead, Spark's aggregate requires two functions: seqOp, which folds each element of type T into an accumulator of type U, and combOp, which combines the U values from each partition into the final result:

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
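A rough illustration of that signature, simulated with plain collections rather than an RDD (the grouped split stands in for partitions). Here T = String and U = Int, so the accumulator type differs from the element type:

```scala
// Hypothetical data; grouped(2) simulates two partitions.
val words = List("spark", "fold", "aggregate", "api")
val parts = words.grouped(2).toList

val seqOp  = (acc: Int, s: String) => acc + s.length // (U, T) => U, per partition
val combOp = (a: Int, b: Int) => a + b               // (U, U) => U, across partitions

val perPart = parts.map(_.foldLeft(0)(seqOp))        // each partition independently
val totalChars = perPart.fold(0)(combOp)             // merge partial results

// totalChars == 5 + 4 + 9 + 3 == 21
```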

foldLeft, foldRight, reduceLeft, reduceRight, scanLeft and scanRight are operations where the accumulator can have a different type than the elements (e.g. (B, A) => B for foldLeft), and such operations can only be executed sequentially.
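A small example of why: with foldLeft the accumulator below is a String while the elements are Ints, and the nesting of the result shows that each step depends on the previous one, forcing left-to-right evaluation.

```scala
val xs = List(1, 2, 3)
// Accumulator type (String) differs from element type (Int),
// so the operation (String, Int) => String cannot be reassociated.
val rendered = xs.foldLeft("0")((acc, x) => s"($acc + $x)")
// rendered == "(((0 + 1) + 2) + 3)"
```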

fold is an operation where the accumulator has to have the same type as the elements ((A, A) => A), which is what makes parallel execution possible.
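For instance, taking a maximum fits this shape: the operation is (Int, Int) => Int, it is associative, and Int.MinValue acts as the identity, so partial maxima computed on separate chunks can be combined in any order.

```scala
val xs = List(7, 2, 9, 4)
// (A, A) => A, associative, with Int.MinValue as the zero element.
val maxVal = xs.fold(Int.MinValue)((a, b) => math.max(a, b))
// maxVal == 9
```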

aggregate is an operation where the accumulator can have a different type than the elements, but you then have to provide an additional function that defines how the per-partition accumulators are combined into the final result. This allows parallel execution: aggregate is effectively a combination of foldLeft (within each partition) and fold (across partitions).
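The classic example of that combination is computing a mean, which fold alone cannot express because the accumulator (sum, count) has a different type than the elements. A sketch using plain collections to stand in for partitions:

```scala
val nums = (1 to 8).toList
val parts = nums.grouped(4).toList                    // simulate two partitions

// seqOp: fold each Int into a (sum, count) accumulator — the foldLeft part.
val seqOp  = (acc: (Long, Long), x: Int) => (acc._1 + x, acc._2 + 1)
// combOp: merge per-partition accumulators — the fold part.
val combOp = (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2)

val perPart = parts.map(_.foldLeft((0L, 0L))(seqOp))
val (sum, count) = perPart.foldLeft((0L, 0L))(combOp)
val mean = sum.toDouble / count
// sum == 36, count == 8, mean == 4.5
```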

For more detailed information, you can have a look at the Coursera videos for the "Parallel Programming" course.
