Why is the fold action necessary in Spark?

生来不讨喜 2020-12-03 02:23

I've a silly question involving fold and reduce in PySpark. I understand the difference between these two methods, but, if both need the applied function to be a commutative monoid, I cannot figure out an example in which fold cannot be substituted by reduce.

1 Answer
  • 2020-12-03 03:04

    Empty RDD

    fold cannot be substituted by reduce when the RDD is empty:

    val rdd = sc.emptyRDD[Int]
    rdd.reduce(_ + _)
    // java.lang.UnsupportedOperationException: empty collection at   
    // org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$ ...
    
    rdd.fold(0)(_ + _)
    // Int = 0
    

    You can of course combine reduce with a check on isEmpty, but it is rather ugly, as sketched below.
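
    For illustration, a minimal sketch of that workaround (assuming the same SparkContext sc as above):

    val rdd = sc.emptyRDD[Int]
    // Guard the empty case by hand; 0 plays the role of fold's zero value
    val sum = if (rdd.isEmpty) 0 else rdd.reduce(_ + _)
    // sum: Int = 0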

    Mutable buffer

    Another use case for fold is aggregation with a mutable buffer. Consider the following RDD:

    import breeze.linalg.DenseVector
    
    val rdd = sc.parallelize(Array.fill(100)(DenseVector(1)), 8)
    

    Let's say we want the sum of all elements. A naive solution is to simply reduce with +:

    rdd.reduce(_ + _)
    

    Unfortunately it creates a new vector for each element. Since object creation and subsequent garbage collection are expensive, it could be better to use a mutable object. That is not possible with reduce (immutability of the RDD doesn't imply immutability of its elements), but it can be achieved with fold as follows:

    rdd.fold(DenseVector(0))((acc, x) => acc += x)
    // breeze.linalg.DenseVector[Int] = DenseVector(100)
    

    The zero element is used here as a mutable buffer: it is deserialized once per partition, so mutating it in place leaves the actual data untouched.
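
    Note that the zero value is applied once when folding each partition and once more when merging the partition results, so it must be a neutral element for the operation. A quick sketch of what can go wrong otherwise (hypothetical session, plain Ints for readability):

    val nums = sc.parallelize(1 to 4, 8)
    nums.fold(0)(_ + _)
    // Int = 10, correct: 0 is neutral for +
    nums.fold(1)(_ + _)
    // Int = 19 (10, plus 1 per partition, plus 1 for the final merge), wrong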

    Why does fold use acc = op(obj, acc) instead of acc = op(acc, obj)?

    See SPARK-6416 and SPARK-7683
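
    In short, unlike Scala's collection fold, Spark merges the per-partition results in the order tasks complete, which is not deterministic, so in practice the operation has to be commutative as well. A small sketch (hypothetical; string concatenation is associative but not commutative):

    val words = sc.parallelize(Seq("a", "b", "c", "d"), 4)
    words.fold("")(_ + _)
    // May yield "abcd", "cdab", ... depending on which tasks finish first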
