I am looking for a better explanation of the aggregate functionality that is available via Spark in Python.
The example I have is as follows (using pyspark).
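(The original pyspark snippet is not reproduced here; the following is a minimal sketch of such an aggregate call, assuming it is run in the pyspark shell where sc is predefined, that yields the same (10, 4) result as the Scala version below.)

>>> listRDD = sc.parallelize([1, 2, 3, 4], 2)
>>> # seqOp: fold each element of a partition into a (sum, count) accumulator
>>> seqOp = lambda acc, value: (acc[0] + value, acc[1] + 1)
>>> # combOp: merge the per-partition (sum, count) accumulators
>>> combOp = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])
>>> listRDD.aggregate((0, 0), seqOp, combOp)
(10, 4)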
For people looking for the Scala equivalent of the above example, here it is: same logic, same input, same result.
scala> val listRDD = sc.parallelize(List(1,2,3,4), 2)
listRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21
scala> listRDD.collect()
res7: Array[Int] = Array(1, 2, 3, 4)
scala> listRDD.aggregate((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1), (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
res10: (Int, Int) = (10,4)
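With two partitions, the seqOp folds each partition locally (for this input, typically [1, 2] → (3, 2) and [3, 4] → (7, 2)), and the combOp then merges the per-partition (sum, count) pairs into the final (10, 4).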