I want to share this particular Apache Spark with Python (PySpark) solution because the documentation for it is quite poor.
I wanted to calculate the average value of K/V pairs (that is, the mean of the values for each key).
To my mind, a more readable equivalent to an aggregateByKey with two lambdas is:
# Pair each value with a count of 1, then sum values and counts per key
rdd1 = rdd1 \
    .mapValues(lambda v: (v, 1)) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
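For comparison, the aggregateByKey version this replaces would look roughly like the sketch below; the (0, 0) zero value and the two lambdas are my reconstruction of the standard sum/count pattern, not code from the original post.

# aggregateByKey equivalent: (0, 0) is the initial (sum, count) accumulator,
# the first lambda folds one value into the accumulator within a partition,
# and the second merges accumulators across partitions.
rdd1 = rdd1.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))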
In this way, the whole average calculation becomes:
# Build a (sum, count) pair per key, then divide to get the average
avg_by_key = rdd1 \
    .mapValues(lambda v: (v, 1)) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .mapValues(lambda v: v[0] / v[1]) \
    .collectAsMap()

Note that under Python 2, v[0] / v[1] performs integer division for integer values; use float(v[0]) / v[1] there to get a true average.
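Putting it together as a minimal, self-contained sketch (the sample data, app name, and local SparkContext are mine for illustration, not from the original):

from pyspark import SparkContext

sc = SparkContext("local", "avg_by_key_example")

# Hypothetical sample data: (key, value) pairs
rdd1 = sc.parallelize([("a", 1), ("a", 3), ("b", 10)])

avg_by_key = rdd1 \
    .mapValues(lambda v: (v, 1)) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
    .mapValues(lambda v: v[0] / v[1]) \
    .collectAsMap()

print(avg_by_key)  # {'a': 2.0, 'b': 10.0} on Python 3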