Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

温柔的废话 (2020-11-28 21:31)

I want to share this particular Apache Spark with Python solution because the documentation for it is quite poor.

I wanted to calculate the average value of K/V pairs (stored in a Pairwise RDD), by KEY.
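
As a minimal setup sketch, with hypothetical sample data (the post's own data isn't shown here), the starting point is a pairwise RDD like this:

    from pyspark import SparkContext

    sc = SparkContext("local", "avg_by_key")

    # Hypothetical (key, value) pairs; the goal is the per-key average,
    # e.g. 'a' -> (2 + 4) / 2 = 3.0
    rdd1 = sc.parallelize([('a', 2), ('a', 4), ('b', 6)])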

4 Answers
  •  日久生厌 (2020-11-28 21:56)

    To my mind, a more readable equivalent to an aggregateByKey call with two lambdas (sketched below for comparison) is:

    # Pair each value with a count of 1, then sum the running
    # (sum, count) tuples per key.
    rdd1 = rdd1 \
        .mapValues(lambda v: (v, 1)) \
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
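
    For reference, the aggregateByKey version this replaces, applied to the original (key, value) RDD, might look like the following sketch; the thread doesn't show that code, so the zero value and lambdas here are assumptions:

    # (sum, count) accumulator per key, starting from (0, 0)
    rdd1 = rdd1.aggregateByKey(
        (0, 0),
        lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold one value into a partition-local accumulator
        lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge accumulators across partitions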
    

    Using the reduceByKey formulation, the whole average calculation would be:

    # Full pipeline: pair values with counts, sum per key, divide
    # sum by count (true division in Python 3), and collect the
    # per-key averages into a local dict.
    avg_by_key = rdd1 \
        .mapValues(lambda v: (v, 1)) \
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
        .mapValues(lambda v: v[0] / v[1]) \
        .collectAsMap()
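
    As a quick check with the hypothetical input [('a', 2), ('a', 4), ('b', 6)] from above, and under Python 3 where / is true division, the collected result is a plain local dict:

    print(avg_by_key)  # {'a': 3.0, 'b': 6.0}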
    
