PySpark - Sum a column in dataframe and return results as int

执念已碎 2020-12-24 08:13

I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.

df = spark.cr         
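
A minimal sketch of such a dataframe, assuming a local SparkSession and an illustrative numeric column named "Number" (the data below is only an example):

    from pyspark.sql import SparkSession

    # Hypothetical setup with illustrative data: one string column
    # and the numeric column to be summed.
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("A", 20), ("B", 30), ("C", 40)],
        ["Letter", "Number"],
    )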


        
6 Answers
  •  滥情空心 2020-12-24 08:27

    The simplest way, really:

    df.groupBy().sum().collect()
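
    Calling groupBy() with no columns aggregates over the whole dataframe, so collect() returns a single Row; indexing into it gives the plain Python number the question asks for. A sketch, assuming "Number" is the only numeric column:

    result = df.groupBy().sum().collect()[0][0]   # single Row, single aggregate column
    print(result)                                 # e.g. 90 for the illustrative data above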
    

    But it is a very slow operation. Avoid groupByKey; use the RDD API with reduceByKey instead:

    df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]
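
    Note that x[1] picks the value by position, so this assumes the column to sum is the second one in the row. A sketch of the same idea with the column referenced by name instead (using the illustrative "Number" column):

    total = (
        df.rdd
          .map(lambda row: (1, row["Number"]))   # give every record the same key
          .reduceByKey(lambda a, b: a + b)       # sum the values for that key
          .collect()[0][1]                       # one (key, sum) pair; take the sum
    )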
    

    I tried it on a bigger dataset and measured the processing time:

    RDD and reduceByKey: 2.23 s

    GroupByKey: 30.5 s
