How to sum the values of one column of a DataFrame in Spark/Scala


I have a DataFrame that I read from a CSV file with many columns like: timestamp, steps, heartrate, etc.

I want to sum the values of each column, for instance the total number of steps.

5 Answers
  • 2020-12-08 07:21

    Simply apply the aggregation function sum to your column (PySpark syntax, matching the docs linked below):

    df.groupBy().sum('steps').show()
    

    See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

    See also: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
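
    Since the question asks for Scala, here is a minimal sketch of the same idea with the Scala DataFrame API (assuming a SparkSession in scope as spark and a numeric "steps" column):

    import spark.implicits._
    import org.apache.spark.sql.functions.sum

    // Hypothetical sample data; replace with your CSV-backed DataFrame.
    val df = Seq(1, 2, 3).toDF("steps")

    // An empty groupBy() aggregates over the entire DataFrame.
    df.groupBy().sum("steps").show()
    // +----------+
    // |sum(steps)|
    // +----------+
    // |         6|
    // +----------+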

  • 2020-12-08 07:29

    Using a Spark SQL query, in case it helps anyone!

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.SparkConf

    val conf = new SparkConf().setMaster("local[2]").setAppName("test")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    import spark.implicits._ // required for toDF

    // Name the column "steps" so the SQL below can reference it.
    val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")

    df.createOrReplaceTempView("steps")
    val sum = spark.sql("select sum(steps) as stepsSum from steps").first().getLong(0)
    println("steps sum = " + sum) // prints 28
    
  • 2020-12-08 07:32

    You must first import the functions:

    import org.apache.spark.sql.functions._
    

    Then you can use them like this:

    val df = CSV.load(args(0))
    val sumSteps = df.agg(sum("steps")).first.get(0)
    

    You can also cast the result if needed:

    val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
    

    Edit:

    For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:

    val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
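
    To read the individual sums back out of the returned Row, something like the following should work (a sketch; sum_col1 and sum_col2 are just the aliases chosen above):

    // sum() over an integral column yields a Long; use getAs[Double] for double columns.
    val sumCol1 = sums.getAs[Long]("sum_col1")
    val sumCol2 = sums.getAs[Long]("sum_col2")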
    

    Edit2:

    For dynamically applying the aggregations, the following options are available (a schema-based sketch follows this list):

    • Applying to all numeric columns at once:
    df.groupBy().sum()
    
    • Applying to a list of numeric column names:
    val columnNames = List("col1", "col2")
    df.groupBy().sum(columnNames: _*)
    
    • Applying to a list of numeric column names with aliases and/or casts:
    val cols = List("col1", "col2")
    val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
    df.groupBy().agg(sums.head, sums.tail:_*).show()
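
    If the numeric columns are not known in advance, one option is to derive the list from the schema. A sketch, assuming Spark 2.x and at least one numeric column:

    import org.apache.spark.sql.functions.sum
    import org.apache.spark.sql.types.NumericType

    // Collect the names of all numeric columns from the schema.
    val numericCols = df.schema.fields
      .collect { case f if f.dataType.isInstanceOf[NumericType] => f.name }

    val aggs = numericCols.map(c => sum(c).as("sum_" + c))
    df.groupBy().agg(aggs.head, aggs.tail: _*).show()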
    
  • 2020-12-08 07:40

    Not sure this was around when this question was asked, but:

    df.describe("columnName").show()

    gives count, mean, stddev, min, and max for that column. Calling df.describe() with no arguments returns these statistics for all numeric columns.
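
    To pick one statistic out programmatically, you can filter the summary DataFrame; a small sketch (note that describe() returns string-typed values):

    import org.apache.spark.sql.functions.col

    val stats = df.describe("steps")
    // The result has a "summary" column ("count", "mean", "stddev", "min", "max")
    // plus one string column per described input column.
    val mean = stats.filter(col("summary") === "mean").first().getString(1).toDouble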

  • 2020-12-08 07:41

    If you want to sum all the values of one column, you can also work directly on the DataFrame's underlying RDD and reduce.

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    
    val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")
    // Pull the Int out of each Row, then sum with reduce.
    df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)

    // res1: Int = 19
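
    For comparison, the typed Dataset API (Spark 2.x) expresses the same sum without the cast through Row. A sketch, assuming spark.implicits._ is in scope:

    // Dataset[Int] lets reduce work on the values directly.
    val total = df.select("steps").as[Int].reduce(_ + _) // 19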
    