How to slice and sum elements of array column?

暖寄归人 2020-12-03 13:09

I would like to sum (or apply other aggregate functions to) the elements of an array column using Spark SQL.

I have a table as

+-------+-------+--------------------+
|dept_id|dept_nm|         emp_details|
+-------+-------+--------------------+
|     10|Finance|[100, 200, 300, 4...|
|     20|     IT|   [10, 20, 50, 100]|
+-------+-------+--------------------+

6 Answers
  •  佛祖请我去吃肉
    2020-12-03 13:23

    A possible approach is to use explode() on your array column and then aggregate the output by the grouping key. For example:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._
    
    // explode turns every array element into its own row,
    // which can then be summed with a regular groupBy/agg
    (mytable
      .withColumn("emp_sum",
        explode($"emp_details"))
      .groupBy("dept_nm")
      .agg(sum("emp_sum")).show)
    +-------+------------+
    |dept_nm|sum(emp_sum)|
    +-------+------------+
    |Finance|        1500|
    |     IT|         180|
    +-------+------------+
    

    To select only specific values from your array, we can adapt the answer from the linked question with a slight modification:

    // Scala-style slice: 0-based 'from' index, exclusive 'to' index
    val slice = udf((array : Seq[Int], from : Int, to : Int) => array.slice(from, to))
    
    (mytable
      .withColumn("slice", 
        slice($"emp_details", 
          lit(0), 
          lit(3)))
      .withColumn("emp_sum",
        explode($"slice"))
      .groupBy("dept_nm")
      .agg(sum("emp_sum")).show)
    +-------+------------+
    |dept_nm|sum(emp_sum)|
    +-------+------------+
    |Finance|         600|
    |     IT|          80|
    +-------+------------+
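
    On Spark 2.4+ you could also drop the UDF and use the built-in slice SQL function, which takes a 1-based start index and a length rather than from/until. A rough equivalent of the query above, again reusing the imports from the first snippet (untested sketch):

    (mytable
      .withColumn("emp_sum",
        // slice(emp_details, 1, 3) keeps the first three elements
        explode(expr("slice(emp_details, 1, 3)")))
      .groupBy("dept_nm")
      .agg(sum("emp_sum")).show)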
    

    Data:

    val data = Seq((10, "Finance", Array(100,200,300,400,500)),
                   (20, "IT", Array(10,20,50,100)))
    val mytable = sc.parallelize(data).toDF("dept_id", "dept_nm","emp_details")
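
    On newer Spark versions with a SparkSession rather than sqlContext/sc, the same DataFrame can be built as follows (equivalent sketch, assuming spark.implicits._ is in scope):

    import spark.implicits._
    
    val mytable = Seq((10, "Finance", Array(100, 200, 300, 400, 500)),
                      (20, "IT", Array(10, 20, 50, 100)))
      .toDF("dept_id", "dept_nm", "emp_details")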
    
