How to slice and sum elements of array column?

暖寄归人 2020-12-03 13:09

I would like to sum the elements of an array column (or apply other aggregate functions to them) using Spark SQL.

I have a table like this:

+-------+-------+-         


        
6 answers
  •  长情又很酷
    2020-12-03 13:41

    Since Spark 2.4 you can slice an array column with the built-in slice function (the start index is 1-based):

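    import org.apache.spark.sql.functions.slice
    // Provides toDF and the $"col" syntax; already in scope in spark-shell.
    import spark.implicits._
    
    // Sample data: one array of values per department.
    val df = Seq(
      (10, "Finance", Seq(100, 200, 300, 400, 500)),
      (20, "IT", Seq(10, 20, 50, 100))
    ).toDF("dept_id", "dept_nm", "emp_details")
    
    // Keep the first three elements: start at position 1, take 3.
    val dfSliced = df.withColumn(
      "emp_details_sliced",
      slice($"emp_details", 1, 3)
    )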
    
    dfSliced.show(false)
    
    +-------+-------+-------------------------+------------------+
    |dept_id|dept_nm|emp_details              |emp_details_sliced|
    +-------+-------+-------------------------+------------------+
    |10     |Finance|[100, 200, 300, 400, 500]|[100, 200, 300]   |
    |20     |IT     |[10, 20, 50, 100]        |[10, 20, 50]      |
    +-------+-------+-------------------------+------------------+
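The start and length arguments of slice are easy to get wrong, so here is a minimal sketch of two other windows over the same kind of data. It assumes a local SparkSession; the column names second_and_third and last_two are made up for illustration. A negative start counts from the end of the array.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.slice

// Assumption: a throwaway local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("slice-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (10, Seq(100, 200, 300, 400, 500)),
  (20, Seq(10, 20, 50, 100))
).toDF("dept_id", "emp_details")

val tails = df.select(
  $"dept_id",
  // Positions are 1-based: start at the 2nd element and take 2.
  slice($"emp_details", 2, 2).as("second_and_third"),
  // A negative start counts from the end: take the last 2 elements.
  slice($"emp_details", -2, 2).as("last_two")
)
tails.show(false)
```

For the Finance row this should give [200, 300] and [400, 500].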
    

    You can then sum the arrays with the aggregate higher-order function (also available since Spark 2.4):

    dfSliced.selectExpr(
      "*", 
      "aggregate(emp_details, 0, (x, y) -> x + y) as details_sum",  
      "aggregate(emp_details_sliced, 0, (x, y) -> x + y) as details_sliced_sum"
    ).show
    
    +-------+-------+--------------------+------------------+-----------+------------------+
    |dept_id|dept_nm|         emp_details|emp_details_sliced|details_sum|details_sliced_sum|
    +-------+-------+--------------------+------------------+-----------+------------------+
    |     10|Finance|[100, 200, 300, 4...|   [100, 200, 300]|       1500|               600|
    |     20|     IT|   [10, 20, 50, 100]|      [10, 20, 50]|        180|                80|
    +-------+-------+--------------------+------------------+-----------+------------------+
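The question also asks about other aggregate functions. The following is a rough sketch, not part of the original answer, that combines slice and aggregate in a single expression and adds a maximum and an average. It assumes a local SparkSession, and the output column names (sliced_sum, details_max, details_avg) are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("array-aggregates-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (10, "Finance", Seq(100, 200, 300, 400, 500)),
  (20, "IT", Seq(10, 20, 50, 100))
).toDF("dept_id", "dept_nm", "emp_details")

val stats = df.selectExpr(
  "dept_id",
  // Slice and sum in one step, without an intermediate column.
  "aggregate(slice(emp_details, 1, 3), 0, (acc, x) -> acc + x) as sliced_sum",
  // Built-in array function for the maximum element.
  "array_max(emp_details) as details_max",
  // Average = sum of the elements divided by the array size.
  "aggregate(emp_details, 0, (acc, x) -> acc + x) / size(emp_details) as details_avg"
)
stats.show(false)
```

The start value 0 works here because the arrays hold integers. For arrays of longs or doubles, the start value most likely needs a matching type, for example cast(0 as double).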
    
