I would like to sum (or apply other aggregate functions to) an array column using SparkSQL.
I have a table as
+-------+-------+-
Here is an alternative to mtoto's answer that does not use a groupBy. (I really don't know which is fastest: the UDF, mtoto's solution or mine. Comments welcome.)
In general, using a UDF incurs a performance cost. There is an answer which you might want to read, and this resource is a good read on UDFs.
Now for your problem, you can avoid the use of a UDF. What I would use is a Column expression generated with Scala logic.
data:
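// Assumes `spark` is the active SparkSession (spark-shell provides it)
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((10, "Finance", Array(100,200,300,400,500)),
(20, "IT", Array(10, 20, 50,100)))
.toDF("dept_id", "dept_nm","emp_details")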
You need some trickery to traverse an ArrayType column. You can play with the solution to discover the various problems (see the edit at the bottom for the slice part). Here is my proposal, though you might find a better one. First, take the maximum array length:
val maxLength = df.select(size('emp_details).as("l")).groupBy().max("l").first.getInt(0)
Then use it to build the sum, substituting 0 wherever a row's array is shorter than the index
val sumArray = (0 until maxLength)
.map(i => when(size('emp_details) > i,'emp_details(i)).otherwise(lit(0)))
.reduce(_ + _)
.as("sumArray")
val res = df
.select('dept_id,'dept_nm,'emp_details,sumArray)
result:
+-------+-------+--------------------+--------+
|dept_id|dept_nm| emp_details|sumArray|
+-------+-------+--------------------+--------+
| 10|Finance|[100, 200, 300, 4...| 1500|
| 20| IT| [10, 20, 50, 100]| 180|
+-------+-------+--------------------+--------+
I advise you to look at sumArray to understand what it is doing.
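To make the logic of sumArray easier to follow, here is a rough model of it in plain Scala collections, with no Spark involved. The helper name paddedSum and the value longest are hypothetical and exist only for this illustration. The real solution builds a Column expression rather than computing over local sequences.

```scala
// Plain-Scala model of the Column expression above (no Spark required).
// paddedSum is a hypothetical name used only for this sketch.
def paddedSum(xs: Seq[Int], maxLength: Int): Int =
  (0 until maxLength)
    // mirrors when(size(col) > i, col(i)).otherwise(lit(0))
    .map(i => if (xs.size > i) xs(i) else 0)
    // mirrors .reduce(_ + _) over the per-index Column terms
    .reduce(_ + _)

// Same values as the emp_details column in the example data
val rows = Seq(Seq(100, 200, 300, 400, 500), Seq(10, 20, 50, 100))
val longest = rows.map(_.size).max
val sums = rows.map(paddedSum(_, longest))
```

Like the original, this model relies on reduce, so it would throw on an empty index range (for example if every array were empty).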
Edit: Of course I only read half of the question again... But if you want to change which items are summed, this solution makes it straightforward (i.e. you don't need a slice function): just replace (0 until maxLength) with the range of indices you need:
def sumArray(from: Int, max: Int) = (from until max)
.map(i => when(size('emp_details) > i,'emp_details(i)).otherwise(lit(0)))
.reduce(_ + _)
.as("sumArray")
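As a sketch of how the ranged version behaves, the same idea can be modelled on plain Scala sequences. The names rangeSum, finance and itDept are hypothetical and used only here. An index past the end of a shorter array contributes 0, just as in the when/otherwise expression.

```scala
// Plain-Scala model of sumArray(from, max); rangeSum is a hypothetical name.
def rangeSum(xs: Seq[Int], from: Int, max: Int): Int =
  (from until max)
    // an index beyond the end of the array contributes 0
    .map(i => if (xs.size > i) xs(i) else 0)
    .reduce(_ + _)

val finance = Seq(100, 200, 300, 400, 500)
val itDept = Seq(10, 20, 50, 100)

// Sum the items at indices 1 and 2
val middle = Seq(finance, itDept).map(rangeSum(_, 1, 3))
// Sum the items at indices 3 and 4; itDept has no index 4
val tail = Seq(finance, itDept).map(rangeSum(_, 3, 5))
```

Note that from must be strictly less than max, otherwise the range is empty and reduce fails.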