I would like to sum (or perform other aggregate functions) over an array column using Spark SQL.
I have a table like this:
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details              |
+-------+-------+-------------------------+
|10     |Finance|[100, 200, 300, 400, 500]|
|20     |IT     |[10, 20, 50, 100]        |
+-------+-------+-------------------------+
A possible approach is to use explode() on your Array column and then aggregate the output by the unique key. For example:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
(mytable
.withColumn("emp_sum",
explode($"emp_details"))
.groupBy("dept_nm")
.agg(sum("emp_sum")).show)
+-------+------------+
|dept_nm|sum(emp_sum)|
+-------+------------+
|Finance| 1500|
| IT| 180|
+-------+------------+
To select only specific values in your array, we can work with the answer from the linked question and apply it with a slight modification:
val slice = udf((array : Seq[Int], from : Int, to : Int) => array.slice(from,to))
(mytable
.withColumn("slice",
slice($"emp_details",
lit(0),
lit(3)))
.withColumn("emp_sum",
explode($"slice"))
.groupBy("dept_nm")
.agg(sum("emp_sum")).show)
+-------+------------+
|dept_nm|sum(emp_sum)|
+-------+------------+
|Finance| 600|
| IT| 80|
+-------+------------+
Data:
val data = Seq((10, "Finance", Array(100,200,300,400,500)),
(20, "IT", Array(10,20,50,100)))
val mytable = sc.parallelize(data).toDF("dept_id", "dept_nm","emp_details")
The RDD way is missing, so let me add it.
val df = Seq((10, "Finance", Array(100,200,300,400,500)),(20, "IT", Array(10,20,50,100))).toDF("dept_id", "dept_nm","emp_details")
import scala.collection.mutable._
val rdd1 = df.rdd.map( x=> {val p = x.getAs[mutable.WrappedArray[Int]]("emp_details").toArray; Row.merge(x,Row(p.sum,p.slice(0,2).sum)) })
spark.createDataFrame(rdd1,df.schema.add(StructField("sumArray",IntegerType)).add(StructField("sliceArray",IntegerType))).show(false)
Output:
+-------+-------+-------------------------+--------+----------+
|dept_id|dept_nm|emp_details |sumArray|sliceArray|
+-------+-------+-------------------------+--------+----------+
|10 |Finance|[100, 200, 300, 400, 500]|1500 |300 |
|20 |IT |[10, 20, 50, 100] |180 |30 |
+-------+-------+-------------------------+--------+----------+
As of Spark 2.4, Spark SQL supports higher-order functions for manipulating complex data structures, including arrays.
The "modern" solution would be as follows:
scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details |
+-------+-------+-------------------------+
|10 |Finance|[100, 200, 300, 400, 500]|
|20 |IT |[10, 20, 50, 100] |
+-------+-------+-------------------------+
input.createOrReplaceTempView("mytable")
val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
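If you prefer to stay in the DataFrame API instead of registering a temp view, the same aggregate higher-order function can be wrapped in expr; a minimal sketch:
import org.apache.spark.sql.functions.expr
// Same aggregate call as in the SQL text above, applied through the DataFrame API
input
  .withColumn("sum", expr("aggregate(emp_details, 0, (acc, value) -> acc + value)"))
  .select("dept_id", "dept_nm", "sum")
  .show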
DISCLAIMER I would not recommend the Dataset.map approach below (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map. The query forces Spark to deserialize the data and load it onto the JVM (from memory regions that are managed by Spark outside the JVM). That will inevitably lead to more frequent GCs and hence make performance worse.
Nevertheless, one solution would be to use a Dataset, where the combination of Spark SQL and Scala could show its power.
scala> val inventory = Seq(
| (10, "Finance", Seq(100, 200, 300, 400, 500)),
| (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]
// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
toDF("dept_id", "dept_nm", "sum").
show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
I'm leaving the slice part as an exercise as it's equally simple.
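For completeness, a sketch of what that exercise could look like, summing only the first three elements with plain Scala's Seq.slice (the output column name is just illustrative):
// Slice in plain Scala before summing: only the first three elements count
inventory.as[(Long, String, Seq[Int])].
  map { case (deptId, deptName, details) => (deptId, deptName, details.slice(0, 3).sum) }.
  toDF("dept_id", "dept_nm", "sliced_sum").
  show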
Building on zero323's awesome answer: in case you have an array of long integers, i.e. BIGINT, you need to change the initial value from 0 to BIGINT(0) so that the accumulator type matches the element type. You then have
dfSliced.selectExpr(
"*",
"aggregate(emp_details, BIGINT(0), (x, y) -> x + y) as details_sum",
"aggregate(emp_details_sliced, BIGINT(0), (x, y) -> x + y) as details_sliced_sum"
).show
Since Spark 2.4 you can slice with the slice function:
import org.apache.spark.sql.functions.slice
val df = Seq(
(10, "Finance", Seq(100, 200, 300, 400, 500)),
(20, "IT", Seq(10, 20, 50, 100))
).toDF("dept_id", "dept_nm", "emp_details")
val dfSliced = df.withColumn(
"emp_details_sliced",
slice($"emp_details", 1, 3)
)
dfSliced.show(false)
+-------+-------+-------------------------+------------------+
|dept_id|dept_nm|emp_details |emp_details_sliced|
+-------+-------+-------------------------+------------------+
|10 |Finance|[100, 200, 300, 400, 500]|[100, 200, 300] |
|20 |IT |[10, 20, 50, 100] |[10, 20, 50] |
+-------+-------+-------------------------+------------------+
and sum arrays with aggregate:
dfSliced.selectExpr(
"*",
"aggregate(emp_details, 0, (x, y) -> x + y) as details_sum",
"aggregate(emp_details_sliced, 0, (x, y) -> x + y) as details_sliced_sum"
).show
+-------+-------+--------------------+------------------+-----------+------------------+
|dept_id|dept_nm| emp_details|emp_details_sliced|details_sum|details_sliced_sum|
+-------+-------+--------------------+------------------+-----------+------------------+
| 10|Finance|[100, 200, 300, 4...| [100, 200, 300]| 1500| 600|
| 20| IT| [10, 20, 50, 100]| [10, 20, 50]| 180| 80|
+-------+-------+--------------------+------------------+-----------+------------------+
Here is an alternative to mtoto's answer without using a groupBy (I really don't know which one is fastest: UDF, mtoto's solution, or mine; comments welcome).
In general there is a performance impact when using a UDF, so it is worth avoiding one where possible. For your problem you can avoid a UDF entirely: what I would use instead is a Column expression generated with Scala logic.
Data:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((10, "Finance", Array(100, 200, 300, 400, 500)),
             (20, "IT", Array(10, 20, 50, 100)))
  .toDF("dept_id", "dept_nm", "emp_details")
You need some trickery to be able to traverse an ArrayType; you can play a bit with the solution to discover various problems (see the edit at the bottom for the slice part). Here is my proposal, but you might find better. First you take the maximum array length:
val maxLength = df.select(size('emp_details).as("l")).groupBy().max("l").first.getInt(0)
Then you use it, testing for the rows that have a shorter array:
val sumArray = (0 until maxLength)
  .map(i => when(size('emp_details) > i, 'emp_details(i)).otherwise(lit(0)))
  .reduce(_ + _)
  .as("sumArray")
val res = df
.select('dept_id,'dept_nm,'emp_details,sumArray)
Result:
+-------+-------+--------------------+--------+
|dept_id|dept_nm| emp_details|sumArray|
+-------+-------+--------------------+--------+
| 10|Finance|[100, 200, 300, 4...| 1500|
| 20| IT| [10, 20, 50, 100]| 180|
+-------+-------+--------------------+--------+
I advise you to look at sumArray to understand what it is doing.
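For intuition, here is roughly what sumArray expands to when maxLength is 5; this is a hand-written illustration, not generated output:
// Hand-unrolled equivalent of sumArray for maxLength = 5: one guarded term per index
val unrolled = (
  when(size('emp_details) > 0, 'emp_details(0)).otherwise(lit(0)) +
  when(size('emp_details) > 1, 'emp_details(1)).otherwise(lit(0)) +
  when(size('emp_details) > 2, 'emp_details(2)).otherwise(lit(0)) +
  when(size('emp_details) > 3, 'emp_details(3)).otherwise(lit(0)) +
  when(size('emp_details) > 4, 'emp_details(4)).otherwise(lit(0))
).as("sumArray")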
Edit: Of course I only read half of the question again... But if you want to change the items over which to sum, you can see that it becomes obvious with this solution (i.e. you don't need a slice function): just change (0 until maxLength) to the range of indices you need:
def sumArray(from: Int, max: Int) = (from until max)
  .map(i => when(size('emp_details) > i, 'emp_details(i)).otherwise(lit(0)))
  .reduce(_ + _)
  .as("sumArray")