问题
I have written below code to group and aggregate the columns
val gmList = List("gc1","gc2","gc3")
val aList = List("val1","val2","val3","val4","val5")
val cype = "first"
val exprs = aList.map((_ -> cype )).toMap
dfgroupBy(gmList.map (col): _*).agg (exprs).show
but this create a columns with appending aggregation name in all column as shown below
so I want to alias that name first(val1) -> val1, I want to make this code generic as part of exprs
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
| gc1 | gc2 | gc3 | first(val1) | first(val2)| first(val3) | first(val4) | first(val5) |
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
回答1:
One approach would be to alias the aggregated columns to the original column names in a subsequent select
. I would also suggest generalizing the single aggregate function (i.e. first
) to a list of functions, as shown below:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 10, "a1", "a2", "a3"),
(1, 10, "b1", "b2", "b3"),
(2, 20, "c1", "c2", "c3"),
(2, 30, "d1", "d2", "d3"),
(2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")
val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")
// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")
val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))
df.
groupBy(gmList.map(col): _*).agg(afPairs.toMap).
select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// | 2| 20| c1| c2| c3|
// | 1| 10| a1| a2| a3|
// | 2| 30| d1| d2| d3|
// +---+---+----+----+----+
回答2:
You can slightly change the way you are generating the expression and use the function alias
in there:
import org.apache.spark.sql.functions.col
val aList = List("val1","val2","val3","val4","val5")
val exprs = aList.map(c => first(col(c)).alias(c) )
dfgroupBy( gmList.map(col) : _*).agg(exprs.head , exprs.tail: _*).show
回答3:
Here's a more generic version that will work with any aggregate functions and doesn't require naming your aggregate columns up front. Build your grouped df
as you normally would, then use:
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)
This will extract out only what is inside the parentheses, regardless of what aggregate function is called (e.g. first(val) -> val
, sum(val) -> val
, count(val) -> val
, etc.).
来源:https://stackoverflow.com/questions/53002360/spark-expression-rename-the-column-list-after-aggregation