Question
I'm trying to compute the largest value of the following DataFrame in Spark 1.6.1:
val df = sc.parallelize(Seq(1,2,3)).toDF("id")
A first approach would be to select the maximum value, and it works as expected:
df.select(max($"id")).show
The second approach could be to use withColumn as follows:
df.withColumn("max", max($"id")).show
But unfortunately it fails with the following error message:
org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
How can I compute the maximum value in a withColumn function without any Window or groupBy? If that is not possible, how can I do it in this specific case using a Window?
Answer 1:
The right approach is to compute the aggregate as a separate query and combine it with the actual result. Unlike the window functions suggested in many answers here, it won't require a shuffle to a single partition and will be applicable to large datasets.
It could be done with withColumn using a separate action:
import org.apache.spark.sql.functions.{lit, max}
df.withColumn("max", lit(df.agg(max($"id")).as[Int].first))
but it is much cleaner to use either an explicit cross join:
import org.apache.spark.sql.functions.broadcast
df.crossJoin(broadcast(df.agg(max($"id") as "max")))
or an implicit cross join:
spark.conf.set("spark.sql.crossJoin.enabled", true)
df.join(broadcast(df.agg(max($"id") as "max")))
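For the 3-row example DataFrame, all of the variants above should produce every row paired with the global maximum, roughly (row order may differ):
+---+---+
| id|max|
+---+---+
|  1|  3|
|  2|  3|
|  3|  3|
+---+---+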
Answer 2:
There are a few categories of functions in Apache Spark (one example of each is sketched after this list):
- Aggregate functions, e.g. max, used when we want to aggregate multiple rows into one
- Non-aggregate functions, e.g. abs, isnull, used when we want to transform one column into another
- Collection functions, e.g. explode, used when one row expands into multiple rows
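A minimal sketch showing one function from each category on the example column id:
import org.apache.spark.sql.functions.{abs, array, explode, max}
df.select(max($"id")).show                    // aggregate: 3 rows collapse into 1
df.select(abs($"id")).show                    // non-aggregate: one output value per row
df.select(explode(array($"id", $"id"))).show  // collection: each row expands into 2 rows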
Implicit aggregation
Aggregate functions are used when we want to aggregate multiple rows into one.
The following code performs an aggregation internally:
df.select(max($"id")).explain
== Physical Plan ==
*HashAggregate(keys=[], functions=[max(id#3)])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_max(id#3)])
+- *Project [value#1 AS id#3]
+- Scan ExistingRDD[value#1]
We can also use multiple aggregate functions in a single select:
df.select(max($"id"), min($"id")).explain
Aggregate functions cannot be mixed with non-aggregate functions directly
The following code will report an error:
df.select(max($"id"), $"id")
df.withColumn("max", max($"id"))
Because max($"id")
has few values then $"id"
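A quick way to see the cardinality mismatch on the example DataFrame:
df.select($"id").count        // 3: one value per input row
df.select(max($"id")).count   // 1: the aggregate collapses all rows into one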
Aggregate with over
In this case the analytic function is applied and presented for all rows in the result set.
We can use
df.select(max($"id").over, $"id").show
Or
df.withColumn("max", max($"id").over).show
Answer 3:
This is Spark 2.0. With withColumn and window functions it could be as follows:
df.withColumn("max", max('id) over)
Note the empty over, which assumes an "empty" window (and is equivalent to over()).
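For clarity, the same expression written with explicit parentheses:
df.withColumn("max", max('id).over()).show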
If, however, you need a more complete WindowSpec, you can do the following (again, this is 2.0):
import org.apache.spark.sql.expressions._
// the trick that has performance cost (!)
val window = Window.orderBy()
df.withColumn("max", max('id) over window).show
Please note that the code has a serious performance issue as reported by Spark itself:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Source: https://stackoverflow.com/questions/40759937/how-to-compute-the-largest-value-in-a-column-using-withcolumn