Question
I am new to Spark and I have some questions about the aggregation functions MAX and MIN in SparkSQL.

In SparkSQL, when I use the MAX / MIN function, only MAX(value) / MIN(value) is returned. But what if I also want the corresponding values of the other columns?

For example, given a DataFrame with columns time, value, and label, how can I get the time of the MIN(value), grouped by label?

Thanks.
Answer 1:
You first need to do a groupBy, and then join that back to the original DataFrame. In Scala, it looks like this:
import org.apache.spark.sql.functions.min

df.join(
  // per-label minimum; rename "label" so the join columns are unambiguous
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show
I don't use Python, but the PySpark version would look very similar.
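For a self-contained run, here is a minimal sketch with made-up sample data. The SparkSession setup and the sample rows are assumptions for illustration; only the column names time, value, and label come from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

val spark = SparkSession.builder.master("local[*]").appName("min-per-label").getOrCreate()
import spark.implicits._

// Hypothetical data: two labels, each with a distinct minimum value.
val df = Seq(
  (1L, 10.0, "a"), (2L, 3.0, "a"),
  (3L, 7.0, "b"), (4L, 9.0, "b")
).toDF("time", "value", "label")

df.join(
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show()
// Expected rows (order may vary): (2, 3.0, "a") and (3, 7.0, "b")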
You can even do max() and min() in one pass:
import org.apache.spark.sql.functions.{min, max}

df.join(
  df.groupBy($"label")
    .agg(min($"value") as "min_value", max($"value") as "max_value")
    .withColumnRenamed("label", "r_label"),
  // a row survives the join if it carries either the per-label min or the max
  $"r_label" === $"label" && ($"min_value" === $"value" || $"max_value" === $"value")
).drop("r_label")
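As a side note, a window function is a common alternative to the self-join for this kind of per-group extreme. This is a sketch of that approach, not part of the original answer; it assumes the same df and that spark.implicits._ is in scope:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min

// Attach the per-label minimum to every row, then keep only the rows
// that attain it; no self-join is needed.
val byLabel = Window.partitionBy($"label")
df.withColumn("min_value", min($"value").over(byLabel))
  .where($"value" === $"min_value")
  .drop("min_value")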
Answer 2:
You can key the rows by value and use sortByKey(true) to sort in ascending order, then apply the action take(1) to get the MIN. Use sortByKey(false) to sort in descending order, then take(1) to get the MAX.

If you want to do it the Spark SQL way, you can follow the approach explained by @maxymoo.
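A minimal sketch of this RDD-based idea; the tuple layout and the sc setup are assumptions, not from the answer. Note that this yields the global extreme across the whole RDD, not one per label:

// Hypothetical RDD of (time, value, label) rows.
val rows = sc.parallelize(Seq((1L, 10.0, "a"), (2L, 3.0, "a"), (3L, 7.0, "b")))

// Key by value so sortByKey orders the rows by it.
val byValue = rows.map { case (time, value, label) => (value, (time, label)) }

byValue.sortByKey(ascending = true).take(1)   // row with the MIN value
byValue.sortByKey(ascending = false).take(1)  // row with the MAX value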
Source: https://stackoverflow.com/questions/36050923/max-and-min-of-spark