Question
I am new to Spark and I have some questions about the aggregation functions MAX and MIN in SparkSQL.

In SparkSQL, when I use the MAX / MIN function, only MAX(value) / MIN(value) is returned. But what if I also want the corresponding values of the other columns?

For example, given a DataFrame with columns time, value, and label, how can I get the time of the MIN(value), grouped by label?

Thanks.
Answer 1:
You first need to do a groupBy, and then join that back to the original DataFrame. In Scala, it looks like this:
import org.apache.spark.sql.functions.min

df.join(
  // per-label minimum; rename "label" so the join columns are unambiguous
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show
I don't use Python, but the PySpark version would look very similar.
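For a self-contained run, here is a minimal sketch with made-up sample data. The SparkSession setup and the sample rows are assumptions for illustration; only the column names time, value, and label come from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

val spark = SparkSession.builder.master("local[*]").appName("min-per-label").getOrCreate()
import spark.implicits._

// Hypothetical data: two labels, each with a distinct minimum value.
val df = Seq(
  (1L, 10.0, "a"), (2L, 3.0, "a"),
  (3L, 7.0, "b"), (4L, 9.0, "b")
).toDF("time", "value", "label")

df.join(
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show()
// Expected rows (order may vary): (2, 3.0, "a") and (3, 7.0, "b")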
You can even do max() and min() in one pass:
import org.apache.spark.sql.functions.{min, max}

df.join(
  df.groupBy($"label")
    .agg(min($"value") as "min_value", max($"value") as "max_value")
    .withColumnRenamed("label", "r_label"),
  // a row survives the join if it carries either the per-label min or the max
  $"r_label" === $"label" && ($"min_value" === $"value" || $"max_value" === $"value")
).drop("r_label")
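As a side note, a window function is a common alternative to the self-join for this kind of per-group extreme. This is a sketch of that approach, not part of the original answer; it assumes the same df and that spark.implicits._ is in scope:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min

// Attach the per-label minimum to every row, then keep only the rows
// that attain it; no self-join is needed.
val byLabel = Window.partitionBy($"label")
df.withColumn("min_value", min($"value").over(byLabel))
  .where($"value" === $"min_value")
  .drop("min_value")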
Answer 2:
You can key the rows by value and use sortByKey(true) to sort in ascending order, then apply the action take(1) to get the MIN. Use sortByKey(false) to sort in descending order, then take(1) to get the MAX.

If you want to do it the Spark SQL way, you can follow the approach explained by @maxymoo.
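A minimal sketch of this RDD-based idea; the tuple layout and the sc setup are assumptions, not from the answer. Note that this yields the global extreme across the whole RDD, not one per label:

// Hypothetical RDD of (time, value, label) rows.
val rows = sc.parallelize(Seq((1L, 10.0, "a"), (2L, 3.0, "a"), (3L, 7.0, "b")))

// Key by value so sortByKey orders the rows by it.
val byValue = rows.map { case (time, value, label) => (value, (time, label)) }

byValue.sortByKey(ascending = true).take(1)   // row with the MIN value
byValue.sortByKey(ascending = false).take(1)  // row with the MAX value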
Source: https://stackoverflow.com/questions/36050923/max-and-min-of-spark