Max and Min of Spark [duplicate]

Submitted by 烂漫一生 on 2019-12-08 12:58:12

Question


I am new to Spark and I have some questions about the aggregation functions MAX and MIN in Spark SQL.

In Spark SQL, when I use the MAX / MIN functions, only MAX(value) / MIN(value) is returned. But what if I also want the other corresponding columns?

For example: given a DataFrame with columns time, value and label, how can I get the time corresponding to MIN(value), grouped by label?
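For concreteness, a hypothetical dataset might look like this (the column names match the question; the data itself is made up for illustration, and an active SparkSession named spark is assumed):

import spark.implicits._

// Hypothetical sample data with the columns described above
val df = Seq(
  (1L, 3.0, "a"),
  (2L, 1.0, "a"),
  (3L, 2.0, "b")
).toDF("time", "value", "label")

// Desired result: the time of the row holding MIN(value) per label,
// i.e. time = 2 for label "a" and time = 3 for label "b"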

Thanks.


Answer 1:


You first need to do a groupBy, and then join the result back to the original DataFrame. In Scala, it looks like this:

import org.apache.spark.sql.functions.{min, max}

// Compute the per-label minimum, rename the grouping column to avoid
// ambiguity in the join condition, then join back to keep the full rows.
df.join(
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show

I don't use Python, but it would look close to the above.

You can even do max() and min() in one pass:

// Same idea, computing both aggregates at once: a row survives the join
// if its value matches either the min or the max for its label.
df.join(
  df.groupBy($"label")
    .agg(min($"value") as "min_value", max($"value") as "max_value")
    .withColumnRenamed("label", "r_label"),
  $"r_label" === $"label" && ($"min_value" === $"value" || $"max_value" === $"value")
).drop("r_label")



Answer 2:


You can use sortByKey(true) to sort in ascending order and then apply the action take(1) to get the Min.

And use sortByKey(false) to sort in descending order and then apply take(1) to get the Max.

Note that sortByKey is only defined on pair RDDs, so you need to key the RDD by the value column first.
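A minimal sketch of that idea in Scala, reusing the df from the question (the pair-RDD keying and variable names are my own assumptions, not part of the original answer; like the answer, this finds the global min/max rather than per-label values):

// Key a pair RDD by value so sortByKey can order on it
val byValue = df.rdd.map(r => (r.getDouble(1), r.getLong(0)))  // (value, time)

val minPair = byValue.sortByKey(ascending = true).take(1)   // smallest value first
val maxPair = byValue.sortByKey(ascending = false).take(1)  // largest value first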

If you want to do it the Spark SQL way, you can follow the approach explained by @maxymoo.
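For reference, the join-back approach from Answer 1 can also be written in raw Spark SQL. A sketch, assuming df has been registered as a temporary view (the view and alias names are made up):

df.createOrReplaceTempView("t")

spark.sql("""
  SELECT t.time, t.value, t.label
  FROM t
  JOIN (SELECT label, MIN(value) AS min_value
        FROM t GROUP BY label) m
    ON t.label = m.label AND t.value = m.min_value
""").show()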



Source: https://stackoverflow.com/questions/36050923/max-and-min-of-spark
