Calculate the standard deviation of grouped data in a Spark DataFrame

Asked by 终归单人心, 2020-12-05 05:27

I have user logs that I took from a CSV and converted into a DataFrame in order to leverage the Spark SQL querying features. A single user will create numerous entries

2 Answers
  •  借酒劲吻你
    2020-12-05 05:55

    The accepted code does not compile, as it has a typo (as pointed out by MRez). The snippet below works and is tested.

    For Spark 2.0+:

    import org.apache.spark.sql.functions._

    // Alias the aggregated column, not the input column:
    // .alias(...) goes outside the aggregate function call.
    val _avg_std = df.groupBy("user").agg(
            avg(col("duration")).alias("avg"),
            stddev(col("duration")).alias("stdev"),
            stddev_pop(col("duration")).alias("stdev_pop"),
            stddev_samp(col("duration")).alias("stdev_samp")
            )
    
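    For reference, `stddev_pop` divides the summed squared deviations by n, while `stddev_samp` divides by n − 1 (Bessel's correction); Spark's plain `stddev` is an alias for `stddev_samp`. A minimal pure-Python sketch of both formulas, using made-up duration values (not data from the question):

    ```python
    import math

    # Hypothetical "duration" values for a single user (illustrative only).
    durations = [10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0]

    n = len(durations)
    mean = sum(durations) / n
    ss = sum((x - mean) ** 2 for x in durations)  # sum of squared deviations

    # Population standard deviation: divide by n (Spark's stddev_pop).
    stddev_pop = math.sqrt(ss / n)

    # Sample standard deviation: divide by n - 1
    # (Spark's stddev_samp; plain stddev is an alias for it).
    stddev_samp = math.sqrt(ss / (n - 1))

    print(round(stddev_pop, 4), round(stddev_samp, 4))  # → 4.899 5.2372
    ```

    The sample estimator is what you usually want for log data sampled from a larger population; the two converge as the group size grows.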
