Calculate the standard deviation of grouped data in a Spark DataFrame

Asked by 终归单人心, 2020-12-05 05:27

I have user logs that I took from a CSV and converted into a DataFrame in order to leverage the Spark SQL querying features. A single user will create numerous entries

2 Answers
  •  借酒劲吻你
    2020-12-05 05:55

    The accepted code does not compile, as it has a typo (as pointed out by MRez). The snippet below works and is tested.

    For Spark 2.0+:

    import org.apache.spark.sql.functions._

    // Alias the aggregated column, not the input column:
    // .alias(...) goes outside the aggregate function call.
    val _avg_std = df.groupBy("user").agg(
            avg(col("duration")).alias("avg"),
            stddev(col("duration")).alias("stdev"),
            stddev_pop(col("duration")).alias("stdev_pop"),
            stddev_samp(col("duration")).alias("stdev_samp")
            )
    
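    For reference, `stddev_pop` divides the summed squared deviations by n, while `stddev_samp` divides by n − 1 (Bessel's correction); Spark's plain `stddev` is an alias for `stddev_samp`. A minimal pure-Python sketch of both formulas, using made-up duration values (not data from the question):

    ```python
    import math

    # Hypothetical "duration" values for a single user (illustrative only).
    durations = [10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0]

    n = len(durations)
    mean = sum(durations) / n
    ss = sum((x - mean) ** 2 for x in durations)  # sum of squared deviations

    # Population standard deviation: divide by n (Spark's stddev_pop).
    stddev_pop = math.sqrt(ss / n)

    # Sample standard deviation: divide by n - 1
    # (Spark's stddev_samp; plain stddev is an alias for it).
    stddev_samp = math.sqrt(ss / (n - 1))

    print(round(stddev_pop, 4), round(stddev_samp, 4))  # → 4.899 5.2372
    ```

    The sample estimator is what you usually want for log data sampled from a larger population; the two converge as the group size grows.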
