Calculate quantile on grouped data in a Spark DataFrame

自闭症患者 · 2021-01-02 10:47

I have the following Spark DataFrame:

 +--------+--------------+
 |agent_id|payment_amount|
 +--------+--------------+
 |       a|          1000|
 |       b|          1100|
 |       a|

How can I compute a quantile of payment_amount for each agent_id group?
1 Answer
  • 2021-01-02 11:02

    One solution would be to use percentile_approx:

    >>> test_df.registerTempTable("df")
    >>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")
    
    >>> df2.show()
    # +--------+-----------------+
    # |agent_id|   approxQuantile|
    # +--------+-----------------+
    # |       a|8239.999999999998|
    # |       b|7449.999999999998|
    # +--------+-----------------+ 
    

    Note 1: This solution was tested with Spark 1.6.2 and requires a HiveContext.

    Note 2: the DataFrame method approxQuantile isn't available in PySpark before Spark 2.0.

    Note 3: percentile_approx returns an approximate pth percentile of a numeric column (including floating-point types) in the group. When the number of distinct values in the column is smaller than the accuracy parameter B (the optional third argument), it returns an exact percentile value.
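    As a plain-Python sanity check (not Spark), an exact per-group percentile can be computed with nearest-rank selection on the sorted values. The sample rows below are assumptions for illustration:

```python
from collections import defaultdict
import math

def exact_percentile(values, p):
    """Exact pth percentile using the nearest-rank method."""
    ordered = sorted(values)
    # 1-based rank of the pth percentile under the nearest-rank definition
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

# Hypothetical sample rows: (agent_id, payment_amount)
rows = [("a", 1000), ("b", 1100), ("a", 1200), ("b", 900), ("a", 800)]

# Group the amounts by agent_id
groups = defaultdict(list)
for agent_id, amount in rows:
    groups[agent_id].append(amount)

result = {agent: exact_percentile(amounts, 0.95)
          for agent, amounts in groups.items()}
print(result)  # nearest-rank 95th percentile per group
```

    Comparing this exact result against percentile_approx output is a quick way to confirm the approximation is behaving sensibly on small data.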

    EDIT: From Spark 2.0 onwards, a HiveContext is no longer required.
