Finding Percentile in Spark-Scala per group

Asked 2020-12-06 15:29

I am trying to compute a percentile over a column using a Window function, as below. I have referred here to use the ApproxQuantile-based percentile_approx definition over a group.
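(As a minimal sketch of the setup the answers below work with: the names df1, idBucketMapping, Bucket, doBucketing and percentile_approx are taken from the answers; the schema and the body of doBucketing are assumptions for illustration only.)

    // assumed schema: df1(ID, count), idBucketMapping(ID, Bucket)
    // hypothetical doBucketing: maps a bucket count n to its n-quantile
    // percentages, e.g. doBucketing(4) == Array(0.25, 0.5, 0.75)
    def doBucketing(buckets: Int): Array[Double] =
        (1 until buckets).map(_.toDouble / buckets).toArray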



        
2 Answers
  • 2020-12-06 15:50

    I have a solution for you that is extremely inelegant and works only if you have a limited number of possible bucket counts.

    My first version is very ugly.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._  // for the 'Bucket column syntax
    
    // for the sake of clarity, let's define a function that generates the
    // window aggregation (percentile_approx and doBucketing are the
    // definitions from your question)
    def per(x: Int) = percentile_approx(col("count"), typedLit(doBucketing(x)))
                          .over(Window.partitionBy("ID"))
    
    // then, we simply try to match the Bucket column with a possible value
    val res = df1
        .join(idBucketMapping, Seq("ID"))
        .withColumn("percentile", when('Bucket === 2, per(2))
                         .otherwise(when('Bucket === 3, per(3))
                         .otherwise(per(4)))
        )
    

    That's nasty, but it works in your case. Slightly less ugly, but with the very same logic: you can define the set of possible numbers of buckets and fold over it to do the same thing as above.

    val possible_number_of_buckets = 2 to 5
    
    val res = df1
        .join(idBucketMapping, Seq("ID"))
        // fold over the candidate bucket counts, using the first one as
        // the default branch of the generated when/otherwise chain
        .withColumn("percentile", possible_number_of_buckets
                    .tail
                    .foldLeft(per(possible_number_of_buckets.head))(
                        (column, size) => when('Bucket === size, per(size))
                                              .otherwise(column)))
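
    For the range 2 to 5, the fold unrolls to the same kind of nested chain as the first version, just built mechanically:

        when('Bucket === 5, per(5))
            .otherwise(when('Bucket === 4, per(4))
            .otherwise(when('Bucket === 3, per(3))
            .otherwise(per(2))))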
    
  • 2020-12-06 16:03

    percentile_approx takes a percentage and an accuracy. It seems they both must be constant literals, so we can't compute percentile_approx at runtime with a dynamically calculated percentage or accuracy.
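
    As a quick illustration (my own sketch, not from this answer; assuming a DataFrame df with columns ID, count and pct), the built-in SQL percentile_approx accepts a literal percentage but rejects one read from a column:

        import org.apache.spark.sql.functions.expr

        // constant literal percentage and accuracy: analyzes fine
        df.groupBy("ID")
          .agg(expr("percentile_approx(count, array(0.25, 0.5, 0.75), 10000)"))

        // percentage taken from the pct column: fails analysis with an
        // error along the lines of "must be a constant literal"
        df.groupBy("ID")
          .agg(expr("percentile_approx(count, pct, 10000)"))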

    Ref: the percentile_approx source in the Apache Spark repository (ApproximatePercentile.scala).
