Finding Percentile in Spark-Scala per a group

I am trying to compute a percentile over a column, per group, using a Window function as below. I referred here to use the ApproxQuantile definition over a group.

2 answers

清酒与你

2020-12-06 15:50

I have a solution for you that is extremely inelegant and works only if you have a limited number of possible bucket counts.

My first version is very ugly.

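import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, typedLit, when}
// For the sake of clarity, let's define a function that generates the
// window aggregation (percentile_approx and doBucketing are assumed to be
// the definitions from the question).
def per(x: Int) = percentile_approx(col("count"), typedLit(doBucketing(x)))
                        .over(Window.partitionBy("ID"))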

// Then we simply match the Bucket column against each possible value.
// The 'Bucket symbol syntax needs import spark.implicits._
val res = df1
    .join(idBucketMapping, Seq("ID"))
    .withColumn("percentile", when('Bucket === 2, per(2))
                     .otherwise(when('Bucket === 3, per(3))
                     .otherwise(per(4)))
    )

That's nasty, but it works in your case. A slightly less ugly version with the very same logic: define the set of possible bucket counts and fold over it to build the same kind of when/otherwise chain as above.

val possible_number_of_buckets = 2 to 5

val res = df1
    .join(idBucketMapping, Seq("ID"))
    .withColumn("percentile", possible_number_of_buckets
                .tail
                .foldLeft(per(possible_number_of_buckets.head))
                         ((column, size) => when('Bucket === size, per(size))
                                              .otherwise(column)))
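
To see what the fold above produces, here is a minimal plain-Scala sketch that builds the same nesting with strings in place of Spark Column objects. The string templates are only stand-ins for illustration; they are not Spark API calls.

```scala
// Same fold as above, but over strings so the resulting nesting can be printed.
// The strings merely mimic the when/otherwise calls; no Spark is involved.
val possibleNumberOfBuckets = 2 to 5

val nested: String = possibleNumberOfBuckets.tail
  .foldLeft(s"per(${possibleNumberOfBuckets.head})") { (acc, size) =>
    s"when(Bucket === $size, per($size)).otherwise($acc)"
  }

println(nested)
```

Each step wraps the previous expression, so the largest bucket count is tested first and per(2) ends up as the innermost fallback.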

再見小時候

2020-12-06 16:03

percentile_approx takes a percentage and an accuracy. It seems they must both be constant literals, so we can't call percentile_approx with a percentage and accuracy that are calculated dynamically at runtime.

Ref: Apache Spark Git, percentile_approx source
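
Since the percentage has to be a constant, one possible workaround (not part of the answer above, only a sketch) is to compute the cut values in ordinary Scala on the values gathered for each group, where the bucket count can vary freely. The nearest-rank rule, the cutValues helper, and the sample data below are all assumptions made for illustration. The results are exact rather than approximate, so this would only suit groups small enough to sort in memory.

```scala
// Cut values splitting a non-empty group into `buckets` parts, using the
// nearest-rank rule with integer arithmetic (an assumed definition).
def cutValues(values: Seq[Long], buckets: Int): Vector[Long] = {
  require(values.nonEmpty && buckets >= 1)
  val sorted = values.sorted.toVector
  (1 until buckets).map { k =>
    // ceil(k * n / buckets) as a 1-based rank into the sorted values
    val rank = (k * sorted.length + buckets - 1) / buckets
    sorted(rank - 1)
  }.toVector
}

// Hypothetical sample data: counts per ID and bucket count per ID.
val countsById: Map[String, Seq[Long]] = Map(
  "a" -> Seq(4L, 1L, 3L, 2L),
  "b" -> Seq(10L, 20L, 30L, 40L, 50L, 60L)
)
val bucketsById: Map[String, Int] = Map("a" -> 2, "b" -> 3)

// Each ID gets as many cut values as its own bucket count requires.
val percentiles: Map[String, Vector[Long]] =
  countsById.map { case (id, values) => id -> cutValues(values, bucketsById(id)) }

percentiles.foreach { case (id, cuts) => println(s"$id -> $cuts") }
```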