I am trying to compute a percentile over a column using a Window function, as below. I referred here for how to apply the ApproxQuantile definition over a group.
I have a solution for you that is very inelegant and works only if you have a limited number of possible bucket counts.
My first version is very ugly.
// For the sake of clarity, let's define a function that generates the
// window aggregation.
// Note: doBucketing is not shown in this snippet and is assumed to be in scope,
// as is a percentile_approx that accepts (column, percentage).
def per(x: Int) = percentile_approx(col("count"), typedLit(doBucketing(x)))
  .over(Window.partitionBy("ID"))

// Then we simply match the Bucket column against each possible value.
val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile", when('Bucket === 2, per(2))
    .otherwise(when('Bucket === 3, per(3))
      .otherwise(per(4)))
  )
That's nasty, but it works in your case. A slightly less ugly version with the very same logic: define the set of possible bucket counts and fold over it to build the same when/otherwise chain as above.
val possible_number_of_buckets = 2 to 5
val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile", possible_number_of_buckets
    .tail
    .foldLeft(per(possible_number_of_buckets.head))(
      (column, size) => when('Bucket === size, per(size)).otherwise(column)))
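For anyone who wants to try the fold version end to end, here is a minimal, self-contained sketch. It assumes Spark 3.1+, where the built-in functions.percentile_approx takes an explicit accuracy argument. The sample data, the doBucketing stand-in (n buckets mapped to the percentages 1/n, 2/n, ..., 1.0), the object name and the percentileFor helper are all hypothetical and not taken from the code above.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, percentile_approx, typedLit, when}

object BucketedPercentiles {
  // Hypothetical stand-in for doBucketing: n buckets -> 1/n, 2/n, ..., 1.0
  def doBucketing(n: Int): Seq[Double] = (1 to n).map(_.toDouble / n)

  // One window aggregate per bucket count. The percentages and the accuracy
  // (10000 here, an assumed value) are passed as literals.
  def per(n: Int): Column =
    percentile_approx(col("count"), typedLit(doBucketing(n)), lit(10000))
      .over(Window.partitionBy("ID"))

  // Fold the candidate bucket counts into one when/otherwise chain.
  // The first candidate acts as the fallback branch.
  def percentileFor(sizes: Seq[Int]): Column =
    sizes.tail.foldLeft(per(sizes.head)) { (acc, size) =>
      when(col("Bucket") === size, per(size)).otherwise(acc)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bucketed-percentiles")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: counts per ID, plus the bucket count wanted per ID.
    val df1 = Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4),
                  ("b", 10), ("b", 20), ("b", 30)).toDF("ID", "count")
    val idBucketMapping = Seq(("a", 2), ("b", 3)).toDF("ID", "Bucket")

    df1.join(idBucketMapping, Seq("ID"))
      .withColumn("percentile", percentileFor(2 to 5))
      .show(false)

    spark.stop()
  }
}
```

Note that Spark still evaluates one window aggregate per candidate bucket count for every row, so this only stays reasonable while the set of candidates is small.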
percentile_approx takes a percentage and an accuracy. It seems that both must be constant literals. Thus we can't compute percentile_approx at runtime with a dynamically calculated percentage and accuracy.
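A quick way to check the literal constraint yourself is sketched below. It assumes Spark 3.1+ and the built-in functions.percentile_approx. The data, the column p, the object name and the accepts helper are hypothetical. The idea is simply to try the same window aggregate once with a literal percentage and once with a percentage read from a column, and see which one Spark is willing to analyze and run.

```scala
import scala.util.Try
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, percentile_approx}

object LiteralPercentageCheck {
  // True if Spark can analyze and execute percentile_approx with the given
  // percentage expression; any failure (e.g. an AnalysisException) gives false.
  def accepts(df: DataFrame, percentage: Column): Boolean =
    Try {
      df.withColumn(
        "percentile",
        percentile_approx(col("count"), percentage, lit(10000))
          .over(Window.partitionBy("ID"))
      ).collect()
    }.isSuccess

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("literal-percentage-check")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: p is a per-row percentage we would like to use.
    val df = Seq(("a", 1, 0.5), ("a", 2, 0.5), ("b", 10, 0.25))
      .toDF("ID", "count", "p")

    println(s"literal percentage accepted: ${accepts(df, lit(0.5))}")
    println(s"column percentage accepted:  ${accepts(df, col("p"))}")

    spark.stop()
  }
}
```

If the constraint holds on your Spark version, the column-based call should be rejected while the literal one goes through, which is why the workaround is to enumerate the possible literal values up front.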
Ref: Apache Spark Git repository, percentile_approx source.