Question
I am using spark-sql-2.4.1v with Java 8. I need to calculate percentiles such as 25, 75, and 90 for some given data. I tried using percentile_approx() from Spark SQL to do this, but the results of percentile_approx() do not match the fractional percentiles that an Excel sheet produces with PERCENTILE.INC(). Hence, I'm wondering how to fix or adjust percentile_approx(). Is there any way to override it, or to write a custom function based on percentile_approx(), so that fractional percentiles are calculated correctly?
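A likely explanation for the mismatch (an assumption about percentile_approx's behaviour, not a statement from Spark's documentation): percentile_approx() returns an actual element of the column rather than interpolating between neighbours, while PERCENTILE.INC() interpolates at fractional ranks. A minimal Python sketch of the two behaviours on two values, 4 and 6:

```python
data = sorted([4, 6])
p = 0.25

# percentile_approx-style (sketch): return an existing element at/near rank p;
# the exact rank rule inside Spark's sketch algorithm differs in detail.
approx = data[min(int(p * len(data)), len(data) - 1)]

# PERCENTILE.INC-style: interpolate at the fractional rank h = p * (n - 1).
h = p * (len(data) - 1)
lo = int(h)
inc = data[lo] + (h - lo) * (data[lo + 1] - data[lo]) if lo + 1 < len(data) else data[lo]

print(approx, inc)  # 4 vs 4.5 -- the interpolated value is what Excel reports
```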
Given dataset:

import spark.implicits._ // needed for toDF and the $ column syntax

val df = Seq(
  (10, "1/15/2018", 0.010680705, 10, 0.619875458, "east"),
  (10, "1/15/2018", 0.006628853,  4, 0.16039063,  "west"),
  (10, "1/15/2018", 0.01378215,  20, 0.082049528, "east"),
  (10, "1/15/2018", 0.810680705,  6, 0.819875458, "west"),
  (10, "1/15/2018", 0.702228853, 30, 0.916039063, "east")
).toDF("id", "date", "revenue", "con_dist_1", "con_dist_2", "zone")

val percentiles = Seq(0.25, 0.75, 0.90) // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2") // The columns to use
I need to calculate the given percentiles for each zone for the given columns. How can this be achieved?
Expected results:
+---+---------+-----------+----+------------+--------------+--------------+-------------+
| id| date| revenue|zone|perctile_col| quantile_0.25| quantile_0.75| quantile_0.9|
+---+---------+-----------+----+------------+--------------+--------------+-------------+
| 10|1/15/2018|0.006628853|west| con_dist_1| 4.5| 5.5| 5.8|
| 10|1/15/2018|0.010680705|west| con_dist_1| 4.5| 5.5| 5.8|
| 10|1/15/2018|0.010680705|east| con_dist_1| 15| 25| 28.0|
| 10|1/15/2018| 0.01378215|east| con_dist_1| 15| 25| 28.0|
| 10|1/15/2018|0.006628853|east| con_dist_1| 15| 25| 28.0|
| 10|1/15/2018|0.006628853|west| con_dist_2| 0.325261837| 0.655004251| 0.7539269752|
| 10|1/15/2018|0.010680705|west| con_dist_2| 0.325261837| 0.655004251| 0.7539269752|
| 10|1/15/2018|0.010680705|east| con_dist_2| 0.350962493| 0.4990442955| 0.749241156|
| 10|1/15/2018| 0.01378215|east| con_dist_2| 0.350962493| 0.4990442955| 0.749241156|
| 10|1/15/2018|0.006628853|east| con_dist_2| 0.350962493| 0.4990442955| 0.749241156|
+---+---------+-----------+----+------------+--------------+--------------+-------------+
You can verify the results with "definition 2" on this page: https://www.translatorscafe.com/unit-converter/en-US/calculator/percentile/
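"Definition 2" on that page is the same rule as Excel's PERCENTILE.INC: compute the fractional rank h = p * (n - 1) over the sorted values and interpolate linearly between the two neighbours. A small Python sketch (the function name is mine) that reproduces the expected east-zone numbers for con_dist_1:

```python
def percentile_inc(values, p):
    """Excel PERCENTILE.INC / 'definition 2': fractional rank h = p * (n - 1),
    then linear interpolation between the two nearest sorted values."""
    xs = sorted(values)
    h = p * (len(xs) - 1)
    lo = int(h)                      # index of the lower neighbour
    if lo + 1 >= len(xs):
        return float(xs[lo])
    return xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])

# east-zone con_dist_1 values from the dataset above:
print([percentile_inc([10, 20, 30], p) for p in (0.25, 0.75, 0.90)])
# [15.0, 25.0, 28.0] -- matching the expected results table
```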
Answer 1:
A naive way of solving this in Spark is to manually find the two values whose ranks are closest to the specified percentile and then interpolate between them. In Scala this can be done as follows.
First, we compute a zero-based rank for each row within its zone and divide by the group's maximum rank, giving a normalized rank between 0 and 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows by date within each zone, then scale by the group's maximum rank.
val w = Window.partitionBy($"zone").orderBy($"date")
val df_zone = df.withColumn("zone_rn", row_number().over(w) - 1)
  .withColumn("zone_rn", $"zone_rn" / max($"zone_rn").over(w))
This gives:
+---+---------+-----------+----------+-----------+----+-------+
|id |date |revenue |con_dist_1|con_dist_2 |zone|zone_rn|
+---+---------+-----------+----------+-----------+----+-------+
|10 |1/15/2018|0.006628853|4 |0.16039063 |west|0.0 |
|10 |1/15/2018|0.810680705|6 |0.819875458|west|1.0 |
|10 |1/15/2018|0.010680705|10 |0.619875458|east|0.0 |
|10 |1/15/2018|0.01378215 |20 |0.082049528|east|0.5 |
|10 |1/15/2018|0.702228853|30 |0.916039063|east|1.0 |
+---+---------+-----------+----------+-----------+----+-------+
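The zone_rn values above follow directly from (row_number() - 1) divided by the group's maximum rank; a quick plain-Python check of the normalization (lists standing in for the window, helper name is mine):

```python
def normalized_ranks(n_rows):
    # (row_number() - 1) scaled by the group's maximum rank, as in df_zone
    return [i / (n_rows - 1) for i in range(n_rows)]

print(normalized_ranks(2))  # west -> [0.0, 1.0]
print(normalized_ranks(3))  # east -> [0.0, 0.5, 1.0]
```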
We then loop over all the columns of interest and, for each one, do a foldLeft over the percentiles. Each fold step finds the closest normalized ranks below and above the percentile (zone_lower and zone_upper) along with the corresponding column values (lower_val and upper_val), computes the fraction between the two ranks, and obtains the quantile value by adding fraction * (upper_val - lower_val) to the lower bound. Finally, since we looped over the columns, reduce(_.union(_)) brings everything back into a single dataframe.
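The lower-bound/upper-bound/fraction logic can be mirrored in a few lines of plain Python for one group and one column (a sketch; the helper name is mine, and values are taken in the row order the window produces, i.e. ranked by date):

```python
def interpolate(values, p):
    # Normalized ranks play the role of the zone_rn column.
    n = len(values)
    ranks = [i / (n - 1) for i in range(n)]
    lower = max(r for r in ranks if r <= p)     # zone_lower
    upper = min(r for r in ranks if r >= p)     # zone_upper
    lower_val = values[ranks.index(lower)]      # lower_val
    upper_val = values[ranks.index(upper)]      # upper_val
    if upper == lower:                          # p falls exactly on a rank
        return float(lower_val)
    fraction = (p - lower) / (upper - lower)
    return lower_val + fraction * (upper_val - lower_val)

print(interpolate([10, 20, 30], 0.9))  # east con_dist_1 -> 28.0
```

One caveat worth noting: when p lands exactly on a normalized rank (e.g. p = 0.25 with five rows per zone), zone_upper equals zone_lower and the Scala fraction becomes 0/0; the guard above handles that case by returning the exact value.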
val percentiles = Seq(0.25, 0.75, 0.90) // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2") // The columns to use

val df_percentiles = cols.map { c =>
  percentiles.foldLeft(df_zone) { case (df, p) =>
    df.withColumn("perctile_col", lit(c))
      .withColumn("zone_lower", max(when($"zone_rn" <= p, $"zone_rn")).over(w))
      .withColumn("zone_upper", min(when($"zone_rn" >= p, $"zone_rn")).over(w))
      .withColumn("lower_val", max(when($"zone_lower" === $"zone_rn", col(c))).over(w))
      .withColumn("upper_val", min(when($"zone_upper" === $"zone_rn", col(c))).over(w))
      .withColumn("fraction", (lit(p) - $"zone_lower") / ($"zone_upper" - $"zone_lower"))
      .withColumn(s"quantile_$p", $"lower_val" + $"fraction" * ($"upper_val" - $"lower_val"))
  }
  .drop((cols ++ Seq("zone_rn", "zone_lower", "zone_upper", "lower_val", "upper_val", "fraction")): _*)
}.reduce(_.union(_))
Result:
+---+---------+-----------+----+------------+-------------+------------------+------------------+
| id| date| revenue|zone|perctile_col|quantile_0.25| quantile_0.75| quantile_0.9|
+---+---------+-----------+----+------------+-------------+------------------+------------------+
| 10|1/15/2018|0.006628853|west| con_dist_1| 4.5| 5.5| 5.8|
| 10|1/15/2018|0.810680705|west| con_dist_1| 4.5| 5.5| 5.8|
| 10|1/15/2018|0.010680705|east| con_dist_1| 15.0| 25.0| 28.0|
| 10|1/15/2018| 0.01378215|east| con_dist_1| 15.0| 25.0| 28.0|
| 10|1/15/2018|0.702228853|east| con_dist_1| 15.0| 25.0| 28.0|
| 10|1/15/2018|0.006628853|west| con_dist_2| 0.325261837|0.6550042509999999| 0.7539269752|
| 10|1/15/2018|0.810680705|west| con_dist_2| 0.325261837|0.6550042509999999| 0.7539269752|
| 10|1/15/2018|0.010680705|east| con_dist_2| 0.350962493| 0.4990442955|0.7492411560000001|
| 10|1/15/2018| 0.01378215|east| con_dist_2| 0.350962493| 0.4990442955|0.7492411560000001|
| 10|1/15/2018|0.702228853|east| con_dist_2| 0.350962493| 0.4990442955|0.7492411560000001|
+---+---------+-----------+----+------------+-------------+------------------+------------------+
Source: https://stackoverflow.com/questions/61155679/how-to-write-custom-function-from-percentile-approx-code-which-gives-as-equal-re