How to compute percentiles in Apache Spark

遥遥无期 2020-12-04 22:08

I have an RDD of integers (i.e. RDD[Int]) and I would like to compute the following eleven percentiles: [0th, 10th, 20th, ..., 90th, 100th]

10 answers
  •  南笙 (OP)
    2020-12-04 22:48

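    You can:

    1. Sort the dataset via rdd.sortBy()
    2. Compute the size of the dataset via rdd.count()
    3. Zip with index via rdd.zipWithIndex() and swap each pair, so that the index becomes the key
    4. Retrieve the desired percentile via rdd.lookup(), e.g. for the 10th percentile rdd.lookup((long) (0.1 * size)). Clamp the index to size - 1 so that the 100th percentile returns the maximum instead of an out-of-range index.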

    To compute the median and the 99th percentile: getPercentiles(rdd, new double[]{0.5, 0.99}, size, numPartitions);

    In Java 8:

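    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    public static double[] getPercentiles(JavaRDD<Double> rdd, double[] percentiles, long rddSize, int numPartitions) {
        double[] values = new double[percentiles.length];

        // Sort ascending, then key each value by its position in the sorted order.
        JavaRDD<Double> sorted = rdd.sortBy((Double d) -> d, true, numPartitions);
        JavaPairRDD<Long, Double> indexed = sorted.zipWithIndex().mapToPair((Tuple2<Double, Long> t) -> t.swap());

        for (int i = 0; i < percentiles.length; i++) {
            double percentile = percentiles[i];
            // Clamp so that percentile 1.0 maps to the last element rather than an out-of-range index.
            long id = Math.min((long) (rddSize * percentile), rddSize - 1);
            values[i] = indexed.lookup(id).get(0);
        }

        return values;
    }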
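
    To see which positions the rank rule (long) (rddSize * percentile) selects for the percentiles asked about in the question (0th, 10th, ..., 100th), here is a minimal local sketch in plain Java, with no Spark involved. The class name PercentileRanks, the helper names and the sample data 1..20 are hypothetical. The local sort and array index only stand in for sortBy, zipWithIndex and lookup, and clamping the rank to size - 1 is an assumption about how the 100th percentile should be handled.

```java
import java.util.Arrays;

class PercentileRanks {
    // Rank rule from getPercentiles, clamped (assumption) so that 1.0 maps to the last element.
    static long rank(double percentile, long size) {
        return Math.min((long) (size * percentile), size - 1);
    }

    // Local stand-in for sortBy + zipWithIndex + lookup on a small array.
    static double[] percentiles(double[] data, double[] fractions) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[] out = new double[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            out[i] = sorted[(int) rank(fractions[i], sorted.length)];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the values 20, 19, ..., 1 in unsorted order.
        double[] data = new double[20];
        for (int i = 0; i < data.length; i++) {
            data[i] = 20 - i;
        }
        // Fractions 0.0, 0.1, ..., 1.0, i.e. the 0th to 100th percentiles in steps of 10.
        double[] fractions = new double[11];
        for (int i = 0; i <= 10; i++) {
            fractions[i] = i / 10.0;
        }
        System.out.println(Arrays.toString(percentiles(data, fractions)));
    }
}
```

    Because the cast to long truncates toward zero, this picks an existing element of the dataset and never interpolates between neighbouring values.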
    

    Note that this requires sorting the dataset, which is O(n log n) and can be expensive on large datasets.

    The other answer, which suggests simply computing a histogram, would not compute the percentiles correctly. Here is a counterexample: a dataset of 100 numbers, 99 of them 0 and one of them 1. With 10 bins, all 99 zeros end up in the first bin and the 1 in the last bin, with 8 empty bins in between.
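
    The counterexample can be reproduced locally with a minimal plain-Java sketch. The class HistogramCounterexample, its helpers, and the choice of 10 equal-width bins over [min, max] are assumptions made for illustration, and the sketch assumes max > min. It does not reproduce the other answer's code or Spark's own histogram implementation.

```java
import java.util.Arrays;

class HistogramCounterexample {
    // The dataset from the counterexample: 99 zeros and a single 1.
    static double[] sample() {
        double[] data = new double[100];   // all zeros initially
        data[0] = 1.0;
        return data;
    }

    // Counts values into `bins` equal-width buckets spanning [min, max]; assumes max > min.
    static long[] histogram(double[] data, int bins) {
        double min = Arrays.stream(data).min().getAsDouble();
        double max = Arrays.stream(data).max().getAsDouble();
        long[] counts = new long[bins];
        for (double v : data) {
            int b = (int) ((v - min) / (max - min) * bins);
            counts[Math.min(b, bins - 1)]++;   // the maximum falls into the last bucket
        }
        return counts;
    }

    // Exact value from a full sort, using the same truncating rank rule as getPercentiles.
    static double exactPercentile(double[] data, double fraction) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int index = Math.min((int) (sorted.length * fraction), sorted.length - 1);
        return sorted[index];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(histogram(sample(), 10)));
        System.out.println(exactPercentile(sample(), 0.5));
    }
}
```

    The bucket counts come out as 99 in the first bucket, 1 in the last, and 0 in the eight buckets between them, while the sorted data gives a median of exactly 0. The histogram alone only says that 99 values lie somewhere in the first bucket's range.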
