I have an RDD of integers (i.e. RDD[Int]) and I would like to compute the following eleven percentiles: 0th, 10th, 20th, ..., 90th, 100th.
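You can sort the RDD, index the sorted elements with zipWithIndex, and then look up the element whose index corresponds to each percentile.
To compute the median and the 99th percentile, where size is the number of elements in the RDD (e.g. from rdd.count()): getPercentiles(rdd, new double[]{0.5, 0.99}, size, numPartitions);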
In Java 8:
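import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public static double[] getPercentiles(JavaRDD<Double> rdd, double[] percentiles, long rddSize, int numPartitions) {
    double[] values = new double[percentiles.length];
    // Sort ascending, then key each element by its rank so it can be looked up by position.
    JavaRDD<Double> sorted = rdd.sortBy((Double d) -> d, true, numPartitions);
    JavaPairRDD<Long, Double> indexed = sorted.zipWithIndex().mapToPair((Tuple2<Double, Long> t) -> t.swap());
    for (int i = 0; i < percentiles.length; i++) {
        double percentile = percentiles[i];
        // Clamp the rank so that percentile 1.0 maps to the last element rather than an out-of-range index.
        long id = Math.min((long) (rddSize * percentile), rddSize - 1);
        values[i] = indexed.lookup(id).get(0);
    }
    return values;
}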
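The index arithmetic can be checked without a cluster. The following is a minimal sketch in plain Java rather than Spark: the class name LocalPercentiles and the in-memory array are hypothetical, and clamping the rank to n - 1 is an added assumption so that the 100th percentile maps to the last element. It asks for the eleven percentiles from the question.

```java
import java.util.Arrays;

class LocalPercentiles {
    // Same index rule as getPercentiles, applied to an in-memory array:
    // the percentile p (as a fraction) is the element at rank floor(n * p),
    // clamped to n - 1 (the clamp is an assumption of this sketch).
    static int[] percentiles(int[] data, double[] fractions) {
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int[] result = new int[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            int rank = (int) Math.min((long) (n * fractions[i]), n - 1L);
            result[i] = sorted[rank];
        }
        return result;
    }

    public static void main(String[] args) {
        // 100, 99, ..., 1: deliberately unsorted input.
        int[] data = new int[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = 100 - i;
        }
        // 0.0, 0.1, ..., 1.0: the 0th, 10th, ..., 100th percentiles.
        double[] deciles = new double[11];
        for (int i = 0; i <= 10; i++) {
            deciles[i] = i / 10.0;
        }
        System.out.println(Arrays.toString(percentiles(data, deciles)));
    }
}
```

For the Spark version, the same fractions array would be passed as the percentiles argument of getPercentiles.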
Note that this requires sorting the dataset, which costs O(n log n) and can be expensive on large datasets.
The other answer, which suggests simply computing a histogram, would not compute the percentiles correctly. Here is a counterexample: take a dataset of 100 numbers, 99 of them being 0 and one being 1. With 10 bins you end up with all 99 zeros in the first bin and the 1 in the last bin, with 8 empty bins in between, so the bin counts cannot tell you that every percentile up to the 99th is exactly 0.
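The counterexample can be reproduced with a few lines of plain Java. This is only a sketch: the class name HistogramCounterExample is hypothetical, and it assumes the histogram uses 10 equal-width bins between the minimum and maximum value, which may differ from how the other answer builds its bins.

```java
import java.util.Arrays;

class HistogramCounterExample {
    // Counts values into `bins` equal-width buckets spanning [min, max].
    // Equal-width bucketing is an assumption of this sketch.
    static int[] histogram(double[] data, int bins) {
        double min = Arrays.stream(data).min().getAsDouble();
        double max = Arrays.stream(data).max().getAsDouble();
        int[] counts = new int[bins];
        for (double v : data) {
            int b = (int) ((v - min) / (max - min) * bins);
            // The maximum value would land at index `bins`, so put it in the last bucket.
            counts[Math.min(b, bins - 1)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] data = new double[100]; // 99 zeros...
        data[99] = 1.0;                  // ...and a single 1
        // prints [99, 0, 0, 0, 0, 0, 0, 0, 0, 1]
        System.out.println(Arrays.toString(histogram(data, 10)));
    }
}
```

From these counts alone, all that can be said is that 99% of the values fall somewhere in the first bin, which is not enough to recover the exact percentile values.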