Java Apache Commons getPercentile() different result that MS Excel percentile

白昼怎懂夜的黑 提交于 2019-12-05 02:23:58
Nick Fortescue

The difference is subtle and due to assumptions. It is easiest to explain with the 3 element case. Suppose you have three elements(N=3) a=x[0] < b=x[1] < c=x[2]. Both the Apache and the Excel method say that element b is the 50th percentile (the median). However they differ for a and c.

The Apache method (and the method referenced by the NIST page) say a is the 25th percentile and c is the 75% percentile, because it divides the space up into N+1 blocks, that is, into quarters.

The Excel method says that a is the 0th percentile and c the 100th percentile, as the space is divided into N-1 blocks, that is, in half.

Because of this, if you want the Excel method and you don't want to code it yourself, you could just remove the smallest and the largest element from your array, and call the Apache method - it should give you exactly the same result except at percentiles beyond the end points.

If you want to code it yourself, an easy way is given below. Be aware of these issues:

  • this sorts the array (so changes it)
  • this takes O(N log(N)) time due to the sorting. The Apache method uses a fast selection algorithm so takes O(N) time (google "quickselect" if you want to know more)

Code (not tested or even compiled, but should give you an idea).

// warning - modifies data 
double excelPercentile(double [] data, double percentile) { array
    Arrays.sort(data);
    double index = percentile*(data.length-1);
    int lower = (int)Math.floor(index);
    if(lower<0) { // should never happen, but be defensive
       return data[0];
    }
    if(lower>=data.length-1) { // only in 100 percentile case, but be defensive
       return data[data.length-1);
    }
    double fraction = index-lower;
    // linear interpolation
    double result=data[lower] + fraction*(data[lower+1]-data[lower]);
    return result;
 }

The solution was creating a class PercentileExcel which is almost a copy of percentile from commons method except for a small change on how to caculate the position:

pos=(1+p*(n-1))/100;

Then you need to add this line to the code in order to use the new class for percentile:

setPercentileImpl(PercentileExcel);

There is no unique definition of a percentile computed from a data set. See the Wikipedia page for the most common definitions in use.

pbirnie

Class org.apache.commons.math3.stat.descriptive.rank.Percentile already supports Excel style interpolation, you just need to enable it with EstimationType.R_7

public class PercentileExcel extends Percentile {
    public PercentileExcel() throws MathIllegalArgumentException {

    super(50.0,
          EstimationType.R_7, // use excel style interpolation
          NaNStrategy.REMOVED,
          new KthSelector(new MedianOf3PivotingStrategy()));
    }
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!