lower and upper quartiles in boxplot in R

温柔的废话 2020-12-06 18:32

I have

X = c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)

With boxplot, I'm trying to plot the quantile(X, 0.25)

3 answers
  •  失恋的感觉
    2020-12-06 19:07

    The discrepancy arises from an ambiguity in the definition of quantiles. No single method is strictly correct or incorrect: there are simply different ways to estimate quantiles in situations (such as an even number of data points) where they do not coincide neatly with a specific data point and must be interpolated. Somewhat disconcertingly, boxplot and quantile (and other functions that provide summary statistics) use different default methods to calculate quantiles, although the default can be overridden using the type = argument in quantile.

    We can see these differences more clearly in action by looking at some of the various ways to generate quantile statistics in R.

    Both boxplot and fivenum give the same values:

    boxplot.stats(X)$stats
    # [1] 18.0 25.5 32.0 48.0 63.0
    fivenum(X)
    # [1] 18.0 25.5 32.0 48.0 63.0
    

    In boxplot and fivenum, the lower (upper) quartile is equivalent to the median of the lower (upper) half of the data (including the median of the complete data):

    c(median(X[ X <= median(X) ]), median(X[ X >= median(X) ]))
    # [1] 25.5  48.0
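
    As a rough sketch of where 25.5 and 48 come from, the hinge positions can also be computed directly. The depth formula below is our reading of how fivenum locates the hinges, so treat it as an assumption; the names hinge_depth, depths and hinges are made up for illustration:

    ```r
    X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)
    x <- sort(X)
    n <- length(x)

    # Depth (fractional position) of the lower hinge in the sorted data
    # (assumed to match what fivenum does internally)
    hinge_depth <- floor((n + 3) / 2) / 2          # 3.5 for n = 12
    depths <- c(hinge_depth, n + 1 - hinge_depth)  # lower and upper hinge: 3.5, 9.5

    # A fractional depth is resolved by averaging the two neighbouring values
    hinges <- 0.5 * (x[floor(depths)] + x[ceiling(depths)])
    hinges
    # [1] 25.5 48.0
    ```

    A depth of 3.5 means "halfway between the 3rd and 4th sorted values", i.e. (24 + 27) / 2 = 25.5.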
    

    But quantile and summary do things differently:

    summary(X)
    #  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    # 18.00   26.25   32.00   35.75   46.50   63.00
    
    quantile(X, c(0.25,0.5,0.75))
    #   25%   50%   75% 
    # 26.25 32.00 46.50
    

    The difference between these results and those from boxplot and fivenum hinges on how the functions interpolate between data points. quantile attempts to interpolate by estimating the shape of the cumulative distribution function. According to ?quantile:

    quantile returns estimates of underlying distribution quantiles based on one or two order statistics from the supplied elements in x at probabilities in probs. One of the nine quantile algorithms discussed in Hyndman and Fan (1996), selected by type, is employed.

    The full details of the nine methods that quantile can use to estimate the distribution function of the data can be found in ?quantile, and are too lengthy to reproduce in full here. The important point is that the nine methods are taken from Hyndman and Fan (1996), who recommended type 8. The default method used by quantile is type 7, for historical reasons of compatibility with S. We can see the quartile estimates given by the different methods in quantile using:

    # Quartile estimates from each of the nine quantile() types (one row per type)
    quantile_methods <- data.frame(
      q25 = sapply(1:9, function(method) quantile(X, 0.25, type = method)),
      q50 = sapply(1:9, function(method) quantile(X, 0.50, type = method)),
      q75 = sapply(1:9, function(method) quantile(X, 0.75, type = method)))
    quantile_methods
    #       q25 q50    q75
    # 1 24.0000  30 45.000
    # 2 25.5000  32 48.000
    # 3 24.0000  30 45.000
    # 4 24.0000  30 45.000
    # 5 25.5000  32 48.000
    # 6 24.7500  32 49.500
    # 7 26.2500  32 46.500
    # 8 25.2500  32 48.500
    # 9 25.3125  32 48.375
    

    For these data, which have an even number of points, type = 5 gives the same quartile estimates as boxplot. However, when the number of data points is odd, it is type = 7 that coincides with the boxplot statistics.
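
    To make the two types concrete, here is a minimal sketch that reproduces their lower-quartile estimates by hand. It assumes the position rules h = n * p + 0.5 for type 5 and h = (n - 1) * p + 1 for type 7, as we understand them from ?quantile; interp_at is a hypothetical helper, not part of R:

    ```r
    X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)

    # Hypothetical helper: linear interpolation at fractional position h (1-based)
    # in the sorted data; only valid for 1 <= h <= length(x).
    interp_at <- function(x, h) {
      x <- sort(x)
      lo <- floor(h)
      hi <- ceiling(h)
      x[lo] + (h - lo) * (x[hi] - x[lo])
    }

    n <- length(X)
    p <- 0.25

    h5 <- n * p + 0.5        # type 5 position: 3.5
    h7 <- (n - 1) * p + 1    # type 7 position: 3.75

    q5 <- interp_at(X, h5)   # 25.5, the boxplot hinge for this even-sized sample
    q7 <- interp_at(X, h7)   # 26.25, the default quantile() result
    ```

    With 12 points, type 5 lands exactly halfway between the 3rd and 4th sorted values, which is where the boxplot hinge sits, while type 7 lands three quarters of the way towards the 4th.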

    We can show this works by selecting the type automatically, 5 or 7, depending on whether the number of data points is even or odd. The code below draws boxplots for data sets with 1 to 30 values and overlays the quantile estimates as dashed red lines; boxplot and quantile give the same values for both odd and even N:

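    # 5 x 6 grid of panels, one per sample size, with margins and axes stripped
    layout(matrix(1:30, 5, 6, byrow = TRUE), respect = TRUE)
    par(mar = c(0.2, 0.2, 0.2, 0.2), bty = "n", yaxt = "n", xaxt = "n")
    
    for (N in 1:30) {
      X <- sample(100, N)
      boxplot(X)
      # Even N -> type 5, odd N -> type 7
      type <- c(5, 7)[(N %% 2) + 1]
      abline(h = quantile(X, c(0.25, 0.5, 0.75), type = type), col = "red", lty = 2)
    }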
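
    The same agreement can be checked numerically rather than read off the plots. A minimal sketch, assuming boxplot.stats(X)$stats[2:4] holds the lower hinge, median and upper hinge that boxplot draws; set.seed(1) and the name matches are arbitrary choices for illustration:

    ```r
    set.seed(1)  # arbitrary seed, only to make the check reproducible

    matches <- sapply(1:30, function(N) {
      X <- sample(100, N)
      type <- c(5, 7)[(N %% 2) + 1]   # even N -> type 5, odd N -> type 7
      q <- quantile(X, c(0.25, 0.5, 0.75), type = type, names = FALSE)
      # Compare against the lower hinge, median and upper hinge used by boxplot
      isTRUE(all.equal(q, boxplot.stats(X)$stats[2:4]))
    })
    all(matches)
    # [1] TRUE
    ```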
    


    Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, American Statistician 50, 361–365
