Why do stat_density (R; ggplot2) and gaussian_kde (Python; scipy) differ?

Submitted by 孤街浪徒 on 2019-12-02 05:50:35

There seems to be disagreement about what Silverman's Rule is.

The scipy docs say that Silverman's rule is implemented as:

def silverman_factor(self):
    return power(self.neff*(self.d+2.0)/4.0, -1./(self.d+4))

Here d is the number of dimensions (1, in your case) and neff is the effective sample size (the number of points, assuming no weights). So for one-dimensional data the scipy bandwidth factor is (n * 3 / 4) ^ (-1 / 5); this factor is then multiplied by the data's standard deviation, which scipy computes in a separate method.
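To confirm that reduction, here is a quick sketch (plain Python, reimplementing the formula from the scipy source rather than importing scipy) showing that silverman_factor with d = 1 equals (3n/4) ^ (-1/5):

```python
# scipy's silverman_factor formula, reimplemented for illustration.
def silverman_factor(neff, d):
    return (neff * (d + 2.0) / 4.0) ** (-1.0 / (d + 4))

# For d = 1 the exponent is -1/5 and the base is 3n/4.
n = 200
assert abs(silverman_factor(n, 1) - (n * 3 / 4) ** (-1 / 5)) < 1e-12
```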

By contrast, R's stats package docs describe Silverman's method as "0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power". This can be verified in R itself: typing bw.nrd0 at the console prints the source:

function (x) 
{
    if (length(x) < 2L) 
        stop("need at least 2 data points")
    hi <- sd(x)
    if (!(lo <- min(hi, IQR(x)/1.34))) 
        (lo <- hi) || (lo <- abs(x[1L])) || (lo <- 1)
    0.9 * lo * length(x)^(-0.2)
}

Wikipedia, on the other hand, gives "Silverman's Rule of Thumb" as one of many possible names for the estimator:

1.06 * sigma * n ^ (-1 / 5)

The Wikipedia version is equivalent to the scipy version.

All three sources cite the same reference: Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC. p. 48. ISBN 978-0-412-24620-3. Wikipedia and R specifically cite page 48, whereas scipy's docs do not mention a page number. (I have submitted an edit to Wikipedia to update its page reference to p.45, see below.)


Addendum

I found a PDF of the Silverman reference.

On page 45, equation 3.28 is what is used in the Wikipedia article: (4 / 3) ^ (1 / 5) * sigma * n ^ (-1 / 5) ~= 1.06 * sigma * n ^ (-1 / 5). Scipy uses the same method, rewriting (4 / 3) ^ (1 / 5) as the equivalent (3 / 4) ^ (-1 / 5). Silverman describes this method:
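The numerical equivalence of the two constants is easy to check:

```python
# (4/3)^(1/5) and (3/4)^(-1/5) are the same number, ~1.0592,
# which rounds to the 1.06 quoted on Wikipedia.
c1 = (4 / 3) ** (1 / 5)
c2 = (3 / 4) ** (-1 / 5)
assert abs(c1 - c2) < 1e-12
assert round(c1, 2) == 1.06
```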

While (3.28) will work well if the population really is normally distributed, it may oversmooth somewhat if the population is multimodal... as the mixture becomes more strongly bimodal the formula (3.28) will oversmooth more and more, relative to the optimal choice of smoothing parameter.

The scipy docs reference this weakness, stating:

It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

Silverman's book goes on to motivate the method used by R and Stata. On page 48, we reach equation 3.31:

h = 0.9 * A * n ^ (-1 / 5)
# A defined on previous page, eqn 3.30
A = min(standard deviation, interquartile range / 1.34)
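For reference, here is a rough Python translation of R's bw.nrd0 (a sketch only: it assumes numpy's default linear interpolation for percentiles, which matches R's default type-7 quantiles, and it mirrors the zero-variance fallbacks seen in the R source above):

```python
import numpy as np

def bw_nrd0(x):
    """Sketch of R's bw.nrd0, i.e. Silverman's eqn 3.31: 0.9 * A * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    if x.size < 2:
        raise ValueError("need at least 2 data points")
    hi = x.std(ddof=1)  # sample standard deviation, like R's sd()
    iqr = np.subtract(*np.percentile(x, [75, 25]))  # R's default IQR (type-7 quantiles)
    lo = min(hi, iqr / 1.34)
    if lo == 0:  # fallback chain mirrors the R source
        lo = hi or abs(x[0]) or 1.0
    return 0.9 * lo * x.size ** (-0.2)
```

For data where IQR / 1.34 is smaller than the standard deviation, this yields a narrower bandwidth than the scipy/Wikipedia rule, which is exactly the robustness Silverman was after.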

Silverman describes this method as:

The best of both possible worlds... In summary, the choice ([eqn] 3.31) for the smoothing parameter will do very well for a wide range of densities and is trivial to evaluate. For many purposes it will certainly be an adequate choice of window width, and for others it will be a good starting point for subsequent fine-tuning.

So, it seems Wikipedia and scipy use the simple normal-reference version of Silverman's estimator (eqn 3.28), while R and Stata use his more robust refinement (eqn 3.31).
