Question
I am using the twang package to create propensity scores, which I then use as weights in a binomial glm fitted with survey::svyglm. The code looks something like this:
pscore <- ps(ppci ~ var1+var2+.........., data=dt....)
dt$w <- get.weights(pscore, stop.method="es.mean")
design.ps <- svydesign(ids=~1, weights=~w, data=dt)
glm1 <- svyglm(m30 ~ ppci, design=design.ps, family=binomial)
This produces the following warning:
Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
Does anyone know what I could be doing wrong? I wasn't sure whether this question would be better suited to stats.SE, but on balance I thought I would try here first.
Answer 1:
There's nothing wrong, glm is just picky when it comes to specifying binomial (and Poisson) models. It warns if it detects that the no. of trials or successes is non-integral, but it goes ahead and fits the model anyway. If you want to suppress the warning (and you're sure it's not a problem), use family=quasibinomial instead.
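Applied to the setup in the question, that just means swapping the family in the svyglm call. A sketch reusing the question's dt, w, m30 and ppci objects (the point estimates are unchanged):
library(survey)
design.ps <- svydesign(ids=~1, weights=~w, data=dt)
# quasibinomial skips the integrality check, so no warning is emitted
glm1 <- svyglm(m30 ~ ppci, design=design.ps, family=quasibinomial())
summary(glm1)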
Answer 2:
There is nothing wrong computationally, but statistically you may not be doing something that makes much sense. In such a case it is probably better to use a robust regression method, which is generally a good idea for proportional (0-1) response data, especially if your data include units that are exactly 0 or exactly 1.
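One concrete option along those lines, offered purely as an illustration since the answer does not name a specific method, is a quasi-binomial GLM combined with a sandwich (heteroskedasticity-consistent) covariance; mydata, prop and x below are hypothetical placeholders:
library(sandwich)   # vcovHC(): sandwich covariance estimators
library(lmtest)     # coeftest(): coefficient tests with a user-supplied covariance
fit <- glm(prop ~ x, family = quasibinomial(), data = mydata)   # prop is a proportion in [0, 1]
lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC0"))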
Answer 3:
Like @HongOoi said, glm.fit with the binomial family expects integer counts and throws a warning otherwise; if you want non-integer counts, use quasibinomial. The rest of my answer compares the two.
Quasibinomial in R for glm.fit is exactly the same as binomial for the coefficient estimates (as mentioned in the comments by @HongOoi) but not for the standard errors (as mentioned in the comment by @nograpes).
Comparison of source code
A diff on the source code of stats::binomial and stats::quasibinomial shows the following changes:
- the text "binomial" becomes "quasibinomial"
- the aic function returns NA instead of the calculated AIC
and the following removals:
- setting outcomes to 0 when weights = 0
- check on integrality of weights
- the simfun function used to simulate data
Only simfun could make a difference, but the source code of glm.fit shows no use of that function, unlike other fields in the object returned by stats::binomial such as mu.eta and link.
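A quick way to see these differences without reading the diff is to compare the two family objects directly (a small check of my own; the exact set of components may vary across R versions):
b <- stats::binomial()
q <- stats::quasibinomial()
setdiff(names(b), names(q))   # components present only in the binomial family (the simulate function)
b$aic                         # the function computing the binomial AIC
q$aic                         # a function that simply returns NA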
Minimal working example
The results from using quasibinomial or binomial are the same for the coefficients in this minimal working example:
library('MASS')
library('stats')
gen_data <- function(n=100, p=3) {
  set.seed(1)
  # non-integer case weights drawn from a Gamma distribution
  weights <- stats::rgamma(n=n, shape=rep(1, n), rate=rep(1, n))
  # binary outcome
  y <- stats::rbinom(n=n, size=1, prob=0.5)
  theta <- stats::rnorm(n=p, mean=0, sd=1)
  means <- colMeans(as.matrix(y) %*% theta)
  # predictors drawn around a mean vector derived from y and theta
  x <- MASS::mvrnorm(n=n, means, diag(1, p, p))
  return(list(x=x, y=y, weights=weights, theta=theta))
}
fit_glm <- function(family) {
  data <- gen_data()
  fit <- stats::glm.fit(x = data$x,
                        y = data$y,
                        weights = data$weights,
                        family = family)
  return(fit)
}
fit1 <- fit_glm(family=stats::binomial(link = "logit"))
fit2 <- fit_glm(family=stats::quasibinomial(link = "logit"))
all(fit1$coefficients == fit2$coefficients)  # TRUE: the fitted coefficients are identical
Comparison with the quasibinomial probability distribution
This thread suggests that the quasibinomial distribution differs from the binomial distribution by an additional parameter phi. But "quasibinomial" means different things in statistics and in R.
First, nothing in the source code of stats::quasibinomial mentions an additional phi parameter.
Second, a quasiprobability is similar to a probability, but not a proper one. In this case, the term (n \choose k) cannot be computed when the numbers are non-integers, although it could be generalised with the Gamma function. This may be a problem for the definition of the probability distribution, but it is irrelevant for estimation: the term (n \choose k) does not depend on the parameter and falls out of the optimisation.
The maximum-likelihood estimator is:
\hat{\theta} = argmax_\theta l(\theta)
The log-likelihood as a function of theta with the binomial family is:
l(\theta) = \sum_i [ \log (n_i \choose k_i) + k_i \log \mu_i(\theta) + (n_i - k_i) \log(1 - \mu_i(\theta)) ]
          = constant + \sum_i [ k_i \log \mu_i(\theta) + (n_i - k_i) \log(1 - \mu_i(\theta)) ]
where the constant, \sum_i \log (n_i \choose k_i), is independent of the parameter theta, so it falls out of the optimisation.
Comparison of standard errors
The standard errors calculated by stats::summary.glm use a different dispersion value for the binomial and quasibinomial families, as documented in the help page for stats::summary.glm:
The dispersion of a GLM is not used in the fitting process, but it is needed to find standard errors. If dispersion is not supplied or NULL, the dispersion is taken as 1 for the binomial and Poisson families, and otherwise estimated by the residual Chisquared statistic (calculated from cases with non-zero weights) divided by the residual degrees of freedom. ...
cov.unscaled: the unscaled (dispersion = 1) estimated covariance matrix of the estimated coefficients.
cov.scaled: ditto, scaled by dispersion.
Using the above minimal working example:
summary1 <- stats::summary.glm(fit1)
summary2 <- stats::summary.glm(fit2)
print("Equality of unscaled variance-covariance-matrix:")
all(summary1$cov.unscaled == summary2$cov.unscaled)
print("Equality of variance-covariance matrix scaled by `dispersion`:")
all(summary1$cov.scaled == summary2$cov.scaled)
print(summary1$coefficients)
print(summary2$coefficients)
shows the same coefficients, same unscaled variance-covariance matrix, and different scaled variance-covariance matrices:
[1] "Equality of unscaled variance-covariance-matrix:"
[1] TRUE
[1] "Equality of variance-covariance matrix scaled by `dispersion`:"
[1] FALSE
Estimate Std. Error z value Pr(>|z|)
[1,] -0.3726848 0.1959110 -1.902317 0.05712978
[2,] 0.5887384 0.2721666 2.163155 0.03052930
[3,] 0.3161643 0.2352180 1.344133 0.17890528
Estimate Std. Error t value Pr(>|t|)
[1,] -0.3726848 0.1886017 -1.976042 0.05099072
[2,] 0.5887384 0.2620122 2.246988 0.02690735
[3,] 0.3161643 0.2264421 1.396226 0.16583365
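As a small follow-up check of my own (not part of the original answer), the two scaled matrices differ exactly by the estimated dispersion that stats::summary.glm reports:
# continuing with summary1 (binomial) and summary2 (quasibinomial) from above
summary1$dispersion   # 1, by definition for the binomial family
summary2$dispersion   # residual Chisquared statistic / residual degrees of freedom
all.equal(summary2$cov.scaled, summary2$cov.unscaled * summary2$dispersion)   # TRUE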
Answer 4:
Sorry, but it is more robust in the sense that, if the underlying mechanism is an overdispersed binomial model, the overdispersed binomial family will account for it when estimating the standard error. Hence you will get better coverage, even though the point estimates are the same.
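To make the coverage point concrete, here is a small simulation sketch of my own (not from the original answer): counts generated with extra-binomial variation, for which binomial and quasibinomial give identical coefficients but different standard errors.
# overdispersed counts via a random intercept on the logit scale
set.seed(42)
n <- 200; size <- 20
x <- rnorm(n)
eta <- -0.5 + 0.8 * x + rnorm(n, sd = 0.7)   # extra variation beyond the binomial
y <- rbinom(n, size = size, prob = plogis(eta))
fit_b  <- glm(cbind(y, size - y) ~ x, family = binomial)
fit_qb <- glm(cbind(y, size - y) ~ x, family = quasibinomial)
all.equal(coef(fit_b), coef(fit_qb))          # identical point estimates
summary(fit_b)$coefficients[, "Std. Error"]   # understated when data are overdispersed
summary(fit_qb)$coefficients[, "Std. Error"]  # scaled up by the estimated dispersion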
Source: https://stackoverflow.com/questions/12953045/warning-non-integer-successes-in-a-binomial-glm-survey-packages