Get p-value for group mean difference without refitting linear model with a new reference level

问题

When we have a linear model with a factor variable X (with levels A, B, and C)

y ~ factor(X) + Var2 + Var3

The result shows the estimate XB and XC which is differences B - A and C - A. (suppose that the reference is A).

If we want to know the p-value of the difference between B and C: C - B, we should designate B or C as a reference group and re-run the model.

Can we get the p-values of the effect B - A, C - A, and C - B at one time?

回答1:

You are looking for linear hypothesis test by check p-value of some linear combination of regression coefficients. Based on my answer: How to conduct linear hypothesis test on regression coefficients with a clustered covariance matrix?, where we only considered sum of coefficients, I will extend the function LinearCombTest to handle more general cases, supposing alpha as some combination coefficients of variables in vars:

LinearCombTest <- function (lmObject, vars, alpha, .vcov = NULL) {
  ## if `.vcov` missing, use the one returned by `lm`
  if (is.null(.vcov)) .vcov <- vcov(lmObject)
  ## estimated coefficients
  beta <- coef(lmObject)
  ## linear combination of `vars` with combination coefficients `alpha`
  LinearComb <- sum(beta[vars] * alpha)
  ## get standard errors for sum of `LinearComb`
  LinearComb_se <- sum(alpha * crossprod(.vcov[vars, vars], alpha)) ^ 0.5
  ## perform t-test on `sumvars`
  tscore <- LinearComb / LinearComb_se
  pvalue <- 2 * pt(abs(tscore), lmObject$df.residual, lower.tail = FALSE)
  ## return a matrix
  form <- paste0("(", paste(alpha, vars, sep = " * "), ")")
  form <- paste0(paste0(form, collapse = " + "), " = 0")
  matrix(c(LinearComb, LinearComb_se, tscore, pvalue), nrow = 1L,
         dimnames = list(form, c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
  }

Consider a simple example, where we have a balanced design for three groups A, B and C, with group mean 0, 1, 2, respectively.

x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1

fit <- lm(y ~ x)
coef(summary(fit))

#             Estimate Std. Error   t value     Pr(>|t|)
#(Intercept) 0.1226684 0.09692277  1.265631 2.066372e-01
#xB          0.9317800 0.13706949  6.797866 5.823987e-11
#xC          2.0445528 0.13706949 14.916177 6.141008e-38

Since A is the reference level, xB is giving B - A while xC is giving C - A. Suppose we are now interested in the difference between group B and C, i.e., C - B, we can use

LinearCombTest(fit, c("xC", "xB"), c(1, -1))

#                         Estimate Std. Error  t value     Pr(>|t|)
#(1 * xC) + (-1 * xB) = 0 1.112773  0.1370695 8.118312 1.270686e-14

Note, this function is also handy to work out the group mean of B and C, that is (Intercept) + xB and (Intercept) + xC:

LinearCombTest(fit, c("(Intercept)", "xB"), c(1, 1))

#                                 Estimate Std. Error  t value     Pr(>|t|)
#(1 * (Intercept)) + (1 * xB) = 0 1.054448 0.09692277 10.87926 2.007956e-23

LinearCombTest(fit, c("(Intercept)", "xC"), c(1, 1))

#                                 Estimate Std. Error  t value     Pr(>|t|)
#(1 * (Intercept)) + (1 * xC) = 0 2.167221 0.09692277 22.36029 1.272811e-65

Alternative solution with lsmeans

Consider the above toy example again:

library(lsmeans)
lsmeans(fit, spec = "x", contr = "revpairwise")

#$lsmeans
# x    lsmean         SE  df    lower.CL  upper.CL
# A 0.1226684 0.09692277 297 -0.06807396 0.3134109
# B 1.0544484 0.09692277 297  0.86370603 1.2451909
# C 2.1672213 0.09692277 297  1.97647888 2.3579637
#
#Confidence level used: 0.95 
#
#$contrasts
# contrast estimate        SE  df t.ratio p.value
# B - A    0.931780 0.1370695 297   6.798  <.0001
# C - A    2.044553 0.1370695 297  14.916  <.0001
# C - B    1.112773 0.1370695 297   8.118  <.0001
#
#P value adjustment: tukey method for comparing a family of 3 estimates

The $lsmeans domain returns the marginal group mean, while $contrasts returns pairwise group mean difference, since we have used "revpairwise" contrast. Read p.32 of lsmeans for difference between "pairwise" and "revpairwise".

Well this is certainly interesting, as we can compare with the result from LinearCombTest. We see that LinearCombTest is doing correctly.

回答2:

glht (general linear hypothesis testing) from multcomp package makes this sort of multiple hypothesis test easy without re-running a bunch of separate models. It is essentially crafting a customized contrast matrix based on your defined comparisons of interest.

Using your example comparisons and building on the data @ZheyuanLi provided:

x <- gl(3,100,labels = LETTERS[1:3])
set.seed(0)
y <- c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2)) + 0.1

fit <- lm(y ~ x)

library(multcomp)
my_ht <- glht(fit, linfct = mcp(x = c("B-A = 0",
                             "C-A = 0",
                             "C-B = 0")))

summary(my_ht) will give you the adjusted p-values for the comparisons of interest.

#Linear Hypotheses:
#           Estimate Std. Error t value Pr(>|t|)    
#B - A == 0   0.9318     0.1371   6.798 1.11e-10 ***
#C - A == 0   2.0446     0.1371  14.916  < 1e-10 ***
#C - B == 0   1.1128     0.1371   8.118  < 1e-10 ***

回答3:

You could use the library car, and use the function linearHypothesis with the parameter vcov.

Set this as the variance-covariance matrix of your model.

The function takes formulas or a matrix to describe the system of equations that you would like to test.

来源：https://stackoverflow.com/questions/40143388/get-p-value-for-group-mean-difference-without-refitting-linear-model-with-a-new

标签

regression

linear-regression

hypothesis-test