Question
Using R, I am running a logistic model and need to include an interaction term in the following fashion, where A is categorical, and B, continuous.
Y ~ A + B + normalized(B):A
My problem is that when I do so, the reference category is not the same as in
Y ~ A + B + A:B
which makes comparison of the models difficult. I am sure there is a way to force the reference category to be the same all the time, but can't seem to find a straightforward answer.
To illustrate, my data looks like this:
income                     ndvi  sga
30,000$ - 49,999$  -0,141177617    0
30,000$ - 49,999$  -0,170513257    0
>80,000$           -0,054939323    1
>80,000$            -0,14724104    0
>80,000$           -0,207678157    0
missing            -0,229890869    1
50,000$ - 79,999$   0,245063253    0
50,000$ - 79,999$   0,127565529    0
15,000$ - 29,999$  -0,145778357    0
15,000$ - 29,999$  -0,170944338    0
30,000$ - 49,999$  -0,121060635    0
30,000$ - 49,999$  -0,245407291    0
missing            -0,156427532    0
>80,000$            0,033541238    0
And the outputs are reproduced below. The first set of results is from the model Y ~ A*B, and the second is from Y ~ A + B + A:normalized(B).
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -2.72175    0.29806  -9.132   <2e-16 ***
ndvi                          2.78106    2.16531   1.284   0.1990
income15,000$ - 29,999$      -0.53539    0.46211  -1.159   0.2466
income30,000$ - 49,999$      -0.68254    0.39479  -1.729   0.0838 .
income50,000$ - 79,999$      -0.13429    0.33097  -0.406   0.6849
income>80,000$               -0.56692    0.35144  -1.613   0.1067
incomemissing                -0.85257    0.47230  -1.805   0.0711 .
ndvi:income15,000$ - 29,999$ -2.27703    3.25433  -0.700   0.4841
ndvi:income30,000$ - 49,999$ -3.76892    2.86099  -1.317   0.1877
ndvi:income50,000$ - 79,999$ -0.07278    2.46483  -0.030   0.9764
ndvi:income>80,000$          -3.32489    2.62000  -1.269   0.2044
ndvi:incomemissing           -3.98098    3.35447  -1.187   0.2353
                                        Estimate Std. Error z value Pr(>|z|)
(Intercept)                             -3.07421    0.30680 -10.020   <2e-16 ***
ndvi                                    -1.19992    2.56201  -0.468    0.640
income15,000$ - 29,999$                 -0.33379    0.29920  -1.116    0.265
income30,000$ - 49,999$                 -0.34885    0.26666  -1.308    0.191
income50,000$ - 79,999$                 -0.12784    0.25124  -0.509    0.611
income>80,000$                          -0.27255    0.27288  -0.999    0.318
incomemissing                           -0.50010    0.31299  -1.598    0.110
income<15,000$:normalize(ndvi)           0.40515    0.34139   1.187    0.235
income15,000$ - 29,999$:normalize(ndvi)  0.17341    0.35933   0.483    0.629
income30,000$ - 49,999$:normalize(ndvi)  0.02158    0.32280   0.067    0.947
income50,000$ - 79,999$:normalize(ndvi)  0.39774    0.28697   1.386    0.166
income>80,000$:normalize(ndvi)           0.06677    0.30087   0.222    0.824
incomemissing:normalize(ndvi)                 NA         NA      NA       NA
So in the first model, the category "income<15,000" is the reference category, whereas in the second, something different happens, which I'm not at all clear about yet.
Answer 1:
Let's say that we would like to fit a regression on this equation:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 - \bar{X}_1) X_2 + \varepsilon$$

We tried to implement it using model.matrix, but there is an automation problem, illustrated in the results below. Is there a better way to implement it? To be more specific, let's say that X_1 is a continuous variable, while X_2 is a dummy.
Basically, the interpretation of the interaction term would be the same, except that the main term X_2 would be evaluated with X_1 at its mean (see the early draft of this paper).
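The claim about the main term can be checked with a short derivation. This is only a sketch, assuming the model has the form written on the first line; the symbols \beta_j, \gamma_2 and \bar{X}_1 (the sample mean of X_1) are notation introduced here for illustration.

```latex
\begin{align*}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 - \bar{X}_1) X_2 + \varepsilon \\
  &= \beta_0 + \beta_1 X_1 + \underbrace{(\beta_2 - \beta_3 \bar{X}_1)}_{\gamma_2} X_2
     + \beta_3 X_1 X_2 + \varepsilon
\end{align*}
% Effect of switching X_2 from 0 to 1 at a given value X_1 = x:
\[
\beta_2 + \beta_3 (x - \bar{X}_1) = \gamma_2 + \beta_3 x
\]
```

Under this assumption the interaction coefficient \beta_3 is the same in both parameterizations. The uncentered main effect \gamma_2 is the effect of X_2 at X_1 = 0, while \beta_2 is the effect of X_2 at X_1 = \bar{X}_1.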
Here are some data to illustrate my point (it's not a glm, but the same method can be applied to a glm):
library(car)  # loads the Prestige data (via carData) and compareCoefs()
str(Prestige)
# Some data cleaning: drop rows with a missing type
Prestige <- Prestige[!is.na(Prestige$type), ]
# Interaction the usual way
lm1 <- lm(income ~ education + type + education:type, data = Prestige)
summary(lm1)
# Interaction with demeaned education
Prestige$education_ <- Prestige$education - mean(Prestige$education)
When using the regular formula method, things do not turn out the way we want, because the formula does not set any level of type as the reference in the interaction:
lm2 <- lm(income ~ education + type + education_:type, data = Prestige)
summary(lm2)
# Using model.matrix to shape the interaction; [, -1] drops the first
# column (education_:typebc), which makes bc the reference level
cusInt <- model.matrix(~ -1 + education_:type, data = Prestige)[, -1]
colnames(cusInt)
lm3 <- lm(income ~ education + type + cusInt, data = Prestige)
summary(lm3)
compareCoefs(lm1, lm3, lm2)
The results are here:
                           Est. 1    SE 1  Est. 2    SE 2  Est. 3    SE 3
(Intercept)                 -1865    3682   -1865    3682    4280    8392
education                     866     436     866     436     297     770
typeprof                    -3068    7192    -542    1950    -542    1950
typewc                       3646    9274   -2498    1377   -2498    1377
education:typeprof            234     617
education:typewc             -569     885
cusInteducation_:typeprof                     234     617
cusInteducation_:typewc                      -569     885
typebc:education_                                             569     885
typeprof:education_                                           803     885
typewc:education_
So basically, when using model.matrix we have to intervene to set the reference level. Besides, the cusInt prefix appears in front of the variable names, so formatting results when one has a lot of tables to compare is quite tedious.
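One possible way around both the manual reference-setting and the cusInt prefix is to use the centered variable as the main effect as well, so that the formula sees a main effect matching the interaction and keeps the usual treatment contrasts. The following is a minimal sketch on simulated data with a binomial glm; every name in it (dat, grp, x, xc, y, m_raw, m_mix, m_ctr) is hypothetical and not taken from the post.

```r
# Simulated data; all object and column names are illustrative assumptions.
set.seed(1)
n   <- 600
dat <- data.frame(
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x   = rnorm(n, mean = 5, sd = 2)
)
dat$xc <- dat$x - mean(dat$x)  # centered copy of x
dat$y  <- rbinom(n, 1, plogis(-1 + 0.3 * dat$x + 0.5 * (dat$grp == "b")))

# Usual interaction: "a" (the first level of grp) is the reference.
m_raw <- glm(y ~ x * grp, family = binomial, data = dat)

# Mixed formula (raw main effect, centered interaction): xc has no main
# effect in the formula, so grp is expanded to all of its levels inside
# grp:xc and one of those columns is aliased (reported as NA).
m_mix <- glm(y ~ x + grp + grp:xc, family = binomial, data = dat)

# Centered variable used for both the main effect and the interaction:
# grp keeps treatment contrasts, so "a" stays the reference level.
m_ctr <- glm(y ~ xc * grp, family = binomial, data = dat)

print(names(coef(m_mix)))
print(names(coef(m_ctr)))

# Interaction estimates from the raw and centered fits, side by side.
print(cbind(raw      = coef(m_raw)[c("x:grpb", "x:grpc")],
            centered = coef(m_ctr)[c("xc:grpb", "xc:grpc")]))
```

In this sketch the interaction estimates of m_ctr agree with those of m_raw and the reference level stays at the first level of grp. Only the intercept and the grp main effects shift, because they are evaluated at the mean of x instead of at x = 0.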
Source: https://stackoverflow.com/questions/12462289/forcing-a-reference-category-in-logistic-model-in-r