Forcing a reference category in logistic model in R

Question

Using R, I am running a logistic model and need to include an interaction term in the following fashion, where A is categorical, and B, continuous.

Y ~ A + B + normalized(B):A

My problem is that when I do so, the reference category is not the same as in

Y ~ A + B + A:B

which makes comparison of the models difficult. I am sure there is a way to force the reference category to be the same all the time, but can't seem to find a straightforward answer.

To illustrate, my data looks like this:

income                      ndvi        sga
30,000$ - 49,999$        -0,141177617        0
30,000$ - 49,999$        -0,170513257        0
>80,000$                 -0,054939323        1
>80,000$                 -0,14724104         0
>80,000$                 -0,207678157        0
missing                  -0,229890869        1
50,000$ - 79,999$         0,245063253        0
50,000$ - 79,999$         0,127565529        0
15,000$ - 29,999$        -0,145778357        0
15,000$ - 29,999$        -0,170944338        0
30,000$ - 49,999$        -0,121060635        0
30,000$ - 49,999$        -0,245407291        0
missing                  -0,156427532        0
>80,000$                  0,033541238        0

The outputs are reproduced below. The first set of results is from the model Y ~ A*B, and the second from Y ~ A + B + A:normalized(B).

                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                            -2.72175    0.29806  -9.132   <2e-16 ***
ndvi                                    2.78106    2.16531   1.284   0.1990    
income15,000$ - 29,999$                -0.53539    0.46211  -1.159   0.2466    
income30,000$ - 49,999$                -0.68254    0.39479  -1.729   0.0838 .  
income50,000$ - 79,999$                -0.13429    0.33097  -0.406   0.6849    
income>80,000$                         -0.56692    0.35144  -1.613   0.1067    
incomemissing                          -0.85257    0.47230  -1.805   0.0711 .  
ndvi:income15,000$ - 29,999$           -2.27703    3.25433  -0.700   0.4841    
ndvi:income30,000$ - 49,999$           -3.76892    2.86099  -1.317   0.1877    
ndvi:income50,000$ - 79,999$           -0.07278    2.46483  -0.030   0.9764    
ndvi:income>80,000$                    -3.32489    2.62000  -1.269   0.2044    
ndvi:incomemissing                     -3.98098    3.35447  -1.187   0.2353 

                                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)                              -3.07421    0.30680 -10.020   <2e-16 ***
ndvi                                     -1.19992    2.56201  -0.468    0.640    
income15,000$ - 29,999$                  -0.33379    0.29920  -1.116    0.265    
income30,000$ - 49,999$                  -0.34885    0.26666  -1.308    0.191    
income50,000$ - 79,999$                  -0.12784    0.25124  -0.509    0.611    
income>80,000$                           -0.27255    0.27288  -0.999    0.318    
incomemissing                            -0.50010    0.31299  -1.598    0.110    
income<15,000$:normalize(ndvi)            0.40515    0.34139   1.187    0.235    
income15,000$ - 29,999$:normalize(ndvi)   0.17341    0.35933   0.483    0.629    
income30,000$ - 49,999$:normalize(ndvi)   0.02158    0.32280   0.067    0.947    
income50,000$ - 79,999$:normalize(ndvi)   0.39774    0.28697   1.386    0.166    
income>80,000$:normalize(ndvi)            0.06677    0.30087   0.222    0.824    
incomemissing:normalize(ndvi)                  NA         NA      NA       NA

So in the first model, the category "income<15,000$" is the reference category, whereas in the second, something different happens, which I am not at all clear about yet.

Answer 1:

Let's say that we would like to fit a regression in which X_2 is interacted with the demeaned X_1, that is, a model of the form

Y ~ X_1 + X_2 + (X_1 - mean(X_1)):X_2

We tried to implement it using model.matrix, but this cannot be fully automated, as the results below illustrate. Is there a better way to implement it? To be more specific, let's say that X_1 is a continuous variable, while X_2 is a dummy.

Basically the interpretation of the interaction term would be the same, except that the main term X_2 would be evaluated when X_1 is at its mean. (see Early draft of this Paper)
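
A short derivation may make this concrete. It assumes the model has the form written below; the coefficient symbols and the notation \bar{X}_1 for the sample mean of X_1 are introduced here for illustration and are not taken from the original post.

```latex
\begin{align*}
\text{Centered:}\quad
  E[Y \mid X_1, X_2] &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 - \bar{X}_1) X_2 \\
  E[Y \mid X_1, X_2 = 1] - E[Y \mid X_1, X_2 = 0] &= \beta_2 + \beta_3 (X_1 - \bar{X}_1) \\[4pt]
\text{Uncentered:}\quad
  E[Y \mid X_1, X_2] &= \beta_0 + \beta_1 X_1 + \beta_2' X_2 + \beta_3 X_1 X_2 \\
  E[Y \mid X_1, X_2 = 1] - E[Y \mid X_1, X_2 = 0] &= \beta_2' + \beta_3 X_1 \\[4pt]
\text{Hence:}\quad
  \beta_2 &= \beta_2' + \beta_3 \bar{X}_1
\end{align*}
```

In the centered model the coefficient on X_2 is the group difference at X_1 = \bar{X}_1, whereas in the uncentered model it is the difference at X_1 = 0. The interaction coefficient \beta_3 is the same in both.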

Here are some data to illustrate my point. (It is not a glm, but the same method applies to a glm.)

library(car)
str(Prestige)
# some data cleaning
Prestige <- Prestige[!is.na(Prestige$type),] 

# interaction the usual way.
lm1 <- lm(income ~ education+ type + education:type, data = Prestige); summary(lm1)

# interacting with demeaned education
Prestige$education_ <- Prestige$education-mean(Prestige$education)

With the regular formula interface, things do not turn out the way we want: because education_ has no main effect in the model, the formula does not drop a reference level in the education_:type interaction.

lm2 <- lm(income ~ education+ type + education_:type, data = Prestige); summary(lm2)
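
The expansion can be inspected directly with model.matrix. The following is a minimal sketch on simulated data: the data frame d and its values are made up for illustration and only mimic the structure of Prestige (a three-level factor and a continuous variable). It uses base R only.

```r
# Simulated stand-in for Prestige (hypothetical values, for illustration only)
set.seed(1)
d <- data.frame(
  type      = factor(rep(c("bc", "prof", "wc"), each = 10)),
  education = runif(30, 6, 16)
)
d$education_ <- d$education - mean(d$education)

# Interaction with a variable whose main effect is in the model:
# the reference level "bc" is dropped from the interaction columns.
mm_main <- model.matrix(~ education + type + education:type, data = d)
colnames(mm_main)

# Interaction with the centered variable, whose main effect is NOT in the model:
# all three levels (typebc, typeprof, typewc) get an interaction column.
mm_centered <- model.matrix(~ education + type + education_:type, data = d)
colnames(mm_centered)

# The extra column is a linear combination of the others, so the rank is
# lower than the number of columns; this is why lm() reports NA for one term.
c(rank = qr(mm_centered)$rank, columns = ncol(mm_centered))
```

R keeps every level of type in the second design matrix because the term obtained by removing type from the interaction (education_ alone) is absent from the formula.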

# Using model.matrix to shape the interaction
cusInt <- model.matrix(~-1+education_:type,data=Prestige)[,-1];colnames(cusInt)
lm3 <- lm(income ~ education+ type + cusInt, data = Prestige); summary(lm3)


compareCoefs(lm1,lm3,lm2)

The results are here (columns 1, 2 and 3 correspond to lm1, lm3 and lm2, in the order passed to compareCoefs):

                         Est. 1  SE 1 Est. 2  SE 2 Est. 3  SE 3
(Intercept)                -1865  3682  -1865  3682   4280  8392
education                    866   436    866   436    297   770
typeprof                   -3068  7192   -542  1950   -542  1950
typewc                      3646  9274  -2498  1377  -2498  1377
education:typeprof           234   617                          
education:typewc            -569   885                          
cusInteducation_:typeprof                 234   617             
cusInteducation_:typewc                  -569   885             
typebc:education_                                      569   885
typeprof:education_                                    803   885
typewc:education_

So, basically, when using model.matrix we have to intervene manually to set the reference level. Besides, the prefix cusInt appears in front of the coefficient names, so formatting the results is quite tedious when one has many tables to compare.
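
One possible way around both issues, offered as a sketch rather than a definitive answer, is to use the centered variable for the main effect as well as for the interaction. Centering only shifts the variable, so its slope and the interaction slopes are unchanged, while the formula interface again drops the reference level on its own. The data below are simulated (the data frame d, the column education_c and all values are hypothetical), and relevel() is used only to make the choice of reference level explicit.

```r
# Simulated stand-in for Prestige (hypothetical values, for illustration only)
set.seed(1)
n <- 90
d <- data.frame(
  type      = factor(rep(c("bc", "prof", "wc"), each = n / 3)),
  education = runif(n, 6, 16)
)
d$income <- 1000 + 500 * d$education +
  300 * (d$type == "prof") * d$education + rnorm(n, sd = 800)

# Center education once and use it everywhere in the formula
d$education_c <- d$education - mean(d$education)

# Fix the reference level explicitly ("bc" here)
d$type <- relevel(d$type, ref = "bc")

fit_raw      <- lm(income ~ education   * type, data = d)
fit_centered <- lm(income ~ education_c * type, data = d)

# The interaction slopes agree between the two fits; the intercept and the
# type main effects differ, since they are now evaluated at mean education.
coef(fit_raw)[c("education:typeprof", "education:typewc")]
coef(fit_centered)[c("education_c:typeprof", "education_c:typewc")]
```

The same formula can be passed to glm(..., family = binomial) for a logistic model. Note that the main effect is then reported for the centered variable rather than the raw one, which may or may not suit the comparison at hand.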

Source: https://stackoverflow.com/questions/12462289/forcing-a-reference-category-in-logistic-model-in-r

Tags

interaction