NA values when regressing with dummy variable interaction term

风流意气都作罢 submitted on 2020-12-30 03:36:05

Question


I'm trying to estimate factors that determine the difference in happiness level between people living in New York and Chicago.

The data look like this:

  Happiness     City Gender Employment   Worktype      Holiday
1        60 New York      0          0 Unemployed   Unemployed
2        80  Chicago      1          1 Whitecolor 1 day a week
3        39  Chicago      0          0 Unemployed   Unemployed
4        40 New York      1          0 Unemployed   Unemployed
5        69  Chicago      1          1  Bluecolor 2 day a week
6        90  Chicago      1          1  Bluecolor 2 day a week
7       100 New York      0          1 Whitecolor 2 day a week
8        30 New York      1          1 Whitecolor 1 day a week

Happiness is the dependent variable, and 'City' is where the person lives. 'Gender' is coded 0 = man, 1 = woman. 'Employment' is coded 0 = unemployed, 1 = employed. 'Worktype' is a three-level factor: 'Unemployed', 'Whitecolor', 'Bluecolor'. 'Holiday' is how many days a person rests per week. 'City', 'Gender', 'Worktype' and 'Holiday' are all factors; 'Happiness' and 'Employment' are numeric.

The model I want to estimate is

lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))

I left 'Employment' as a numeric variable so that when 'Employment' equals 0 (unemployed), 0:(Worktype + Holiday) = 0 and the model automatically reduces to

lm(Happiness ~ City + Gender)

for unemployed people.

However, the regression returns NA for some coefficients.

Coefficients: (2 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                       56.75      23.56   2.408    0.138
CityNew York                     -14.50      27.21  -0.533    0.647
Gender1                           -2.25      35.99  -0.063    0.956
Employment:WorktypeBluecolor      25.00      43.02   0.581    0.620
Employment:WorktypeUnemployed        NA         NA      NA       NA
Employment:WorktypeWhitecolor     57.75      35.99   1.604    0.250
Employment:Holiday1 day a week   -50.00      54.42  -0.919    0.455
Employment:Holiday2 day a week       NA         NA      NA       NA

This seems to be due to the 'Unemployed' level in the 'Worktype' and 'Holiday' variables. However, I am not sure why R does not treat Employment:WorktypeUnemployed, which is obviously 0:Worktype = 0, as zero and remove it from the model. Is this because R sets Employment:HolidayUnemployed as the baseline and the two are perfectly multicollinear? (I had to include the 'Unemployed' level in 'Worktype' and 'Holiday' because I wanted to see the effect of 'Worktype' and 'Holiday' relative to unemployed people. If I remove the 'Unemployed' level the NA disappears, but the baseline becomes 'Whitecolor' and '1 day a week', so I cannot see the effect relative to 'Unemployed'.)

If so, why am I getting NA for the coefficient of 'Employment:Holiday2 day a week'? It seems to have nothing to do with the 'Unemployed' level.

Can I rely on this result if I simply ignore the NA coefficients?

Below is reproducible code.

Happiness <- c(60, 80, 39, 40, 69, 90, 100, 30)

City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",         
                  "Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1)) # 0 = man, 1 = woman.
Employment <- c(0,1, 0, 0, 1 ,1 , 1 , 1) # 0 = unemployed, 1 = employed.
Worktype <- as.factor(c("Unemployed", "Whitecolor", "Unemployed",     
          "Unemployed", "Bluecolor", "Bluecolor", "Whitecolor","Whitecolor"))
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")

data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)

head(data,8)
str(data)

reg <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
summary(reg)

Answer 1:


You shouldn't worry about the NA values for Employment:WorktypeUnemployed. R automatically tries to compute all the interactions, but that particular coefficient remains undetermined because, clearly, it is never the case that Employment = 1 and Worktype = "Unemployed", so its column in the model matrix is all zeros. It has no effect on the other coefficients; you can verify this by coding the dummy variables manually:

> library(lme4) # for the convenient "dummy" function 
> data <- data.frame(data, 
+   dummy(Worktype, c("Bluecolor","Whitecolor")), 
+   h1=dummy(Holiday)[,1], 
+   h2=dummy(Holiday)[,2])
>   
> reg <- lm(Happiness ~ City + Gender + Employment:Bluecolor + Employment:Whitecolor  + Employment:h1 + Employment:h2 , data)
> summary(reg)

Call:
lm(formula = Happiness ~ City + Gender + Employment:Bluecolor + 
    Employment:Whitecolor + Employment:h1 + Employment:h2, data = data)

Residuals:
         1          2          3          4          5          6          7          8 
 1.775e+01  1.775e+01 -1.775e+01  8.882e-16 -1.050e+01  1.050e+01  4.441e-15 -1.775e+01 

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)              56.75      23.56   2.408    0.138
CityNew York            -14.50      27.21  -0.533    0.647
Gender1                  -2.25      35.99  -0.063    0.956
Employment:Bluecolor     25.00      43.02   0.581    0.620
Employment:Whitecolor    57.75      35.99   1.604    0.250
Employment:h1           -50.00      54.42  -0.919    0.455
Employment:h2               NA         NA      NA       NA

Residual standard error: 27.21 on 2 degrees of freedom
Multiple R-squared:  0.6798,    Adjusted R-squared:  -0.1208 
F-statistic: 0.8491 on 5 and 2 DF,  p-value: 0.619

The estimated coefficients are identical even though Employment:WorktypeUnemployed is not present anymore.
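One way to see why that coefficient is undetermined is to look at the model matrix directly. The sketch below rebuilds the question's data and lists any columns that are zero in every row. The object names `dat`, `X` and `zero_cols` are illustrative and do not come from the original post.

```r
# Rebuild the example data from the question (object names are illustrative).
dat <- data.frame(
  Happiness  = c(60, 80, 39, 40, 69, 90, 100, 30),
  City       = factor(c("New York", "Chicago", "Chicago", "New York",
                        "Chicago", "Chicago", "New York", "New York")),
  Gender     = factor(c(0, 1, 0, 1, 1, 1, 0, 1)),   # 0 = man, 1 = woman
  Employment = c(0, 1, 0, 0, 1, 1, 1, 1),           # 0 = unemployed, 1 = employed
  Worktype   = factor(c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                        "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor")),
  Holiday    = factor(c(0, 1, 0, 0, 2, 2, 2, 1),
                      labels = c("Unemployed", "1 day a week", "2 day a week"))
)

# Design matrix that lm() builds for the original formula.
X <- model.matrix(Happiness ~ City + Gender + Employment:(Worktype + Holiday),
                  data = dat)

# A column that is zero in every row carries no information,
# so lm() cannot estimate a coefficient for it and reports NA.
zero_cols <- colnames(X)[colSums(abs(X)) == 0]
print(zero_cols)
```

Because an all-zero column contributes nothing to the fitted values, dropping it leaves the remaining estimates unchanged, which matches the comparison above.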

However, Employment:h2 (equivalent to Employment:Holiday2 day a week) is still NA. This is because the model matrix is singular (i.e., one column is a linear combination of other columns):

> solve(crossprod(model.matrix(reg)))
Error in solve.default(crossprod(model.matrix(reg))) : 
  system is computationally singular: reciprocal condition number = 1.79897e-18
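The same singularity can be seen without triggering an error by comparing the rank of the model matrix with its number of columns. This is a minimal sketch: the dummy columns are built with plain comparisons rather than lme4's `dummy`, and the names `dat`, `X`, `n_cols` and `x_rank` are illustrative.

```r
# Rebuild the data with hand-coded 0/1 dummies (names are illustrative).
dat <- data.frame(
  Happiness  = c(60, 80, 39, 40, 69, 90, 100, 30),
  City       = factor(c("New York", "Chicago", "Chicago", "New York",
                        "Chicago", "Chicago", "New York", "New York")),
  Gender     = factor(c(0, 1, 0, 1, 1, 1, 0, 1)),
  Employment = c(0, 1, 0, 0, 1, 1, 1, 1),
  Worktype   = c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                 "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"),
  Holiday    = c(0, 1, 0, 0, 2, 2, 2, 1)
)
dat$Bluecolor  <- as.numeric(dat$Worktype == "Bluecolor")
dat$Whitecolor <- as.numeric(dat$Worktype == "Whitecolor")
dat$h1 <- as.numeric(dat$Holiday == 1)   # 1 day a week
dat$h2 <- as.numeric(dat$Holiday == 2)   # 2 days a week

X <- model.matrix(Happiness ~ City + Gender + Employment:Bluecolor +
                    Employment:Whitecolor + Employment:h1 + Employment:h2,
                  data = dat)

# If the rank is smaller than the number of columns, at least one column
# is a linear combination of the others and its coefficient will be NA.
n_cols <- ncol(X)
x_rank <- qr(X)$rank
cat("columns:", n_cols, " rank:", x_rank, "\n")
```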

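The dependence here is structural rather than an accident of the small sample: every employed person is either Bluecolor or Whitecolor and has either 1 or 2 days of holiday, so Employment:Bluecolor + Employment:Whitecolor and Employment:h1 + Employment:h2 both equal Employment. It follows that Employment:h2 = Employment:Bluecolor + Employment:Whitecolor - Employment:h1, and a larger dataset coded the same way would show the same NA. To remove it, drop the redundancy in the model (e.g., are there any employed people with 0 days of holiday per week? If not, 1 day should be the baseline, and you would add extra columns only for holiday levels above 1 day). You can use the alias() function to check which term causes the problem.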
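The following sketch shows one way to act on that advice, under the assumption that 1 day a week is an acceptable baseline among the employed. It finds the aliased term, checks the linear dependence on the model matrix, and refits without Employment:h1. The dummies are coded by hand instead of with lme4, and the names `dat`, `fit`, `fit2`, `aliased` and `dep_holds` are illustrative.

```r
# Rebuild the data with hand-coded 0/1 dummies (names are illustrative).
dat <- data.frame(
  Happiness  = c(60, 80, 39, 40, 69, 90, 100, 30),
  City       = factor(c("New York", "Chicago", "Chicago", "New York",
                        "Chicago", "Chicago", "New York", "New York")),
  Gender     = factor(c(0, 1, 0, 1, 1, 1, 0, 1)),
  Employment = c(0, 1, 0, 0, 1, 1, 1, 1),
  Worktype   = c("Unemployed", "Whitecolor", "Unemployed", "Unemployed",
                 "Bluecolor", "Bluecolor", "Whitecolor", "Whitecolor"),
  Holiday    = c(0, 1, 0, 0, 2, 2, 2, 1)
)
dat$Bluecolor  <- as.numeric(dat$Worktype == "Bluecolor")
dat$Whitecolor <- as.numeric(dat$Worktype == "Whitecolor")
dat$h1 <- as.numeric(dat$Holiday == 1)
dat$h2 <- as.numeric(dat$Holiday == 2)

fit <- lm(Happiness ~ City + Gender + Employment:Bluecolor +
            Employment:Whitecolor + Employment:h1 + Employment:h2, data = dat)

# Terms that lm() could not estimate, and how alias() reports them.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)
print(alias(fit))

# Check the dependence directly on the model matrix columns.
X <- model.matrix(fit)
dep_holds <- all(X[, "Employment:h2"] ==
                   X[, "Employment:Bluecolor"] + X[, "Employment:Whitecolor"] -
                   X[, "Employment:h1"])
print(dep_holds)

# Refit with "1 day a week" as the baseline among the employed:
# only the extra effect of a second holiday gets its own column.
fit2 <- lm(Happiness ~ City + Gender + Employment:Bluecolor +
             Employment:Whitecolor + Employment:h2, data = dat)
print(coef(fit2))
```

In the refitted model the work type coefficients compare employed people with 1 day of holiday against unemployed people, and Employment:h2 is the additional difference for a second day. With only 8 observations the estimates remain very imprecise, so this illustrates the coding rather than giving a usable result.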



Source: https://stackoverflow.com/questions/47976109/na-values-when-regressing-with-dummy-variable-interaction-term
