Factor levels default to 1 and 2 in R | Dummy variable

荒凉一梦 posted on 2019-12-03 09:39:14

In short, you are just mixing up two different concepts. I will clarify them one by one in the following.


The meaning of integers you see in str()

What you see from str() is the internal representation of a factor variable. Internally, a factor is an integer vector in which each number gives the position of that element's level within levels(x). For example:

x <- gl(3, 2, labels = letters[1:3])
x
#[1] a a b b c c
#Levels: a b c

storage.mode(x)  ## or `typeof(x)`
#[1] "integer"

str(x)
# Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3

as.integer(x)
#[1] 1 1 2 2 3 3

levels(x)
#[1] "a" "b" "c"

A common use of such positions is to perform as.character(x) in the most efficient way:

levels(x)[x]
#[1] "a" "a" "b" "b" "c" "c"
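
As a quick sanity check, the sketch below confirms that indexing the levels by the factor reproduces as.character(x), and shows why the integer codes should not be mistaken for the labels. The factor f with number-like labels is a hypothetical example added for illustration; it is not part of the original question.

```r
x <- gl(3, 2, labels = letters[1:3])

# Indexing the level vector by the integer codes gives back the labels
identical(levels(x)[x], as.character(x))

# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "20", "30"))
as.integer(f)             # internal codes (positions in levels(f))
as.numeric(levels(f))[f]  # the labels themselves, converted to numbers
```

The codes only say which level each element has; any numeric meaning lives in the labels.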

Your misunderstanding of what a model matrix looks like

It seems to me that you thought a model matrix is obtained by

cbind(1L, as.integer(x))
#     [,1] [,2]
#[1,]    1    1
#[2,]    1    1
#[3,]    1    2
#[4,]    1    2
#[5,]    1    3
#[6,]    1    3

which is not true. In this fashion, you are just treating a factor variable as a numerical variable.
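
To see what goes wrong with the numeric treatment, here is a minimal sketch. The response resp is made up purely for illustration. With the integer codes as a numeric predictor, the model has a single slope, so the fitted group means are forced onto a straight line; with the factor, each non-reference level gets its own coefficient.

```r
x <- gl(3, 2, labels = letters[1:3])
# Hypothetical response, invented for this illustration only
resp <- c(1, 2, 5, 6, 3, 4)

fit_num <- lm(resp ~ as.integer(x))  # factor codes treated as a number
fit_fac <- lm(resp ~ x)              # proper factor coding

length(coef(fit_num))  # intercept + one slope
length(coef(fit_fac))  # intercept + one coefficient per non-reference level

# Only the factor fit reproduces the observed group means
tapply(resp, x, mean)
tapply(fitted(fit_fac), x, mean)
tapply(fitted(fit_num), x, mean)
```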

The model matrix is constructed this way:

xlevels <- levels(x)
cbind(1L, match(x, xlevels[2], nomatch=0), match(x, xlevels[3], nomatch=0))
#     [,1] [,2] [,3]
#[1,]    1    0    0
#[2,]    1    0    0
#[3,]    1    1    0
#[4,]    1    1    0
#[5,]    1    0    1
#[6,]    1    0    1

The 1 and 0 imply "match" / "occurrence" and "no-match" / "no-occurrence", respectively.

The R routine model.matrix will do this for you efficiently, with easy-to-read column names and row names:

model.matrix(~x)
#  (Intercept) xb xc
#1           1  0  0
#2           1  0  0
#3           1  1  0
#4           1  1  0
#5           1  0  1
#6           1  0  1
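
Another way to look at the same construction, sketched under the assumption that the default options("contrasts") is in effect (treatment contrasts for unordered factors): contrasts(x) returns the coding for each level, and the model matrix is that coding looked up row by row, plus an intercept column. The name manual is just a placeholder.

```r
x <- gl(3, 2, labels = letters[1:3])

# Coding matrix: one row per level, one column per non-reference level
contrasts(x)

# Pick each observation's row of the coding matrix, then prepend the intercept
manual <- cbind(1, contrasts(x)[as.integer(x), ])

# Same entries as the matrix built by model.matrix
all(manual == model.matrix(~ x))
```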

Write an R function to produce a model matrix ourselves

We could write a naive routine mm to generate a model matrix. Though it is much less efficient than model.matrix, it may help one digest the concept better.

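mm <- function (x, contrast = TRUE) {
  xlevels <- levels(x)
  ## one 0/1 indicator vector per level
  lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
  if (contrast) {
    ## intercept column plus indicators for all levels but the first
    do.call("cbind", c(list(1L), lst[-1]))
  } else {
    ## no intercept: indicators for every level
    do.call("cbind", lst)
  }
}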

For example, if we have a factor y with 5 levels:

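## the sample shown below comes from R < 3.6.0; newer versions draw a
## different sample unless RNGkind(sample.kind = "Rounding") is set first
set.seed(1); y <- factor(sample(1:5, 10, replace=TRUE), labels = letters[1:5])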
y
# [1] b b c e b e e d d a
#Levels: a b c d e
str(y)
#Factor w/ 5 levels "a","b","c","d",..: 2 2 3 5 2 5 5 4 4 1

Its model matrices with and without treatment contrasts are, respectively:

mm(y, TRUE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    0    0    0
# [2,]    1    1    0    0    0
# [3,]    1    0    1    0    0
# [4,]    1    0    0    0    1
# [5,]    1    1    0    0    0
# [6,]    1    0    0    0    1
# [7,]    1    0    0    0    1
# [8,]    1    0    0    1    0
# [9,]    1    0    0    1    0
#[10,]    1    0    0    0    0

mm(y, FALSE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    0    0    0
# [2,]    0    1    0    0    0
# [3,]    0    0    1    0    0
# [4,]    0    0    0    0    1
# [5,]    0    1    0    0    0
# [6,]    0    0    0    0    1
# [7,]    0    0    0    0    1
# [8,]    0    0    0    1    0
# [9,]    0    0    0    1    0
#[10,]    1    0    0    0    0

The corresponding model.matrix calls are, respectively:

model.matrix(~ y)
model.matrix(~ y - 1)
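
A short check that the hand-written routine and model.matrix agree entry by entry. The definition of mm is repeated only to keep the snippet self-contained, and y is typed out explicitly (matching the values printed above) rather than sampled, so the result does not depend on the R version's random number generator.

```r
mm <- function (x, contrast = TRUE) {
  xlevels <- levels(x)
  lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
  if (contrast) {
    do.call("cbind", c(list(1L), lst[-1]))
  } else {
    do.call("cbind", lst)
  }
}

# Same values as the y printed above, written out by hand
y <- factor(c("b", "b", "c", "e", "b", "e", "e", "d", "d", "a"),
            levels = letters[1:5])

all(mm(y, TRUE)  == model.matrix(~ y))      # with intercept and contrasts
all(mm(y, FALSE) == model.matrix(~ y - 1))  # full indicator coding
```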

R is not Stata, and you will need to unlearn a lot of what you have been taught about dummy variable construction. R does it behind the scenes for you, and you cannot make R behave exactly as Stata does. True, R has 0's and 1's in the model matrix column for the "F" level, whereas the factor's internal codes are 1 and 2 in this case. However, contrasts are always about differences, and the difference between (0,1) is the same as the difference between (1,2).
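
The remark about differences can be checked numerically. This is a minimal sketch using the same six observations as the data example that follows; code01 and code12 are hypothetical names for the two numeric codings. Shifting the coding from (0,1) to (1,2) leaves the slope unchanged and moves only the intercept.

```r
total  <- c(357, 138, 172, 272, 149, 113)
gender <- factor(c("M", "M", "F", "F", "F", "F"))  # levels: F, M

code01 <- as.integer(gender) - 1L  # 0 = F, 1 = M, as in the model matrix
code12 <- as.integer(gender)       # 1 = F, 2 = M, as displayed by str()

b01 <- coef(lm(total ~ code01))  # intercept is the F mean
b12 <- coef(lm(total ~ code12))  # intercept is the F mean minus one slope

b01
b12
```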

A data example:

dput(dat)
structure(list(total = c(357L, 138L, 172L, 272L, 149L, 113L), 
    gender = structure(c(2L, 2L, 1L, 1L, 1L, 1L), .Label = c("F", 
    "M"), class = "factor")), .Names = c("total", "gender"), row.names = c("1", 
"2", "3", "4", "5", "6"), class = "data.frame")

These two regression models have different model matrices (model matrices are how R constructs its "dummy variables").

> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderM 
      176.5        71.0 
> dat$gender=factor(dat$gender, levels=c("M","F") )
> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderF 
      247.5       -71.0 
> model.matrix(myfit)
  (Intercept) genderF
1           1       0
2           1       0
3           1       1
4           1       1
5           1       1
6           1       1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"

> dat$gender=factor(dat$gender, levels=c("F","M") )
> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderM 
      176.5        71.0 
> model.matrix(myfit)
  (Intercept) genderM
1           1       1
2           1       1
3           1       0
4           1       0
5           1       0
6           1       0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"
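
To tie the coefficients back to the data, the sketch below rebuilds dat by hand and compares the fits with the group means. It also uses relevel(), a base R alternative to calling factor() with a new levels argument, to switch the reference level.

```r
dat <- data.frame(
  total  = c(357L, 138L, 172L, 272L, 149L, 113L),
  gender = factor(c("M", "M", "F", "F", "F", "F"), levels = c("F", "M"))
)

# Group means: the intercept is the reference group's mean,
# the other coefficient is the difference from it
tapply(dat$total, dat$gender, mean)

# Reference level F
coef(lm(total ~ gender, dat))

# Switch the reference level to M; the data themselves are unchanged
dat$gender <- relevel(dat$gender, ref = "M")
coef(lm(total ~ gender, dat))
```

Either way the two fits describe the same group means; only the baseline and the sign of the difference change.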