How to create dummy variables?

梦谈多话 2020-12-07 02:55

I have a variable that is a factor:

 $ year           : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...

I would like to c

3 Answers
  • 2020-12-07 03:16

    The caret package provides a very simple function, dummyVars, for creating dummy variables. It is especially handy when you have more than one factor variable. You do have to make sure the target variables are factors: if, for example, Sales$year is numeric, convert it first with Sales$year <- as.factor(Sales$year).

    Suppose we have the original dataset 'Sales' as follows:

        year    Sales       Region
    1   2010    3695.543    North
    2   2010    9873.037    West
    3   2008    3579.458    West
    4   2005    2788.857    North
    5   2005    2952.183    North
    6   2008    7255.337    West
    7   2005    5237.081    West
    8   2010    8987.096    North
    9   2008    5545.343    North
    10  2008    1809.446    West
    

    Now we can create the dummy variables for both factors (year and Region) in one step:

    library(lattice)
    library(ggplot2)
    library(caret)
    Salesdummy <- dummyVars(~ ., data = Sales, levelsOnly = TRUE)
    Sdummy <- predict(Salesdummy, Sales)
    

    The outcome will be:

       2005 2008 2010   Sales    RegionNorth    RegionWest
    1   0    0    1   3695.543       1              0
    2   0    0    1   9873.037       0              1
    3   0    1    0   3579.458       0              1
    4   1    0    0   2788.857       1              0
    5   1    0    0   2952.183       1              0
    6   0    1    0   7255.337       0              1
    7   1    0    0   5237.081       0              1
    8   0    0    1   8987.096       1              0
    9   0    1    0   5545.343       1              0 
    10  0    1    0   1809.446       0              1
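For readers without caret installed, a comparable result can be sketched in base R. This is a swapped-in technique, not what dummyVars does internally: it passes full (identity) contrasts for every factor to model.matrix. The data frame below is retyped from the table above, with year and Region assumed to be factors; the names is_fac, full_contrasts and dummies are illustrative only. Note that the column names keep the variable prefix (year2005 rather than 2005).

```r
# Example data retyped from the 'Sales' table (year and Region assumed to be factors)
Sales <- data.frame(
  year   = factor(c(2010, 2010, 2008, 2005, 2005, 2008, 2005, 2010, 2008, 2008)),
  Sales  = c(3695.543, 9873.037, 3579.458, 2788.857, 2952.183,
             7255.337, 5237.081, 8987.096, 5545.343, 1809.446),
  Region = factor(c("North", "West", "West", "North", "North",
                    "West", "West", "North", "North", "West"))
)

# Identity contrasts: one indicator column per level, no reference level dropped
is_fac <- sapply(Sales, is.factor)
full_contrasts <- lapply(Sales[is_fac], contrasts, contrasts = FALSE)

# "- 1" drops the intercept; numeric columns such as Sales pass through unchanged
dummies <- model.matrix(~ . - 1, data = Sales, contrasts.arg = full_contrasts)
colnames(dummies)
```

Because no level is dropped, the indicator columns of each factor sum to one in every row, so this matrix is not suitable as-is for a regression that also includes an intercept.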
    
  • 2020-12-07 03:19

    This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of the columns is pretty deeply hard-coded, and I don't see any way to override it within model.matrix, hence the setNames call:

    options(na.action=na.pass)
    dt1 <- data.frame(year=factor(c(NA,2003:2005)))
    dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
                  c("year",levels(dt1$year)))
    

    Note that you may run into trouble in some contexts with column names, such as 2003, that are not legal R variable names. The result:

      year 2003 2004 2005
    1 <NA>   NA   NA   NA
    2 2003    1    0    0
    3 2004    0    1    0
    4 2005    0    0    1
    
  • 2020-12-07 03:24

    You could use ifelse(), which won't omit NA rows (though I guess you might not count it as being "as concise as possible"):

    dt1 <- data.frame(year=factor(rep(2003:2010, 10)))  # example data
    
    dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
    dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
    dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
    # ...    
    
    head(dt1)
    #   year yr2003 yr2004 yr2005
    # 1 2003      1      0      0
    # 2 2004      0      1      0
    # 3 2005      0      0      1
    # 4 2006      0      0      0
    # 5 2007      0      0      0
    # 6 2008      0      0      0
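The repeated within() calls can be collapsed into a loop over the factor levels. This is a minimal sketch of the same ifelse() idea, not part of the original answer; the example data adds one NA so that the NA handling is visible, and the yr prefix follows the naming used above.

```r
# Example data: one NA followed by two cycles of 2003:2010
dt1 <- data.frame(year = factor(c(NA, rep(2003:2010, 2))))

# One 0/1 column per level; rows where year is NA get NA in every dummy column
for (lv in levels(dt1$year)) {
  dt1[[paste0("yr", lv)]] <- ifelse(dt1$year == lv, 1, 0)
}

head(dt1, 3)
```

The NA row is kept rather than dropped, but its dummy values are NA, not 0, because comparing NA with a level yields NA.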
    