How to debug “contrasts can be applied only to factors with 2 or more levels” error?

后悔当初 2020-11-21 23:32

Here are all the variables I'm working with:

str(ad.train)
$ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24


        
3 Answers
  •  感动是毒
    2020-11-21 23:56

    In my experience from ten minutes ago, this error can occur even when a factor has more than one category, if the data contain a lot of NAs. Taking the Kaggle House Prices dataset as an example, if you load the data and run a simple regression,

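    # stringsAsFactors = TRUE is needed on R >= 4.0.0, where read.csv no longer creates factors by default
    train.df = read.csv('train.csv', stringsAsFactors = TRUE)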
    lm1 = lm(SalePrice ~ ., data = train.df)
    

    you will get the same error. I also checked the number of levels of each factor, but none of them reports fewer than 2 levels.

    cols = colnames(train.df)
    for (col in cols){
      if(is.factor(train.df[[col]])){
        cat(col, ' has ', length(levels(train.df[[col]])), '\n')
      }
    }
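
    The loop above counts the levels declared on the raw columns. By default lm() drops every row that has an NA in any model variable before it builds the contrasts, so a factor can show 2 or more levels here and still end up with fewer than 2 in the fitted data. A minimal sketch of a check that looks at the rows left after na.omit() follows; the helper name levels_after_na_omit and the toy data are made up for illustration.

    ```r
    # Hypothetical helper: count the factor levels that are still in use
    # once incomplete rows are dropped, as lm() does by default.
    levels_after_na_omit <- function(df) {
      complete <- na.omit(df)
      is_fac <- vapply(complete, is.factor, logical(1))
      vapply(complete[is_fac], function(x) nlevels(droplevels(x)), integer(1))
    }

    # Toy data (invented): g has two declared levels,
    # but the rows where g == "b" all have a missing x.
    toy <- data.frame(
      y = c(1, 2, 3, 4),
      g = factor(c("a", "a", "b", "b")),
      x = c(0.5, 1.5, NA, NA)
    )

    nlevels(toy$g)             # levels declared on the raw column
    levels_after_na_omit(toy)  # levels left among the complete rows
    ```

    Any factor that comes back with fewer than 2 levels here is a likely candidate for the error.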
    

    After a long time, I used summary(train.df) to inspect each column in detail, removed some of them, and it finally worked:

    train.df = subset(train.df, select = -c(Id, PoolQC, Fence, MiscFeature, Alley, Utilities))
    lm1 = lm(SalePrice ~ ., data = train.df)
    

    If I keep any one of these columns in the data, the regression fails again with the same error (I have tested this myself).

    Another way to deal with this error when there are a lot of NAs is to replace each NA with the most common value of its column. Note that the following method does not help when NA itself is the most common value of a column. For those columns I suggest dropping them or substituting values manually, one column at a time, rather than applying a function to the whole dataset like this:

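    fill.na.with.mode = function(df){
        for (col in colnames(df)){
            if(is.factor(df[[col]])){
                # summary() on a factor counts each level and also the NAs
                x = summary(df[[col]])
                # name of the most frequent entry (not a real level when NAs dominate)
                most.common = names(x)[which.max(x)]
                df[[col]][is.na(df[[col]])] = most.common
            }
            else{
                # non-factor columns: replace NAs with 0
                df[[col]][is.na(df[[col]])] = 0
            }
        }
        return (df)
    }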
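
    The limitation with NA-dominated columns comes from summary(), which includes the NA count when it tallies a factor. As a sketch of one possible variation (not part of the method above), table() leaves NAs out by default, so the mode can be taken over the observed values only. The helper name observed_mode and the sample values are made up for illustration.

    ```r
    # Hypothetical helper: most frequent observed value, ignoring NAs.
    observed_mode <- function(x) {
      counts <- table(x)  # table() excludes NAs by default
      if (length(counts) == 0 || all(counts == 0)) return(NA_character_)
      names(counts)[which.max(counts)]
    }

    # Invented sample: NA is the most frequent entry.
    f <- factor(c("Gd", NA, NA, NA, "Ex", "Gd"))

    names(which.max(summary(f)))  # the summary()-based mode picks the NA count
    observed_mode(f)              # mode over the observed values only

    # Fill the NAs with the observed mode
    f[is.na(f)] <- observed_mode(f)
    ```

    Filling a column that is almost entirely NA with a value seen only a handful of times is rarely meaningful, so dropping such columns is probably still the safer choice.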
    

    The attributes removed above generally have 1400+ NAs and only about 10 useful values, so you may want to drop such garbage attributes even if they have 3 or 4 levels. I guess a function that counts the NAs in each column would help.
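
    A counting function of that kind could look roughly like this. It is a minimal sketch: the name na_summary, the toy data, and the 0.5 cutoff are assumptions chosen for illustration.

    ```r
    # Hypothetical helper: NA count and NA share per column, worst first.
    na_summary <- function(df) {
      n_na <- colSums(is.na(df))
      out <- data.frame(
        column = names(n_na),
        n_na = as.integer(n_na),
        share = as.numeric(n_na) / nrow(df),
        stringsAsFactors = FALSE
      )
      out[order(-out$n_na), ]
    }

    # Toy data (invented)
    toy <- data.frame(
      a = c(1, NA, 3, NA),
      b = factor(c("x", NA, NA, NA)),
      c = 1:4
    )

    res <- na_summary(toy)
    res

    # Columns whose NA share exceeds an arbitrary cutoff (0.5 is an assumption)
    res$column[res$share > 0.5]
    ```

    On the real data this would be called as na_summary(train.df), and the flagged columns could then be passed to subset() as in the earlier snippet.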
