Here are all the variables I'm working with:

str(ad.train)
$ Date : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24
From my experience ten minutes ago, this situation can happen when a factor has more than one level but a lot of NAs. Taking the Kaggle House Prices dataset as an example: if you load the data and run a simple regression,
train.df = read.csv('train.csv')
lm1 = lm(SalePrice ~ ., data = train.df)
you will get the same error. I also tried checking the number of levels of each factor, but none of them reported fewer than 2 levels.
cols = colnames(train.df)
for (col in cols){
  if (is.factor(train.df[[col]])){
    cat(col, ' has ', length(levels(train.df[[col]])), '\n')
  }
}
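The check above can be misleading, though: levels() reports every declared level, while lm() by default drops each row that contains an NA anywhere (na.action = na.omit), so after that row-dropping some factors can have fewer than 2 levels actually present in the data. A rough way to see this, assuming train.df loaded as above:

complete = droplevels(na.omit(train.df))   # mimic lm()'s default NA handling
nrow(complete)                             # far fewer rows than the original data
for (col in colnames(complete)){
  if (is.factor(complete[[col]]) && nlevels(complete[[col]]) < 2){
    cat(col, ' has only ', nlevels(complete[[col]]), ' level(s) left\n')
  }
}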
So after a long time I used summary(train.df) to look at the details of each column, removed some of them, and it finally worked:
train.df = subset(train.df, select=-c(Id, PoolQC,Fence, MiscFeature, Alley, Utilities))
lm1 = lm(SalePrice ~ ., data = train.df)
If you leave any one of those columns in the data (i.e. take it out of the subset() call above), the regression fails to run again with the same error (which I have tested myself).
Another way to work around this error when there are a lot of NAs is to replace each NA with the most common value (the mode) of its column. Note that the following method cannot help where NA itself is the mode of the column; for those columns I suggest dropping them, or imputing them manually and individually, rather than applying a function to the whole dataset like this:
fill.na.with.mode = function(df){
  cols = colnames(df)
  for (col in cols){
    if (is.factor(df[[col]])){
      # Most common value according to summary(). Note that when the NAs
      # outnumber every level, which.max() picks the "NA's" entry, the
      # assignment below produces a warning, and the NAs stay NA (this is
      # the limitation mentioned above).
      x = summary(df[[col]])
      mode = names(x[which.max(x)])
      df[[col]][is.na(df[[col]])] = mode
    }
    else{
      # Non-factor (numeric) columns: replace NAs with 0.
      df[[col]][is.na(df[[col]])] = 0
    }
  }
  return (df)
}
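A usage sketch (the names train.filled and lm2 are just illustrative): drop the NA-dominated columns first, since the function above cannot fix them, then fill the remaining NAs and refit.

train.filled = fill.na.with.mode(
  subset(train.df, select = -c(Id, PoolQC, Fence, MiscFeature, Alley, Utilities)))
lm2 = lm(SalePrice ~ ., data = train.filled)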
The attributes removed above generally have 1400+ NAs and only around 10 useful values, so you might want to remove these garbage attributes even though they have 3 or 4 levels. A function counting how many NAs there are in each column will help, for example:
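A quick sketch along those lines (assuming train.df from above):

# Count the NAs in each column and list the worst offenders first.
na.counts = sort(colSums(is.na(train.df)), decreasing = TRUE)
head(na.counts, 10)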