I am getting the following error:
c50 code called exit with value 1
I am doing this on the Titanic data available from Kaggle.
For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.
Regarding your problem, first off, I think you meant to write
new_model <- C5.0(train[,-2],train$Survived)
Next, notice the structure of the Cabin and Embarked columns. These two factors have an empty string as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that
levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"
your algorithm will now run without an error.
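To see which columns are affected before renaming anything, a small base R check along these lines may help. This is only a sketch: the helper name findBlankLevelColumns and the toy data frame are made up for illustration and are not part of C50 or the Kaggle data.

```r
# Hypothetical helper: names of the factor columns that have "" as a level.
findBlankLevelColumns <- function(df) {
  hasBlank <- vapply(
    df,
    function(col) is.factor(col) && "" %in% levels(col),
    logical(1)
  )
  names(df)[hasBlank]
}

# Toy data standing in for the Titanic training set (not the real data).
toy <- data.frame(
  Embarked = factor(c("S", "", "C")),
  Cabin    = factor(c("", "C85", "")),
  Age      = c(22, 38, 26)
)

blankCols <- findBlankLevelColumns(toy)
print(blankCols)
```

Any column reported here is a candidate for the levels(...)[1] = "missing" fix shown above.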
Here is what finally worked.
I got the idea after reading this post:
library(C50)
test$Survived <- NA
combinedData <- rbind(train,test)
combinedData$Survived <- factor(combinedData$Survived)
# fixing empty character level names
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"
new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]
new_model <- C5.0(new_train[,-2],new_train$Survived)
new_model_predict <- predict(new_model,new_test)
submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)
The intuition is that combining the two sets before fixing the level names gives the train and test data consistent factor levels.
I had the same error, but I was using a numeric dataset without missing values.
After a long time, I discovered that my dataset had a predictor column called "outcome", and C5.0Control uses this name as its label, which caused the error :'(
My solution was to rename the column. Another way would be to create a C5.0Control object, change the value of its label argument, and pass that object to C5.0 as the control parameter.
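As a rough sketch of the two options just described: the renaming part below is plain base R on a made-up data frame, and "outcome_value" is an arbitrary replacement name. The C5.0Control variant is left as comments because it needs the C50 package and a response vector (called y here as an assumption); I have not verified it against the original poster's data.

```r
# Toy data frame with a predictor named "outcome" (illustrative only).
df <- data.frame(outcome = c(1.5, 2.0, 3.2), x = c(10, 20, 30))

# Option 1: rename the clashing column.
names(df)[names(df) == "outcome"] <- "outcome_value"
print(names(df))

# Option 2 (not run here): keep the column name and change the label instead.
# "response" is an arbitrary label; y is an assumed response factor.
# ctrl  <- C50::C5.0Control(label = "response")
# model <- C50::C5.0(df, y, control = ctrl)
```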
I also struggled for some hours with the same problem (return code "1"), both when building a model and when predicting. Following the hint in Marco's answer, I wrote a small function that renames any factor level equal to "" in a data frame or vector (see the code below). However, since R does not pass arguments to functions by reference, you have to use the function's return value (it cannot change the original data frame):
removeBlankLevelsInDataFrame <- function(dataframe) {
  for (i in 1:ncol(dataframe)) {
    levels <- levels(dataframe[, i])
    if (!is.null(levels) && levels[1] == "") {
      levels(dataframe[,i])[1] = "?"
    }
  }
  dataframe
}
removeBlankLevelsInVector <- function(vector) {
  levels <- levels(vector)
  if (!is.null(levels) && levels[1] == "") {
    levels(vector)[1] = "?"
  }
  vector
}
A call to these functions may look like this:
trainX = removeBlankLevelsInDataFrame(trainX)
trainY = removeBlankLevelsInVector(trainY)
model = C50::C5.0.default(trainX,trainY)
However, it seems that C50 has a similar problem with character columns containing an empty string, so you will probably have to extend this to handle character attributes as well if you have any.
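One possible way to extend the idea to character columns is sketched below. The function name replaceBlanks and the "missing" placeholder are my own choices, not anything C50 requires, and the toy data frame is invented. Note that since R 4.0 read.csv no longer converts strings to factors by default, so text columns often arrive as character rather than factor.

```r
# Hypothetical helper: replace "" in factor levels and in character columns.
replaceBlanks <- function(df, replacement = "missing") {
  for (i in seq_along(df)) {
    col <- df[[i]]
    if (is.factor(col)) {
      # Rename every empty level, wherever it sits in the level vector.
      levels(col)[levels(col) == ""] <- replacement
    } else if (is.character(col)) {
      # Leave real NA values alone; only touch empty strings.
      col[!is.na(col) & col == ""] <- replacement
    }
    df[[i]] <- col
  }
  df
}

# Toy data: one factor column and one character column (not the real data).
toy <- data.frame(
  Embarked = factor(c("S", "", "C")),
  Cabin    = c("", "C85", NA),
  stringsAsFactors = FALSE
)

cleaned <- replaceBlanks(toy)
print(cleaned)
```

As with the functions above, the result has to be assigned back, for example trainX <- replaceBlanks(trainX).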
Just in case: you can take a look at the error with
summary(new_model)
This error also occurs when there are special characters in the name of a variable. For example, one will get this error if the name contains "я" (a letter of the Russian alphabet).
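A quick way to spot such variable names is to look for characters outside the ASCII range. The sketch below uses only base R. The helper name hasNonAscii and the toy column names are assumptions for illustration, and which characters C50 actually rejects beyond this example is not something I have checked.

```r
# Hypothetical helper: TRUE for each string containing a non-ASCII character.
hasNonAscii <- function(x) {
  vapply(
    enc2utf8(x),
    function(s) any(utf8ToInt(s) > 127L, na.rm = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Toy data frame; "\u044f" is the Cyrillic letter from the example above.
toy <- data.frame(a = 1:2, b = 3:4)
names(toy) <- c("age", "\u044f_flag")

badNames <- names(toy)[hasNonAscii(names(toy))]
print(badNames)
```

Columns flagged this way can then be renamed to plain ASCII names before the data is passed to C5.0.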
I also got the same error, but it was because of some illegal characters in the factor levels of one of the columns.
I used the make.names function to correct the factor levels:
levels(FooData$BarColumn) <- make.names(levels(FooData$BarColumn))
Then the problem was resolved.