How do I avoid time leakage in my KNN model?

我的未来我决定 提交于 2019-12-21 18:05:24

问题


I am building a KNN model to predict housing prices. I'll go through my data and my model and then my problem.

Data -

# A tibble: 81,334 x 4
   latitude longitude close_date          close_price
      <dbl>     <dbl> <dttm>                    <dbl>
 1     36.4     -98.7 2014-08-05 06:34:00     147504.
 2     36.6     -97.9 2014-08-12 23:48:00     137401.
 3     36.6     -97.9 2014-08-09 04:00:40     239105.

Model -

library(caret)
training.samples <- data$close_price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]

model <- train(
  close_price~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)

My problem is time leakage. I am making predictions on a house using other houses that closed afterwards and in the real world I shouldn't have access to that information.

I want to apply a rule to the model that says, for each value y, only use houses that closed before the house for that y. I know I could split my test data and my train data on a certain date, but that doesn't quite do it.

Is it possible to prevent this time leakage, either in caret or other libraries for knn (like class and kknn)?


回答1:


In caret, createTimeSlices implements a variation of cross-validation adapted to time series (avoiding time leakage by rolling the forecasting origin). Documentation is here.

In your case, depending on your precise needs, you could use something like this for a proper cross-validation:

your_data <- your_data %>% arrange(close_date)

tr_ctrl <- createTimeSlices(
  your_data$close_price, 
  initialWindow  = 10, 
  horizon = 1,
  fixedWindow = FALSE)

model <- train(
  close_price~ ., data = your_data, method = "knn",
  trControl = tr_ctrl,
  preProcess = c("center", "scale"),
  tuneLength = 10
)

EDIT: if you have ties in the dates and want to having deals closed on the same day in the test and train sets, you can fix tr_ctrl before using it in train:

filter_train <- function(i_tr, i_te) {
  d_tr <- as_date(your_data$close_date[i_tr]) #using package lubridate
  d_te <- as_date(your_data$close_date[i_te])
  tr_is_ok <- d_tr < min(d_te)

  i_tr[tr_is_ok]
}

tr_ctrl$train <- mapply(filter_train, tr_ctrl$train, tr_ctrl$test)


来源:https://stackoverflow.com/questions/56229145/how-do-i-avoid-time-leakage-in-my-knn-model

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!