Stratified splitting of the data

礼貌的吻别 2020-12-15 08:12

I have a large data set and would like to fit a different logistic regression for each City, one of the columns in my data. The following 70/30 split works without considering City g

4 Answers
  • 2020-12-15 08:24

    The typical way is with split:

    lapply( split(dfrm, dfrm$City), function(dd){
                indexes <- sample(1:nrow(dd), size = 0.7*nrow(dd))
                train <- dd[indexes, ]    # Notice that you may want all columns
                test  <- dd[-indexes, ]
                # analysis goes here
                })
    

    If you were to do it in steps as you attempted above, it would be like this:

    cities <- split(data, data$city)
    
    idxs <- lapply(cities, function(d) {
        sample(1:nrow(d), size = 0.7*nrow(d))
    })
    
    # The indexes are per-city, so subset the city's data frame, not the full data:
    train <- cities[[1]][ idxs[[1]], ]  # for the first city
    test  <- cities[[1]][-idxs[[1]], ]
    

    I happen to think this is a clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
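    Putting the pieces together, a minimal runnable sketch of the split() approach; the simulated data set, the column names x and y, and the formula y ~ x are made up for illustration:

```r
# Simulated stand-in data: 100 rows per city (names are hypothetical)
set.seed(42)
dfrm <- data.frame(
  City = rep(c("A", "B"), each = 100),
  x    = rnorm(200),
  y    = rbinom(200, 1, 0.5)
)

# One 70/30 split and one logistic regression per city
models <- lapply(split(dfrm, dfrm$City), function(dd) {
  indexes <- sample(1:nrow(dd), size = floor(0.7 * nrow(dd)))
  train   <- dd[indexes, ]
  test    <- dd[-indexes, ]
  fit     <- glm(y ~ x, data = train, family = binomial)
  list(fit = fit, test = test)
})

sapply(models, function(m) nrow(m$test))  # 30 held-out rows per city
```

    Each element of models carries the per-city fit together with its held-out rows, so predictions can be checked city by city.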

  • 2020-12-15 08:25

    The package splitstackshape has a nice function stratified which can do this as well. It is a bit better than createDataPartition because it can stratify on multiple columns at once. With a single column:

    library(splitstackshape)
    set.seed(42)  # good idea to set the random seed for reproducibility
    stratified(data, c('City'), 0.7)
    

    Or with multiple columns:

    stratified(data, c('City', 'column2'), 0.7)
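    Stratifying on several columns at once amounts to sampling within every combination of their levels. A base-R sketch of that idea, with made-up column names, in case splitstackshape is not available:

```r
# Hypothetical data with two stratification columns
set.seed(42)
data <- data.frame(
  City    = rep(c("A", "B"), each = 50),
  column2 = rep(c("p", "q"), times = 50),
  value   = rnorm(100)
)

# One stratum per City/column2 combination, 70% sampled from each
strata <- split(data, interaction(data$City, data$column2))
train  <- do.call(rbind, lapply(strata, function(s) {
  s[sample(1:nrow(s), size = floor(0.7 * nrow(s))), ]
}))

table(train$City, train$column2)  # 17 rows kept from each 25-row stratum
```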
    
  • 2020-12-15 08:27

    Your code works just fine as is: if City is a column, simply refer to it in the training data as train[,2]. You can do this easily for each one with a lambda function:

    logReg <- function(ind) {
        reg <- glm(train[, ind] ~ WHATEVER)
        ....
        return(val)
    }
    

    Then run sapply over the vector of city indexes.
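    One way to read this, sketched with invented data (the City values, the formula y ~ x, and the column names are all placeholders), subsetting by city value rather than by column index:

```r
# Hypothetical training data with a City column
set.seed(42)
train <- data.frame(
  City = rep(c("A", "B", "C"), each = 40),
  x    = rnorm(120),
  y    = rbinom(120, 1, 0.5)
)

# One logistic regression per city; y ~ x stands in for the real formula
logReg <- function(city) {
  glm(y ~ x, data = train[train$City == city, ], family = binomial)
}

fits <- sapply(unique(train$City), logReg, simplify = FALSE)
sapply(fits, function(f) coef(f)["x"])  # per-city slope estimates
```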

  • 2020-12-15 08:34

    Try createDataPartition from the caret package. Its documentation states: "By default, createDataPartition does a stratified random split of the data."

    library(caret)
    train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
    train <- Data[ train.index,]
    test  <- Data[-train.index,]
    

    It can also be used for stratified k-fold cross-validation, like:

    ctrl <- trainControl(method = "repeatedcv",
                         repeats = 3,
                         ...)
    # when calling train, pass this train control
    train(...,
          trControl = ctrl,
          ...)
    

    Check out the caret documentation for more details.
