How to split data into training/testing sets using sample function

猫巷女王i 2020-11-22 10:43

I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:

sample(x, size, replace = FALSE, prob = NULL)
         


        
24 Answers
  •  执笔经年
    2020-11-22 10:57

    Beware of sample for splitting if you need reproducible results. If your data changes even slightly, the split will change too, even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you dropped just one observation, say 4, sampling by position would yield a different result, because 5 to 10 have all moved up one place.
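    The 1-to-10 example above can be sketched in a few lines of base R. This is only an illustration: the variable names are made up, and the same row positions are reused for both versions of the data, which is what a fixed seed effectively gives you.

```r
# Toy version of the example above: IDs 1..10, then the same data without ID 4
ids_full    <- 1:10
ids_dropped <- setdiff(ids_full, 4)

# Draw test rows by position once, with a fixed seed
set.seed(42)
pos <- sample(9, 5)  # 5 row positions that are valid for both vectors

test_full    <- ids_full[pos]
test_dropped <- ids_dropped[pos]

# Every position at or after 4 now points to a different ID
data.frame(position = pos, id_full = test_full, id_dropped = test_dropped)
```

    Any sampled position from 4 onwards selects a different observation once ID 4 is gone, so the two test sets diverge even though the seed and the positions are identical.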

    An alternative is to use a hash function to map IDs to pseudo-random numbers and then split on those numbers modulo some divisor. This split is more stable because assignment is determined by the hash of each observation, not by its relative position.

    For example:

    require(openssl)  # for md5
    require(data.table)  # for the demo data
    
    set.seed(1)  # this won't help `sample`
    
    population <- as.character(1e5:(1e6-1))  # some made up ID names
    
    N <- 1e4  # sample size
    
    sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
    sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1
    
    # samples are all but identical
    sample1
    sample2
    nrow(merge(sample1, sample2))
    

    [1] 9999

    # splitting by row position yields very different test sets, even though we've set the seed
    test <- sample(N-1, N/2, replace = FALSE)
    
    test1 <- sample1[test, .(id)]
    test2 <- sample2[test, .(id)]
    nrow(test1)
    

    [1] 5000

    nrow(merge(test1, test2))
    

    [1] 2653

    # to fix that, we can use a hash function and split on the first hex digit of each hash
    
    md5_bit_mod <- function(x, m = 2L) {
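      # Inputs: 
      #  x: a character vector of ids
      #  m: the modulo divisor (modify for split proportions other than 50:50;
      #     m should divide 16, i.e. 2, 4, 8 or 16, so the remainders of a
      #     single hex digit stay evenly distributed)
      # Output: remainders from dividing the first hex digit of the md5 hash of x by m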
      as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
    }
    
    # hash splitting preserves the similarity, because the assignment of test/train 
    # is determined by the hash of each obs., and not by its relative location in the data
    # which may change 
    test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
    test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
    nrow(merge(test1a, test2a))
    

    [1] 5057

    nrow(test1a)
    

    [1] 5057

    The test set size is not exactly 5000 because assignment is probabilistic, but this shouldn't be a problem in large samples: by the law of large numbers, the share stays close to the target.
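    For proportions that a single hex digit cannot express, such as 80:20, one possible variation is to turn more of the hash into a number in [0, 1) and compare it to a threshold. The sketch below is not part of the method above: hash_unif is a hypothetical helper name, and the choice of 7 hex digits and a 0.2 cutoff are assumptions made for illustration.

```r
library(openssl)  # for md5()

# Hypothetical helper (an assumption, not from the code above): map each id
# to a pseudo-uniform number in [0, 1) using the first 7 hex digits of its md5
hash_unif <- function(x) {
  h <- as.character(openssl::md5(as.character(x)))
  # 7 hex digits = 28 bits, which fits safely in an R integer
  strtoi(substr(h, 1, 7), base = 16L) / 16^7
}

ids <- as.character(1e5:(1e5 + 9999))  # made-up ids
is_test <- hash_unif(ids) < 0.2        # roughly 20% go to the test set

mean(is_test)  # close to 0.2, but not exactly

# Dropping one observation leaves every other assignment unchanged
keep <- ids[-123]
identical(hash_unif(keep) < 0.2, is_test[-123])
```

    As with md5_bit_mod, each id's assignment depends only on its own hash, so adding or removing rows never moves the remaining observations between train and test.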

    See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo
