Reduce number of levels for large categorical variables

慢半拍i 2020-12-11 12:14

Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of large categorical factors?

I want to achieve something similar t

4 answers
  • 2020-12-11 12:32

    Here's an approach using base R:

    set.seed(123)
    d <- data.frame(x = sample(LETTERS[1:5], 1e5, prob = c(.4, .3, .2, .05, .05), replace = TRUE))
    
    recat <- function(x, new_cat, threshold) {
        x <- as.character(x)
        xt <- prop.table(table(x))
        factor(ifelse(x %in% names(xt)[xt >= threshold], x, new_cat))
    }
    
    d$new_cat <- recat(d$x, "O", 0.1)
    table(d$new_cat)
    #     A     B     C     O 
    # 40132 29955 19974  9939 
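
Since the question also asks about Python, here is a rough standard-library sketch of the same proportion-threshold idea. The function name `lump_rare` and its parameters are made up for illustration and do not come from any package.

```python
from collections import Counter

def lump_rare(values, threshold=0.1, other="Other"):
    """Replace levels whose relative frequency is below `threshold` with `other`.

    Hypothetical helper: name and signature are assumptions, not a library API.
    """
    counts = Counter(values)
    total = len(values)
    # Levels frequent enough to keep as they are
    keep = {level for level, n in counts.items() if n / total >= threshold}
    return [v if v in keep else other for v in values]

# Toy data: A = 40%, B = 30%, C = 20%, D = 10%
data = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"]
lumped = lump_rare(data, threshold=0.15, other="O")
print(Counter(lumped))
```

With a threshold of 0.15 only D falls below the cutoff in this toy data, so it is the only level recoded to "O".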
    
  • 2020-12-11 12:37

    Here is an example in R that uses data.table a little, but it should be just as easy without data.table.

    # Load data.table
    library(data.table)
    
    # Some data
    set.seed(1)
    dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                     weight = rnorm(n = 10e3, mean = 70, sd = 20))
    
    # Decide the minimum frequency a level needs...
    min.freq <- 3350
    
    # Levels that don't meet the minimum frequency (using data.table)
    fail.min.f <- dt[, .N, by = type][N < min.freq, type]
    
    # Rename all these levels to "Other" (levels are matched by name)
    levels(dt$type)[levels(dt$type) %in% fail.min.f] <- "Other"
    
  • 2020-12-11 12:45

    I do not think you want to do it this way. Lumping many levels into a single group can make the feature less predictive. Instead, cluster the levels that would otherwise go into Other using a similarity metric. Some of them may cluster with your top-K levels, and others may cluster with each other, which can give better performance.

    I had a similar issue and ended up answering it myself here. For my similarity metric I used the proximity matrix from a random forest regression fit on all features except that one. The difference in my solution is that some of my top-K most common levels may be clustered together, since I use k-medoids to cluster. You would want to alter the clustering algorithm so that your medoids are the top-K levels you have chosen.
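
If the medoids are fixed to the top-K levels, k-medoids reduces to one assignment step: each remaining level goes to its nearest top-K level. Below is a minimal Python sketch of that step only. The function `assign_to_top_k`, the `level_mean` table, and the absolute-difference distance are all stand-ins invented for illustration; they are not the random forest proximity described above, which would be plugged in as `distance` instead.

```python
from collections import Counter

def assign_to_top_k(values, k, distance):
    """Map every level to its nearest level among the k most frequent ones.

    `distance(a, b)` is an assumed user-supplied dissimilarity between levels.
    Hypothetical helper, not part of any library.
    """
    counts = Counter(values)
    top = [level for level, _ in counts.most_common(k)]
    # The top-K levels act as fixed medoids and map to themselves
    mapping = {level: level for level in top}
    for level in counts:
        if level not in mapping:
            mapping[level] = min(top, key=lambda m: distance(level, m))
    return [mapping[v] for v in values], mapping

# Toy stand-in for a similarity metric: a made-up summary value per level
level_mean = {"A": 1.0, "B": 5.0, "C": 1.2, "D": 4.7}

def toy_distance(a, b):
    return abs(level_mean[a] - level_mean[b])

values = ["A"] * 5 + ["B"] * 4 + ["C"] * 2 + ["D"]
recoded, mapping = assign_to_top_k(values, k=2, distance=toy_distance)
print(mapping)
```

In this toy data C is closest to A and D is closest to B, so no catch-all Other level is created.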

  • 2020-12-11 12:47

    The R package forcats has fct_lump() for this purpose.

    library(forcats)
    fct_lump(f, n)
    

    Here f is the factor and n is the number of most common levels to preserve. The remaining levels are recoded to Other.
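
For Python, a rough standard-library analogue of keeping the n most common levels could look like the sketch below. The name `lump_top_n` is hypothetical, and tie handling is an assumption: `Counter.most_common` breaks ties by first appearance, which may differ from what fct_lump() does.

```python
from collections import Counter

def lump_top_n(values, n, other="Other"):
    """Keep the n most common levels and recode everything else to `other`.

    Hypothetical helper: not a port of forcats, only the same basic idea.
    """
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(n)}
    return [v if v in keep else other for v in values]

values = list("aaaabbbccd")
top2 = lump_top_n(values, n=2)
print(Counter(top2))
```

Here "a" and "b" are kept, while "c" and "d" are merged into Other.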
