Reduce number of levels for large categorical variables

慢半拍i 2020-12-11 12:14

Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels in large categorical factors?

I want to achieve something similar t

4 answers
  •  鱼传尺愫
    2020-12-11 12:32

    Here's an approach using base R:

    set.seed(123)
    d <- data.frame(x = sample(LETTERS[1:5], 1e5, prob = c(.4, .3, .2, .05, .05), replace = TRUE))
    
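    # Collapse every level whose relative frequency is below `threshold`
    # into the single level `new_cat`.
    recat <- function(x, new_cat, threshold) {
        x <- as.character(x)
        xt <- prop.table(table(x))  # share of observations per level
        keep <- names(xt)[xt >= threshold]
        factor(ifelse(x %in% keep, x, new_cat))
    }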
    
    d$new_cat <- recat(d$x, "O", 0.1)
    table(d$new_cat)
    #     A     B     C     O 
    # 40132 29955 19974  9939 
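
Since the question also mentions Python, here is a minimal sketch of the same idea using only the standard library. The name `recat` is just a hypothetical port of the R function above, not a function from any existing package, and the sample data is made up for illustration.

```python
from collections import Counter


def recat(values, new_cat, threshold):
    """Replace every level whose share of `values` is below `threshold` with `new_cat`."""
    values = list(values)
    if not values:
        return []
    counts = Counter(values)
    total = len(values)
    # Levels frequent enough to keep as they are.
    keep = {level for level, n in counts.items() if n / total >= threshold}
    return [v if v in keep else new_cat for v in values]


# Illustrative sample (an assumption, not the data from the answer):
# A = 50%, B = 30%, C = 15%, D = 5%.
sample = ["A"] * 10 + ["B"] * 6 + ["C"] * 3 + ["D"] * 1
collapsed = recat(sample, "O", 0.1)
print(Counter(collapsed))
```

If the data already lives in a pandas Series, `Series.value_counts(normalize=True)` should give the same per-level proportions to threshold on.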
    
