Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of large categorical factors?
I want to achieve something similar t
Here's an approach using base R:
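set.seed(123)
# Example data: five levels, two of them rare (about 5% each)
d <- data.frame(x = sample(LETTERS[1:5], 1e5, prob = c(.4, .3, .2, .05, .05), replace = TRUE))

# Replace every level whose relative frequency is below `threshold` with `new_cat`
recat <- function(x, new_cat, threshold) {
  x <- as.character(x)
  xt <- prop.table(table(x))  # relative frequency of each level
  factor(ifelse(x %in% names(xt)[xt >= threshold], x, new_cat))
}

d$new_cat <- recat(d$x, "O", 0.1)
table(d$new_cat)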
#     A     B     C     O
# 40132 29955 19974  9939
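To see what the threshold does, here is a small sketch you can check by hand. The vector `small` and the label "Other" are made up for illustration; `recat` is repeated from above so the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained
recat <- function(x, new_cat, threshold) {
  x <- as.character(x)
  xt <- prop.table(table(x))  # relative frequency of each level
  factor(ifelse(x %in% names(xt)[xt >= threshold], x, new_cat))
}

# Hypothetical toy data: a = 60%, b = 30%, c = 10%
small <- c(rep("a", 6), rep("b", 3), "c")

# With a 20% threshold only "c" falls below the cutoff
small_new <- recat(small, "Other", 0.2)
unique(as.character(small_new))
# [1] "a"     "b"     "Other"
```

Since the comparison is `>=`, a level sitting exactly at the threshold is kept. Also note that `table()` drops `NA` by default, so any `NA` values in `x` end up in the new category.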