Collapsing factor levels for all factor variables in a data frame based on count

Question

I would like to keep only the top 2 factor levels by frequency and group all other levels into Other. I tried this, but it doesn't work.

df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))), 
              b=as.factor(c(rep('A',5),rep('B',5))), 
              c=as.factor(c(rep('A',3),rep('B',5),rep('C',2)))) 

myfun=function(x){
    if(is.factor(x)){
        levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'  
    }
}

df=as.data.frame(lapply(df, myfun))

Expected Output

       a b      c
       D A      A
       D A      A
       D A      A
       B A      B
       B A      B
       B B      B
       B B      B
       B B      B
  others B others
  others B others

Answer 1:

This might get a bit messy, but here is one approach via base R:

fun1 <- function(x) {
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}

However, the function above only works if the levels are already ordered by frequency, so they need to be reordered first. As the OP states in a comment, the correct version is:

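fun1 <- function(x) {
  # Reorder the levels from most to least frequent
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  # Keep the two most frequent level names; rename every remaining level to 'others'
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}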
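
To apply this to every factor column, as the question asks, one option is to loop over the columns with lapply() and leave non-factor columns untouched. The following is only a minimal sketch: the helper name collapse_top2 and the is.factor() guard are illustrative additions, not part of the original answer, and it assumes each factor has at least two levels.

```r
# Sample data from the question
df <- data.frame(a = factor(c(rep('D', 3), rep('B', 5), rep('C', 2))),
                 b = factor(c(rep('A', 5), rep('B', 5))),
                 c = factor(c(rep('A', 3), rep('B', 5), rep('C', 2))))

# Hypothetical helper: keep the two most frequent levels and
# rename all remaining levels to 'others'.
# Assumes every factor passed in has at least two levels.
collapse_top2 <- function(x) {
  if (!is.factor(x)) return(x)        # leave non-factor columns as they are
  by_freq <- names(sort(table(x), decreasing = TRUE))
  x <- factor(x, levels = by_freq)    # reorder levels, most frequent first
  levels(x) <- c(by_freq[1:2], rep('others', length(by_freq) - 2))
  x
}

# Assigning into df[] keeps the data frame structure
df[] <- lapply(df, collapse_top2)
```

Unlike the attempt in the question, the helper returns the modified vector, which is what lapply() needs in order to rebuild the columns.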

Answer 2:

This is now easy thanks to fct_lump() from the forcats package.

fct_lump(df$a, n = 2)

# [1] D     D     D     B     B     B     B     B     Other Other
# Levels: B D Other

The argument n controls the number of most common levels to be preserved, lumping together the others.
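
To cover every factor column in the data frame, fct_lump() can be combined with lapply(). The sketch below assumes the forcats package is installed and uses the other_level argument to label the lumped group 'others', matching the expected output in the question. The helper name lump_top2 is made up for illustration. Newer forcats releases also provide fct_lump_n() for the same purpose.

```r
library(forcats)

# Sample data from the question
df <- data.frame(a = factor(c(rep('D', 3), rep('B', 5), rep('C', 2))),
                 b = factor(c(rep('A', 5), rep('B', 5))),
                 c = factor(c(rep('A', 3), rep('B', 5), rep('C', 2))))

# Hypothetical helper: lump all but the two most common levels of a factor,
# passing non-factor columns through unchanged
lump_top2 <- function(x) {
  if (is.factor(x)) fct_lump(x, n = 2, other_level = 'others') else x
}

# Assigning into df[] keeps the data frame structure
df[] <- lapply(df, lump_top2)
```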

Source: https://stackoverflow.com/questions/38788682/collapsing-factor-level-for-all-the-factor-variable-in-dataframe-based-on-the-co

Tags

lapply
