Write a function in R to group factor levels by frequency, then keep the 2 largest categories and pool the rest in “other” [closed]

Question

I would like to write a function in R that takes a single factor variable and a parameter n as inputs, counts the cases per category, keeps only the n categories with the most cases, and pools all other categories into a category "other". The function must be applied to multiple variables, keeping the 2 largest categories of each variable and pooling the rest into "other".

Example:

var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
"square", "circle", "circle", "circle", "circle", "square", "circle", "triangle", "circle", "circle", "rectangle")

var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
"blue", "orange", "blue", "blue", "blue", "orange", "orange", "orange", "orange", "green", "purple")

df <- data.frame(var1, var2)

Thank you so much!

Answer 1:

forcats::fct_lump_n() exists for precisely this:

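library(forcats)
library(dplyr)

# Keep the 2 most frequent levels of every column and lump the rest.
# The lumped level is labelled "Other" by default; pass
# other_level = "other" to fct_lump_n() for a lowercase label.
df %>%
  mutate(across(everything(), ~ fct_lump_n(.x, n = 2)))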

     var1   var2
1  square orange
2  square orange
3  square orange
4  circle orange
5  square   blue
6  square orange
7  circle   blue
8  square   blue
9  circle orange
10 circle   blue
11 circle   blue
12 circle   blue
13 square orange
14 circle orange
15  Other orange
16 circle orange
17 circle  Other
18  Other  Other
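
If you would rather not depend on forcats, the same idea can be sketched in base R. This is only a minimal illustration, not how fct_lump_n() is implemented: lump_top_n is a hypothetical helper name, and ties at the cutoff are simply resolved by the order that sort() returns.

```r
# Hypothetical helper (not from any package): keep the n most frequent
# values of x and replace every other value with `other`.
lump_top_n <- function(x, n = 2, other = "other") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)      # cases per category, largest first
  keep <- head(names(counts), n)                   # the n most frequent categories
  x[!is.na(x) & !(x %in% keep)] <- other           # pool the rest, leave NA untouched
  factor(x, levels = c(keep, if (other %in% x) other))
}

var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle",
          "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
          "blue", "orange", "blue", "blue", "blue", "orange", "orange",
          "orange", "orange", "green", "purple")
df <- data.frame(var1, var2)

# Apply to every column while keeping the data.frame structure
df[] <- lapply(df, lump_top_n, n = 2)

table(df$var1)   # circle 9, square 7, other 2
```

Assigning into df[] keeps the result a data.frame, and the original row order is preserved because values are relabelled in place.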

Answer 2:

You can do that with data.table. There is probably a more elegant way to do it, but this seems to work:

library(data.table)

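myfunc <- function(x, n = 10){

  # One-column data.table holding the input values (column "x")
  xvar <- data.table::as.data.table('x' = x)
  # Count the cases per category, most frequent first
  dt <- xvar[,.('count' = .N), by = "x"][order(-get('count'))]

  # Keep the label of the n most frequent categories, relabel the rest
  dt[, "category" := as.character(get("x"))]
  dt[, 'rk' := (seq_len(.N)<=n)]
  dt[!get('rk'), c('category') := "other"]

  # Map the labels back onto the data; merge() sorts by "x" by default
  dt <- merge(xvar,dt, by = "x")

  return(dt$category)
}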

First, coerce your example data frame to a data.table object:

var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle", "triangle", "circle", "circle", "rectangle")

var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
          "blue", "orange", "blue", "blue", "blue", "orange", "orange", "orange", "orange", "green", "purple")

df <- data.frame(var1, var2)

df2 <- as.data.table(df)

Then the call is simple (shown here with n = 3; use n = 2 to keep only the two largest categories):

df2[,lapply(.SD, myfunc, n = 3)]

        var1   var2
 1:   circle   blue
 2:   circle   blue
 3:   circle   blue
 4:   circle   blue
 5:   circle   blue
 6:   circle   blue
 7:   circle  green
 8:   circle orange
 9:   circle orange
10:    other orange
11:   square orange
12:   square orange
13:   square orange
14:   square orange
15:   square orange
16:   square orange
17:   square orange
18: triangle  other

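A data.table is also a data.frame, so you don't need to coerce the result back to the data.frame class. Note that the rows in the output above are sorted by category rather than kept in their original order, because merge() sorts on the "by" column by default.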
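
If the original row order matters (for example, to keep var1 and var2 aligned row by row), one option is to skip the merge and relabel the values in place. The following is a minimal sketch under that assumption; myfunc_ordered and its other argument are hypothetical names, not part of data.table.

```r
library(data.table)

# Hypothetical variant of myfunc: same pooling logic, but the result keeps
# the original row order because no merge() is involved.
myfunc_ordered <- function(x, n = 10, other = "other") {
  dt <- data.table(value = as.character(x))
  counts <- dt[, .N, by = value][order(-N)]   # one row per category, largest first
  keep <- head(counts$value, n)               # the n most frequent categories
  dt[!(value %in% keep), value := other]      # relabel everything else in place
  dt$value
}

var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle",
          "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
          "blue", "orange", "blue", "blue", "blue", "orange", "orange",
          "orange", "orange", "green", "purple")
df2 <- as.data.table(data.frame(var1, var2))

# Keep the 2 largest categories of every column
res <- df2[, lapply(.SD, myfunc_ordered, n = 2)]
```

With n = 2, rows 15 and 18 of var1 and rows 17 and 18 of var2 become "other", and every other row keeps its original value and position.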

Source: https://stackoverflow.com/questions/61099087/write-a-function-in-r-to-group-factor-levels-by-frequency-then-keep-the-2-large

Tags

function

group-by