Question
I would like to write a function in R that takes a single factor variable and a parameter n as inputs, counts the cases in each category of the factor, keeps only the n categories with the most cases, and pools all other categories into a category "other". The function must then be applied to multiple variables, keeping the 2 largest categories of each variable and pooling all remaining categories of that variable into "other".
Example:
var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle",
          "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
          "blue", "orange", "blue", "blue", "blue", "orange", "orange",
          "orange", "orange", "green", "purple")
df <- data.frame(var1, var2)
Thank you so much!
Answer 1:
forcats::fct_lump_n() exists for precisely this:
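library(forcats)
library(dplyr)

# Lump every column down to its 2 most frequent levels; the rest become "Other"
df %>%
  mutate(across(everything(), ~ fct_lump_n(.x, n = 2)))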
     var1   var2
1  square orange
2  square orange
3  square orange
4  circle orange
5  square   blue
6  square orange
7  circle   blue
8  square   blue
9  circle orange
10 circle   blue
11 circle   blue
12 circle   blue
13 square orange
14 circle orange
15  Other orange
16 circle orange
17 circle  Other
18  Other  Other
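The question asks for the pooled level to be called "other", whereas fct_lump_n() labels it "Other" by default. Below is a minimal sketch of one way to get the lowercase label, assuming the example data from the question and a forcats version that provides fct_lump_n(). The other_level argument belongs to fct_lump_n(); lump_top2 is only a name made up for this illustration.

```r
library(forcats)

var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle",
          "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
          "blue", "orange", "blue", "blue", "blue", "orange", "orange",
          "orange", "orange", "green", "purple")
df <- data.frame(var1, var2)

# lump_top2 is a hypothetical helper name, not part of forcats
lump_top2 <- function(x) {
  # keep the 2 most frequent levels and call the pooled level "other"
  fct_lump_n(x, n = 2, other_level = "other")
}

# df[] <- ... replaces every column while keeping df a data.frame
df[] <- lapply(df, lump_top2)
table(df$var1)
```

As far as I understand the default ties.method = "min", levels that tie at the cutoff are all kept, so more than n levels can survive when counts are equal. That does not matter for n = 2 on this data, but it is worth checking on your own variables.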
Answer 2:
You can do that with data.table. There is probably a more elegant way to do it, but it seems to work:
library(data.table)

# Keep the n most frequent categories of x and pool the rest into "other"
myfunc <- function(x, n = 10) {
  xvar <- data.table::data.table(x = x)
  # Count cases per category, most frequent first
  dt <- xvar[, .('count' = .N), by = "x"][order(-get('count'))]
  dt[, "category" := as.character(get("x"))]
  # Flag the n largest categories; everything else becomes "other"
  dt[, 'rk' := (seq_len(.N) <= n)]
  dt[!get('rk'), c('category') := "other"]
  # sort = FALSE keeps the original row order; the default sorts by x,
  # which misaligns the columns when the function is applied per variable
  dt <- merge(xvar, dt, by = "x", sort = FALSE)
  return(dt$category)
}
I coerce your example data frame to a data.table object:
var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle",
          "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
          "blue", "orange", "blue", "blue", "blue", "orange", "orange",
          "orange", "orange", "green", "purple")
df <- data.frame(var1, var2)
df2 <- as.data.table(df)
Then the call is quite easy (n = 3 here; use n = 2 to keep only the two largest categories, as in the question):
df2[, lapply(.SD, myfunc, n = 3)]
        var1   var2
 1:   square orange
 2:   square orange
 3:   square orange
 4:   circle orange
 5:   square   blue
 6:   square orange
 7:   circle   blue
 8:   square   blue
 9:   circle orange
10:   circle   blue
11:   circle   blue
12:   circle   blue
13:   square orange
14:   circle orange
15: triangle orange
16:   circle orange
17:   circle  green
18:    other  other
A data.table object is also a data.frame, so you don't need to coerce it back to the data.frame class.
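For comparison, the steps myfunc goes through (count the cases, pick the n largest categories, relabel the rest) can also be sketched in base R without any join. This is only an illustration under assumptions: pool_rare is a hypothetical name that appears in neither answer, NA values are left as NA, and ties at the cutoff are broken by whatever order sort() returns.

```r
# pool_rare is a hypothetical helper; it is not taken from either answer
pool_rare <- function(x, n = 2, other = "other") {
  x <- as.character(x)
  # Cases per category, largest first (table() drops NA by default)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  # Relabel element by element, so the result stays aligned with the input rows
  x[!is.na(x) & !(x %in% keep)] <- other
  factor(x, levels = c(keep, if (other %in% x) other))
}

var1 <- c("square", "square", "square", "circle", "square", "square", "circle",
          "square", "circle", "circle", "circle", "circle", "square", "circle",
          "triangle", "circle", "circle", "rectangle")
var2 <- c("orange", "orange", "orange", "orange", "blue", "orange", "blue",
          "blue", "orange", "blue", "blue", "blue", "orange", "orange",
          "orange", "orange", "green", "purple")
df <- data.frame(var1, var2)

# Apply to every column, keeping the 2 largest categories of each
df[] <- lapply(df, pool_rare, n = 2)
table(df$var1)
table(df$var2)
```

Because the relabelling works directly on the input vector, each value in the result sits in the same row as the value it came from, so the columns of df stay matched up.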
Source: https://stackoverflow.com/questions/61099087/write-a-function-in-r-to-group-factor-levels-by-frequency-then-keep-the-2-large