How to find the most frequent values across several columns containing factors

爱⌒轻易说出口 提交于 2019-12-01 16:33:59

Something like :

apply(df,1,function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green" 

In case there is a tie, which.max takes the first max value. From the which.max help page :

Determines the location, i.e., index of the (first) minimum or maximum of a numeric vector.

Ex :

var4 <- c("yellow","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))

> df
  id   var1   var2   var3   var4
1  1    red    red yellow yellow
2  2 yellow yellow orange  green
3  3  green  green  green yellow

apply(df,1,function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green" 

If your data is quite big you might want to consider using the data.table package.

# Generate the data
nrow <- 10^5
id <- 1:nrow
colors <- c("red","yellow","green")
var1 <- sample(colors, nrow, replace = TRUE)
var2 <- sample(colors, nrow, replace = TRUE)
var3 <- sample(colors, nrow, replace = TRUE)
var4 <- sample(colors, nrow, replace = TRUE)

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

Chargaff's solution is simple and works well in some cases. You can gain a small performance improvement (~20%) using data.table.

df <- data.frame(cbind(id, var1, var2, var3, var4))
system.time(apply(df, 1, Mode))
#   user  system elapsed
#  1.242   0.018   1.264

library(data.table)
dt <- data.table(cbind(id, var1, var2, var3, var4))
system.time(melt(dt, measure = patterns('var'))[, Mode(value1), by = id])
#   user  system elapsed
#  1.020   0.012   1.034
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!