How to find the most frequent values across several columns containing factors

删除回忆录丶 提交于 2019-12-01 15:16:33

问题


I am still relatively new to R, so apologies in advance if my question seems too basic.

My problem is as follows:

I have a data set containing several factor variables, which have the same categories. I need to find the category, which occurs most frequently for each observation across the factor variables. In case of ties an arbitrary value can be chosen, although it would be great if I can have more control over it.

My data set contains over a hundred factors. However, the structure is something like that:

id <- 1:3
var1 <- c("red","yellow","green")
var2 <- c("red","yellow","green")
var3 <- c("yellow","orange","green")
var4 <- c("orange","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))


> df
  id   var1   var2   var3   var4
1  1    red    red yellow orange
2  2 yellow yellow orange  green
3  3  green  green  green yellow

The solution should be a variable within the data frame, for example var5, which contains the most frequent category for each row. It can be a factor or a numeric vector (in case the data need to be converted first to numeric vectors)

In this case, I would like to have this solution:

> df$var5
[1] "red"    "yellow" "green" 

Any advice will be much appreciated! Thanks in advance!


回答1:


Something like :

apply(df,1,function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green" 

In case there is a tie, which.max takes the first max value. From the which.max help page :

Determines the location, i.e., index of the (first) minimum or maximum of a numeric vector.

Ex :

var4 <- c("yellow","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))

> df
  id   var1   var2   var3   var4
1  1    red    red yellow yellow
2  2 yellow yellow orange  green
3  3  green  green  green yellow

apply(df,1,function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green" 



回答2:


If your data is quite big you might want to consider using the data.table package.

# Generate the data
nrow <- 10^5
id <- 1:nrow
colors <- c("red","yellow","green")
var1 <- sample(colors, nrow, replace = TRUE)
var2 <- sample(colors, nrow, replace = TRUE)
var3 <- sample(colors, nrow, replace = TRUE)
var4 <- sample(colors, nrow, replace = TRUE)

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

Chargaff's solution is simple and works well in some cases. You can gain a small performance improvement (~20%) using data.table.

df <- data.frame(cbind(id, var1, var2, var3, var4))
system.time(apply(df, 1, Mode))
#   user  system elapsed
#  1.242   0.018   1.264

library(data.table)
dt <- data.table(cbind(id, var1, var2, var3, var4))
system.time(melt(dt, measure = patterns('var'))[, Mode(value1), by = id])
#   user  system elapsed
#  1.020   0.012   1.034


来源:https://stackoverflow.com/questions/19982938/how-to-find-the-most-frequent-values-across-several-columns-containing-factors

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!