How to find the most frequent values across several columns containing factors

Question

I am still relatively new to R, so apologies in advance if my question seems too basic.

My problem is as follows:

I have a data set containing several factor variables that all share the same categories. For each observation, I need to find the category that occurs most frequently across the factor variables. In case of ties, an arbitrary value can be chosen, although it would be great if I could have more control over it.

My data set contains over a hundred factors. However, the structure is something like this:

id <- 1:3
var1 <- c("red","yellow","green")
var2 <- c("red","yellow","green")
var3 <- c("yellow","orange","green")
var4 <- c("orange","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))


> df
  id   var1   var2   var3   var4
1  1    red    red yellow orange
2  2 yellow yellow orange  green
3  3  green  green  green yellow

The solution should be a variable within the data frame, for example var5, which contains the most frequent category for each row. It can be a factor or a numeric vector (in case the data need to be converted to numeric vectors first).

In this case, I would like to have this solution:

> df$var5
[1] "red"    "yellow" "green"

Any advice will be much appreciated! Thanks in advance!

Answer 1:

Something like this:

# drop the id column so that it is not counted as a value
apply(df[, -1], 1, function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green"

In case of a tie, which.max takes the first maximum. Because table() sorts the categories, that is the alphabetically first of the tied values. From the which.max help page:

Determines the location, i.e., index of the (first) minimum or maximum of a numeric vector.

Example:

var4 <- c("yellow","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))

> df
  id   var1   var2   var3   var4
1  1    red    red yellow yellow
2  2 yellow yellow orange  green
3  3  green  green  green yellow

apply(df[, -1], 1, function(x) names(which.max(table(x))))
[1] "red"    "yellow" "green"
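If you want more control over ties, one option is to fix the order of the categories yourself before tabulating. The following is only a sketch: row_mode and priority are names I made up for illustration, and it assumes every value in the data appears in priority. Since table() keeps the level order of a factor, which.max then returns the tied category that comes first in priority.

```r
# Hypothetical helper (not from the original answer): most frequent value in x,
# with ties resolved in favour of whichever category comes first in `priority`.
row_mode <- function(x, priority) {
  counts <- table(factor(x, levels = priority))  # counts in priority order
  names(which.max(counts))                       # first maximum wins
}

df <- data.frame(
  id   = 1:3,
  var1 = c("red", "yellow", "green"),
  var2 = c("red", "yellow", "green"),
  var3 = c("yellow", "orange", "green"),
  var4 = c("yellow", "green", "yellow")
)

# Assumed preference order for ties: yellow before red before green before orange
priority <- c("yellow", "red", "green", "orange")

df$var5 <- apply(df[, -1], 1, row_mode, priority = priority)
unname(df$var5)
# [1] "yellow" "yellow" "green"
```

Row 1 has two "red" and two "yellow"; with this priority the tie goes to "yellow" instead of "red".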

Answer 2:

If your data is quite big, you might want to consider using the data.table package.

# Generate the data
nrow <- 10^5
id <- 1:nrow
colors <- c("red","yellow","green")
var1 <- sample(colors, nrow, replace = TRUE)
var2 <- sample(colors, nrow, replace = TRUE)
var3 <- sample(colors, nrow, replace = TRUE)
var4 <- sample(colors, nrow, replace = TRUE)

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}
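A quick check of how this Mode function treats ties may be useful, because it differs from the table() approach: unique() keeps values in order of first appearance, so the tied value that appears first in the row wins. The input vectors below are made up for illustration.

```r
# Same Mode function as above, repeated so the snippet runs on its own
Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

# Tie between "yellow" and "red": the value seen first is returned
Mode(c("yellow", "red", "red", "yellow"))
# [1] "yellow"

# The table()-based version sorts the categories, so it returns "red" here
names(which.max(table(c("yellow", "red", "red", "yellow"))))
# [1] "red"

# No tie: the most frequent value is returned
Mode(c("green", "green", "green", "yellow"))
# [1] "green"
```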

Chargaff's solution is simple and works well in some cases. You can gain a small performance improvement (~20%) using data.table.

df <- data.frame(cbind(id, var1, var2, var3, var4))
system.time(apply(df, 1, Mode))
#   user  system elapsed
#  1.242   0.018   1.264

library(data.table)
dt <- data.table(cbind(id, var1, var2, var3, var4))
system.time(melt(dt, id.vars = "id")[, Mode(value), by = id])
#   user  system elapsed
#  1.020   0.012   1.034
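Another possibility, which is not part of either answer, is to skip the per-row function call entirely: count each category per row with rowSums and pick the column with the largest count using base R's max.col. This is only a sketch with smaller simulated data; the names vars, counts and modes are mine, and I have not timed it against the versions above, so whether it is faster on your data would need to be measured. A side benefit is that max.col has a ties.method argument ("first", "last" or "random").

```r
set.seed(1)  # only to make the simulated data reproducible
n <- 1000
colors <- c("red", "yellow", "green")
vars <- data.frame(
  var1 = sample(colors, n, replace = TRUE),
  var2 = sample(colors, n, replace = TRUE),
  var3 = sample(colors, n, replace = TRUE),
  var4 = sample(colors, n, replace = TRUE)
)

# One column per category: how many times it occurs in each row
counts <- sapply(colors, function(cl) rowSums(vars == cl))

# Column index of the largest count in each row;
# with ties.method = "first", ties go to the category listed first in `colors`
modes <- colors[max.col(counts, ties.method = "first")]

head(modes)
```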

Source: https://stackoverflow.com/questions/19982938/how-to-find-the-most-frequent-values-across-several-columns-containing-factors

Tags

mode

factors