Question
I am still relatively new to R, so apologies in advance if my question seems too basic.
My problem is as follows:
I have a data set containing several factor variables that all share the same categories. For each observation, I need to find the category that occurs most frequently across those factor variables. In case of ties an arbitrary value can be chosen, although it would be great to have more control over which one.
My data set contains over a hundred factors, but the structure is something like this:
id <- 1:3
var1 <- c("red","yellow","green")
var2 <- c("red","yellow","green")
var3 <- c("yellow","orange","green")
var4 <- c("orange","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))
> df
  id   var1   var2   var3   var4
1  1    red    red yellow orange
2  2 yellow yellow orange  green
3  3  green  green  green yellow
The solution should be a variable within the data frame, for example var5, which contains the most frequent category for each row. It can be a factor or a numeric vector (in case the data first need to be converted to numeric vectors).
In this case, I would like to have this solution:
> df$var5
[1] "red" "yellow" "green"
Any advice will be much appreciated! Thanks in advance!
Answer 1:
Something like this, dropping the id column first so that it is not counted as a category:
apply(df[-1],1,function(x) names(which.max(table(x))))
[1] "red" "yellow" "green"
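If there is a tie, which.max returns the first maximum. Because table() orders its counts by sorted value (alphabetically for character data), the tied category that sorts first wins. From the which.max help page:
Determines the location, i.e., index of the (first) minimum or maximum of a numeric vector.
Example: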
var4 <- c("yellow","green","yellow")
df <- data.frame(cbind(id, var1, var2, var3, var4))
> df
  id   var1   var2   var3   var4
1  1    red    red yellow yellow
2  2 yellow yellow orange  green
3  3  green  green  green yellow
apply(df[-1],1,function(x) names(which.max(table(x))))
[1] "red" "yellow" "green"
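If you want control over how ties are resolved, one option is to turn each row into a factor whose levels are listed in order of preference: table() counts in level order, so which.max() returns the preferred level among the tied ones. The following is only a minimal sketch; the helper name most_frequent() and the priority vector are assumptions made for illustration, not part of the original answer.

```r
# Rebuild the tie example from above.
df <- data.frame(
  id   = 1:3,
  var1 = c("red", "yellow", "green"),
  var2 = c("red", "yellow", "green"),
  var3 = c("yellow", "orange", "green"),
  var4 = c("yellow", "green", "yellow"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value in x, with ties broken by the
# order of `priority` (earlier entries win). Values that are missing from
# `priority` become NA and are not counted.
most_frequent <- function(x, priority) {
  counts <- table(factor(x, levels = priority))
  names(counts)[which.max(counts)]
}

# Illustrative preference order: yellow beats red when the counts are tied.
priority <- c("yellow", "red", "green", "orange")

# Only the var* columns take part in the count; id is left out.
var_cols <- grep("^var", names(df), value = TRUE)
df$var5 <- apply(df[var_cols], 1, most_frequent, priority = priority)
df$var5
# "yellow" "yellow" "green"
```

Row 1 is tied between red and yellow; with this priority vector it resolves to yellow, whereas the alphabetical default gives red.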
Answer 2:
If your data is quite big, you might want to consider using the data.table package.
# Generate the data
nrow <- 10^5
id <- 1:nrow
colors <- c("red","yellow","green")
var1 <- sample(colors, nrow, replace = TRUE)
var2 <- sample(colors, nrow, replace = TRUE)
var3 <- sample(colors, nrow, replace = TRUE)
var4 <- sample(colors, nrow, replace = TRUE)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Chargaff's solution is simple and works well in some cases. You can gain a small performance improvement (~20%) using data.table.
df <- data.frame(cbind(id, var1, var2, var3, var4))
system.time(apply(df, 1, Mode))
# user system elapsed
# 1.242 0.018 1.264
library(data.table)
dt <- data.table(cbind(id, var1, var2, var3, var4))
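# Reshape to long format (one row per id/variable pair), then take the mode per id.
# The value column is named explicitly because its default name with patterns() differs across data.table versions.
system.time(melt(dt, id.vars = "id", measure.vars = paste0("var", 1:4), value.name = "value")[, Mode(value), by = id])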
# user system elapsed
# 1.020 0.012 1.034
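One thing to keep in mind when choosing between the two approaches is that they break ties differently. The Mode helper returns the tied value that appears first in the row, while the table()-based version from the first answer returns the tied value that sorts first. A small sketch on a made-up vector (the vector x is an illustrative assumption):

```r
# Same helper as above, repeated so the example runs on its own.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c("yellow", "red", "red", "yellow")  # two-way tie

Mode(x)                     # "yellow": first tied value in order of appearance
names(which.max(table(x)))  # "red": first tied value in sorted order
```

If the tie-breaking rule matters for your analysis, pick one approach and apply it consistently.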
Source: https://stackoverflow.com/questions/19982938/how-to-find-the-most-frequent-values-across-several-columns-containing-factors