How to find the statistical mode?

时光取名叫无心 2020-11-21 07:00

In R, mean() and median() are standard functions which do what you'd expect. mode() tells you the internal storage mode of the object

30 answers
  •  耶瑟儿~
    2020-11-21 08:02

    I was looking through all these options and started to wonder about their relative features and performance, so I ran some tests. In case anyone else is curious about the same thing, I'm sharing my results here.

    Not wanting to test every function posted here, I chose to focus on a sample based on a few criteria: the function should work on character, factor, logical and numeric vectors alike, it should deal with NAs and other problematic values appropriately, and its output should be 'sensible', i.e. no numerics returned as character or other such silliness.
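As a rough illustration of how criteria like these can be checked, here is a minimal sketch. The function `mode_all` and the test vectors are hypothetical stand-ins, not any of the functions posted in this thread, and "sensible" is narrowed here to mean "same class as the input, with no NA reported".

```r
# Hypothetical candidate (an assumption, not one of the posted functions):
# returns every value tied for the highest count, ignoring NAs.
mode_all <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(x)
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # occurrences of each unique value
  ux[counts == max(counts)]
}

# One small test vector per type named in the criteria.
inputs <- list(
  character = c("a", "b", "b", NA),
  factor    = factor(c("lo", "hi", "hi", NA)),
  logical   = c(TRUE, TRUE, FALSE, NA),
  numeric   = c(1.5, 2, 2, NA)
)

results <- lapply(inputs, mode_all)

# Check 1: the output keeps the class of the input.
same_class <- mapply(function(inp, out) identical(class(inp), class(out)),
                     inputs, results)
# Check 2: NA is never reported as a mode.
no_na <- vapply(results, function(out) !anyNA(out), logical(1))

print(same_class)
print(no_na)
```

Any candidate function can be swapped in for `mode_all` to see which of the two checks it passes for each input type.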

    I also added a function of my own, which is based on the same rle idea as chrispy's, except adapted for more general use:

    library(magrittr)
    
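    # Returns every mode of x; with freq=TRUE, also returns how often each occurs.
    Aksel <- function(x, freq=FALSE) {
        z <- 2                      # column 2 holds the values
        if (freq) z <- 1:2          # also keep column 1, the run lengths
        # sort() drops NAs, so rle() yields one run per distinct value
        run <- x %>% as.vector %>% sort %>% rle %>% unclass %>% data.frame
        colnames(run) <- c("freq", "value")
        run[which(run$freq==max(run$freq)), z] %>% as.vector
    }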
    
    set.seed(2)
    
    F <- sample(c("yes", "no", "maybe", NA), 10, replace=TRUE) %>% factor
    Aksel(F)
    
    # [1] maybe yes  
    
    C <- sample(c("Steve", "Jane", "Jonas", "Petra"), 20, replace=TRUE)
    Aksel(C, freq=TRUE)
    
    # freq value
    #    7 Steve
    

    I ended up running five functions on two sets of test data through microbenchmark. The function names refer to their respective authors.
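A comparison of this kind can be set up roughly as follows. This is only a sketch: it assumes the microbenchmark package is installed, and `mode_first`, `mode_table` and both data sets are hypothetical stand-ins, not the five functions or the test data actually used.

```r
# Assumes the microbenchmark package is installed.
library(microbenchmark)

# Hypothetical stand-ins, NOT the functions that were actually benchmarked.
mode_first <- function(x) {          # reports only the first mode found
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
mode_table <- function(x) {          # reports all modes, as character names
  tab <- table(x)
  names(tab)[tab == max(tab)]
}

set.seed(1)
# Two illustrative data sets: few unique values vs. many unique values.
few_unique  <- sample(1:10,   1e4, replace = TRUE)
many_unique <- sample(1:5000, 1e4, replace = TRUE)

bench <- microbenchmark(
  first_few  = mode_first(few_unique),
  table_few  = mode_table(few_unique),
  first_many = mode_first(many_unique),
  table_many = mode_table(many_unique),
  times = 20                         # repetitions per expression
)
print(bench)
```

Timings will differ between machines and R versions, so only the relative ordering within one run is worth comparing.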

    Chris' function was set to method="modes" and na.rm=TRUE by default to make it more comparable, but other than that the functions were used as presented here by their authors.

    In terms of speed alone, Ken's version wins handily, but it is also the only one of these that reports just one mode, no matter how many there really are. As is often the case, there's a trade-off between speed and versatility. With method="mode", Chris' version will return a value iff there is exactly one mode, else NA. I think that's a nice touch. I also find it interesting that some of the functions are strongly affected by an increased number of unique values, while others aren't nearly as much. I haven't studied the code in detail to figure out why that is, apart from eliminating logical/numeric as the cause.
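The "one mode or NA" behaviour described above can be sketched like this. The helper `single_mode_or_na` is hypothetical and written only for illustration; it is not Chris' code.

```r
# Hypothetical helper (an assumption, not Chris' implementation):
# returns the mode if it is unique, otherwise NA.
single_mode_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)     # nothing left to count
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # occurrences of each unique value
  winners <- ux[counts == max(counts)]
  if (length(winners) == 1) winners else NA
}

single_mode_or_na(c(1, 2, 2, 3))   # a single mode: returns 2
single_mode_or_na(c(1, 1, 2, 2))   # 1 and 2 are tied: returns NA
```

Returning NA on a tie makes the ambiguity visible to the caller, whereas a first-mode-only function silently picks one of the tied values.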
