Combining low frequency counts

没有蜡笔的小新 2020-12-03 19:24

Trying to collapse a nominal categorical vector by combining low frequency counts into an 'Other' category:

The data (column of a dataframe) looks like this, and c

7 Answers
  • 2020-12-03 19:54

    Using the package dplyr, and assuming your data frame (let's call it State) has one field called ID for each State name...

    filtered_data <- State %>%
      group_by(ID) %>%
      summarise(n = n(),
                freq = n / nrow(State),
                above_thresh = freq > 0.2)

    filtered_data$ID[filtered_data$above_thresh] <- "above_0.2"
    

    Effectively, this gives any state whose frequency is above 0.2 the label "above_0.2".
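
    If you also want to relabel the original rows rather than just the summary, here is a minimal sketch along the same lines, flipped to the question's goal of collapsing rare levels into "Other" (the State data below is made up):

    library(dplyr)

    # hypothetical data: a character ID column with one rare state
    set.seed(1)
    State <- data.frame(ID = sample(c("CA", "TX", "WY"), 500,
                                    prob = c(0.55, 0.42, 0.03),
                                    replace = TRUE),
                        stringsAsFactors = FALSE)

    n_total <- nrow(State)

    State <- State %>%
      group_by(ID) %>%
      mutate(freq = n() / n_total) %>%   # per-level frequency, repeated on each row
      ungroup() %>%
      mutate(ID = ifelse(freq < 0.2, "Other", ID))

    table(State$ID)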

  • 2020-12-03 19:58

    I'm including an option that makes the change via the factor levels attribute, as an alternative to Ananda Mahto's answer. The efficiency (based on a factor of 10,000 observations with 10 levels) isn't much different (about 150 microseconds), so your choice would likely depend on whether you want to work with factors or with strings. Personally, I'd choose to work with strings and would use Ananda's approach.

    #* Using a list assignment to levels
    combineByThreshold <- function(x, threshold = .02, label = "Other")
    {
      prop <- prop.table(table(x))
      combine <- which(prop < threshold)

      # assigning a named list to levels() merges the rare levels into `label`
      levels(x) <- c(mapply(identity, levels(x)[-combine], SIMPLIFY = FALSE, USE.NAMES = TRUE),
                     setNames(list(levels(x)[combine]), label))
      x
    }
    
    #* Ananda Mahto
    condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
      toCondense <- names(which(prop.table(table(vector)) < threshold))
      vector[vector %in% toCondense] <- newName
      vector
    }
    
    #* F.R.
    collapsecategory <- function(x, p) {
      levels_len <- length(levels(x))
      levels(x)[levels_len + 1] <- 'Other'
      y <- table(x) / length(x)
      y1 <- as.vector(y)
      y2 <- names(y)
      y2_len <- length(y2)

      for (i in 1:y2_len) {
        if (y1[i] <= p) {
          x[x == y2[i]] <- 'Other'
        }
      }
      x <- droplevels(x)
      x
    }
    
    #* Steven Le
    
    filtered_data <- State %>%
      group_by(ID) %>%
      summarise(n = n(),
                freq = n / nrow(State),
                above_thresh = freq > 0.2)

    filtered_data$ID[filtered_data$above_thresh] <- "above_0.2"
    
    
    #**************************
    #* Comparison of methods
    library(microbenchmark)
    library(dplyr)
    
    x <- sample(LETTERS[1:10], 10000, 
                prob = rep(c(1, 20), c(4, 6)),
                replace = TRUE)
    
    x <- as.factor(x)
    
    DF <- data.frame(x = x) %>%
      mutate(x = as.character(x),
             orig = x)
    
    microbenchmark(
      combineByThreshold = combineByThreshold(x),
      condenseMe = condenseMe(as.character(x)),
      collapsecategory = collapsecategory(x, .02),
      dplyr = {Ref <- DF %>%
                 group_by(orig, x) %>%
                 summarise(n = n(), 
                           freq = n/nrow(DF),  
                           above_thresh = freq < 0.02)
               Ref$x[Ref$above_thresh == TRUE] <- "Other"
               DF <- left_join(DF[, "orig", drop=FALSE], Ref[, c("orig", "x")],
                               by = c("orig" = "orig"))
      }
    )
    Unit: microseconds
                   expr      min        lq      mean    median        uq      max neval
     combineByThreshold  694.108  715.0740  872.8452  735.4550  771.5235 3403.971   100
             condenseMe  756.275  771.9635  893.2784  789.5585  827.9730 3753.223   100
       collapsecategory 3679.032 3713.1955 4156.7700 3772.1375 3949.4030 6852.219   100
                  dplyr 4101.596 4229.0105 4515.9785 4408.6220 4602.4560 7772.711   100
    

    Brief disclaimer: I may not have represented Steven Le's answer fairly here. I interpreted his answer as generating a reference table of the new factor levels that could be merged back into the original data. Someone please correct me if I've misinterpreted.
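
    For reference, a minimal usage sketch of combineByThreshold on made-up data (the levels and proportions below are my own assumptions, not from the question):

    set.seed(42)
    f <- factor(sample(c("A", "B", "C", "rare1", "rare2"), 1000,
                       prob = c(0.455, 0.355, 0.18, 0.005, 0.005),
                       replace = TRUE))

    # the two rare levels fall under the default 2% threshold and get merged
    table(combineByThreshold(f))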

  • 2020-12-03 20:01

    From the sounds of it, something like the following should work for you:

    condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
      toCondense <- names(which(prop.table(table(vector)) < threshold))
      vector[vector %in% toCondense] <- newName
      vector
    }
    

    Try it out:

    ## Sample data
    set.seed(1)
    a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
    
    round(prop.table(table(a)), 2)
    # a
    #    a    A    b    B    c    C    d    D    e    E    f    g    h 
    # 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13 
    #    i    j 
    # 0.08 0.07 
    
    a
    #  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
    # [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
    # [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
    # [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"
    
    condenseMe(a)
    #  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"    
    #  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"    
    # [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"    
    # [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"    
    # [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"    
    # [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"    
    # [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"    
    # [57] "e"     "c"     "Other" "b"   
    

    Note, however, that if you are dealing with factors, you should convert them with as.character first.
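
    For example, if a were stored as a factor, a quick sketch of that conversion:

    a_factor <- as.factor(a)
    condenseMe(as.character(a_factor))           # collapse on the character values
    factor(condenseMe(as.character(a_factor)))   # re-factor afterwards if needed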

  • 2020-12-03 20:13

    Seems to work, but it's quite ugly. Is there a more elegant solution?

    collapsecategory <- function(x, p) {
      levels_len <- length(levels(x))
      levels(x)[levels_len + 1] <- 'Other'
      y <- table(x) / length(x)
      y1 <- as.vector(y)
      y2 <- names(y)
      y2_len <- length(y2)

      for (i in 1:y2_len) {
        if (y1[i] <= p) {
          x[x == y2[i]] <- 'Other'
        }
      }
      x <- droplevels(x)
      x
    }
    
  • 2020-12-03 20:16

    A little late to the game, but you may use my package DataExplorer. The group_category function is exactly what you are looking for. There are other options too; you can type ?group_category to find out more.

    For example, in your case:

    library(DataExplorer)
    group_category(data, "colname", 0.02, update = TRUE)
    

    Here are more examples.
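
    For instance, on made-up data (the column name and values here are placeholders; see ?group_category for exactly how the threshold and update arguments behave):

    library(DataExplorer)

    set.seed(1)
    dat <- data.frame(colname = sample(c("CA", "TX", "NY", "WY", "VT"), 1000,
                                       prob = c(0.41, 0.31, 0.26, 0.01, 0.01),
                                       replace = TRUE))

    # same call form as above, with the rare states grouped at the 0.02 threshold
    group_category(dat, "colname", 0.02, update = TRUE)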

  • 2020-12-03 20:19

    I used an updated version of the condenseMe function:

    condenseMe <- function(vector, name, limit) {
    
      toCondense <- names(which(prop.table(table(vector)) < limit))
      levels(vector)[levels(vector) %in% toCondense] <- name
    
      vector
    }
    

    Note: if NA is set as an explicit level, in some cases this condenseMe function will turn the NA level back into actual missing values. That's what happened to me.
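
    One way around this, sketched here on made-up data, is to make the NA values an explicit level and rename that level to an ordinary string before condensing:

    f <- factor(c("a", "a", "a", "b", NA, "c"))
    f <- addNA(f)                              # turn NA values into an explicit NA level
    levels(f)[is.na(levels(f))] <- "Missing"   # rename it so relabelling keeps it
    condenseMe(f, name = "Other", limit = 0.25)
    # [1] a     a     a     Other Other Other
    # Levels: a Other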
