Combining low frequency counts

没有蜡笔的小新 2020-12-03 19:24

Trying to collapse a nominal categorical vector by combining low-frequency categories into an 'Other' category:

The data (column of a dataframe) looks like this, and c

7 answers
  •  执笔经年
    2020-12-03 20:01

    From the sounds of it, something like the following should work for you:

    condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
      toCondense <- names(which(prop.table(table(vector)) < threshold))
      vector[vector %in% toCondense] <- newName
      vector
    }
    

    Try it out:

    ## Sample data
    set.seed(1)
    a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
    
    round(prop.table(table(a)), 2)
    # a
    #    a    A    b    B    c    C    d    D    e    E    f    g    h 
    # 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13 
    #    i    j 
    # 0.08 0.07 
    
    a
    #  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
    # [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
    # [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
    # [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"
    
    condenseMe(a)
    #  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"    
    #  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"    
    # [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"    
    # [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"    
    # [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"    
    # [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"    
    # [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"    
    # [57] "e"     "c"     "Other" "b"   
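
    If you would rather cut off by absolute count than by proportion, the same idea can be sketched as follows. Note that condenseByCount, minCount, and the df/category names are made up for illustration and are not from the question:

    ```r
    ## Hypothetical variant: condense levels that occur fewer than minCount times
    condenseByCount <- function(vector, minCount = 5, newName = "Other") {
      counts <- table(vector)                        # raw frequency of each value
      toCondense <- names(counts[counts < minCount]) # values below the cutoff
      vector[vector %in% toCondense] <- newName
      vector
    }

    ## Made-up data frame with a character column
    df <- data.frame(category = c(rep("x", 10), rep("y", 6), "z", "w"),
                     stringsAsFactors = FALSE)
    df$category <- condenseByCount(df$category, minCount = 5)
    table(df$category)
    ```

    Here "z" and "w" each occur once, so both end up in "Other", while "x" and "y" are left alone.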
    

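    Note, however, that if you are dealing with factors, you should convert them with as.character first; otherwise, assigning a value that is not already one of the factor's levels produces NA (with a warning).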
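
    If it helps, here is a rough sketch of a wrapper that does that conversion for you and hands back a factor when it was given one. The name condenseFactor is hypothetical, and the body simply repeats the logic of condenseMe above:

    ```r
    ## Hypothetical factor-aware variant of condenseMe()
    condenseFactor <- function(x, threshold = 0.02, newName = "Other") {
      wasFactor <- is.factor(x)
      x <- as.character(x)    # work on plain strings so the new value is accepted
      toCondense <- names(which(prop.table(table(x)) < threshold))
      x[x %in% toCondense] <- newName
      if (wasFactor) factor(x) else x   # rebuild the factor with the new levels
    }

    ## Made-up example: "c" makes up 1% of the values
    f <- factor(c(rep("a", 60), rep("b", 39), "c"))
    levels(condenseFactor(f))
    ```

    The result has the levels "a", "b" and "Other"; their order depends on your locale's sort order, so set the levels explicitly if the order matters.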
