Combining low frequency counts

没有蜡笔的小新 2020-12-03 19:24

Trying to collapse a nominal categorical vector by combining low frequency counts into an 'Other' category:

The data (column of a dataframe) looks like this, and c

7 Answers
  • 2020-12-03 20:19

    Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function fct_lump() which lumps together levels of a factor according to different criteria.

    OP's requirement to lump together factor levels whose relative frequency is below a threshold of 0.02 can be achieved with:

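    set.seed(1)
    # 60 values in random order: "A" to "E" once each, plus 55 draws from "a" to "j"
    a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
    # Lump every level that makes up less than 2% of the values into "Other"
    forcats::fct_lump(a, prop = 0.02)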
    
     [1] c     d     d     e     j     h     c     h     g     i     g     d    
    [13] f     Other g     h     h     a     b     h     e     g     h     b    
    [25] d     e     e     g     i     f     d     e     g     c     g     a    
    [37] Other i     i     b     i     j     f     d     c     h     Other j    
    [49] j     c     Other e     f     a     a     h     e     c     Other b    
    Levels: a b c d e f g h i j Other
    

    Note that the sample data is taken from another answer to this question, to allow comparison.
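
    To make explicit what lumping by proportion does, here is a minimal base R sketch. lump_prop() is a hypothetical helper written for illustration, not part of forcats, and its handling of edge cases (a level exactly at the threshold, or only a single rare level) may differ from fct_lump(). The vector x is made-up example data.

    ```r
    # Hypothetical helper (not a forcats function): merge all levels whose
    # relative frequency is below `prop` into a single `other` level.
    lump_prop <- function(x, prop, other = "Other") {
      f <- factor(x)
      freq <- prop.table(table(f))            # relative frequency per level
      rare <- names(freq)[freq < prop]        # levels below the threshold
      if (length(rare) == 0) return(f)        # nothing to lump
      levels(f)[levels(f) %in% rare] <- other # merge rare levels into one
      # Move the merged level to the end of the level order
      factor(f, levels = c(setdiff(levels(f), other), other))
    }

    # Made-up data: "c" and "d" each account for 1% of the 100 values
    x <- c(rep("a", 50), rep("b", 45), "c", "d", rep("e", 3))
    res <- lump_prop(x, prop = 0.02)
    levels(res)  # "a" "b" "e" "Other"
    table(res)
    ```

    Assigning the same name to several levels is what merges them; the final factor() call only reorders the levels so that "Other" comes last.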


    fct_lump() offers further options. For example, with a negative n it keeps the 5 factor levels with the lowest frequencies and lumps all other levels together:

    forcats::fct_lump(a, n = -5)
    
     [1] Other Other Other Other Other Other Other Other Other Other Other Other
    [13] Other D     Other Other Other Other Other Other Other Other Other Other
    [25] Other Other Other Other Other Other Other Other Other Other Other Other
    [37] B     Other Other Other Other Other Other Other Other Other E     Other
    [49] Other Other C     Other Other Other Other Other Other Other A     Other
    Levels: A B C D E Other
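
    Since forcats 0.5.0, the behaviour of fct_lump() is also available through more specific functions such as fct_lump_prop() and fct_lump_n(), and the package documentation steers users toward those. The following is a sketch of the same two kinds of call in that style, assuming forcats >= 0.5.0 is installed. The vector x is made-up example data, not the OP's.

    ```r
    library(forcats)  # assumes forcats >= 0.5.0 is installed

    # Made-up data: "c" and "d" each account for 1% of the 100 values
    x <- c(rep("a", 50), rep("b", 45), "c", "d", rep("e", 3))

    # Lump levels by relative frequency, using the threshold 0.02
    by_prop <- fct_lump_prop(x, prop = 0.02)

    # Keep the 2 most frequent levels and lump the rest
    top2 <- fct_lump_n(x, n = 2)

    # A negative n keeps the least frequent levels instead
    rare2 <- fct_lump_n(x, n = -2)

    levels(by_prop)
    levels(top2)
    levels(rare2)
    ```

    Naming the criterion in the function name makes the intent of each call easier to read than the argument-driven fct_lump().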
    