Combining low frequency counts

没有蜡笔的小新

I am trying to collapse a nominal categorical vector by combining low-frequency levels into an 'Other' category:

The data (column of a dataframe) looks like this, and c

7 Answers

旧巷少年郎

2020-12-03 20:19

Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function fct_lump() which lumps together levels of a factor according to different criteria.

OP's requirement to lump together factor levels whose relative frequency is below a threshold of 0.02 can be achieved by

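set.seed(1)  # make the random sample reproducible
# 5 rare levels (A to E, once each) plus 55 draws from a to j, shuffled
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
# lump the levels that fall below the 0.02 threshold into "Other"
forcats::fct_lump(a, prop = 0.02)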

 [1] c     d     d     e     j     h     c     h     g     i     g     d    
[13] f     Other g     h     h     a     b     h     e     g     h     b    
[25] d     e     e     g     i     f     d     e     g     c     g     a    
[37] Other i     i     b     i     j     f     d     c     h     Other j    
[49] j     c     Other e     f     a     a     h     e     c     Other b    
Levels: a b c d e f g h i j Other

Note that the sample data is taken from another answer to this question so that the results can be compared.
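To make the prop threshold concrete, here is a rough base R sketch of the same idea. lump_rare() and its arguments are hypothetical names made up for this illustration; it is not how forcats implements fct_lump(), and edge cases (ties exactly at the threshold, a single rare level, NA handling) may behave differently.

```r
# lump_rare() is a hypothetical helper for illustration; it is not part of forcats.
lump_rare <- function(x, prop = 0.02, other = "Other") {
  x <- as.character(x)
  share <- prop.table(table(x))                  # relative frequency of each level
  rare <- names(share)[as.vector(share) < prop]  # levels below the threshold
  keep <- setdiff(names(share), rare)            # levels that stay as they are
  x[x %in% rare] <- other                        # recode the rare levels
  factor(x, levels = unique(c(keep, if (length(rare) > 0) other)))
}

# "c" and "d" each make up 1% of the values, so both fall below 0.02
x <- c(rep("a", 60), rep("b", 38), "c", "d")
res <- lump_rare(x, prop = 0.02)
table(res)
```

In practice fct_lump() is the more convenient choice; the sketch only shows what the threshold does.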

fct_lump() offers even more possibilities, e.g., with a negative n it keeps the 5 factor levels with the lowest frequencies and lumps all other levels together:

forcats::fct_lump(a, n = -5)

 [1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D     Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B     Other Other Other Other Other Other Other Other Other E     Other
[49] Other Other C     Other Other Other Other Other Other Other A     Other
Levels: A B C D E Other
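The effect of a negative n can be sketched in base R as well: rank the levels by count and keep only the n rarest. keep_rarest() is a hypothetical helper written for this illustration, and the tie handling via ties.method = "min" is an assumption about sensible behaviour, not a statement about the forcats internals.

```r
# keep_rarest() is a hypothetical helper for illustration; it is not part of forcats.
keep_rarest <- function(x, n = 5, other = "Other") {
  x <- as.character(x)
  counts <- table(x)                                  # absolute frequency per level
  rk <- rank(as.vector(counts), ties.method = "min")  # 1 = rarest level
  keep <- names(counts)[rk <= n]                      # the n least frequent levels
  lump <- !is.na(x) & !(x %in% keep)                  # everything else gets lumped
  x[lump] <- other
  factor(x, levels = unique(c(keep, if (any(lump)) other)))
}

# "y" and "z" occur once each, so they are the 2 rarest levels
y <- c(rep("a", 5), rep("b", 4), "y", "z")
res2 <- keep_rarest(y, n = 2)
table(res2)
```

Because tied levels share a rank, more than n levels can be kept when several levels have the same count.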
