Trying to collapse a nominal categorical vector by combining low-frequency levels into an 'Other' category:
The data (column of a dataframe) looks like this, and c
Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function, fct_lump(), which lumps together levels of a factor according to different criteria.
OP's requirement to lump together levels whose relative frequency falls below a threshold of 0.02 can be achieved by:
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
forcats::fct_lump(a, prop = 0.02)
 [1] c     d     d     e     j     h     c     h     g     i     g     d    
[13] f     Other g     h     h     a     b     h     e     g     h     b    
[25] d     e     e     g     i     f     d     e     g     c     g     a    
[37] Other i     i     b     i     j     f     d     c     h     Other j    
[49] j     c     Other e     f     a     a     h     e     c     Other b    
Levels: a b c d e f g h i j Other
Note that the sample data from this answer has been used for comparison.
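For readers who want to see what this kind of lumping amounts to, here is a rough base R sketch. The helper name lump_rare() and its arguments are hypothetical and not part of forcats. It only approximates fct_lump(prop = ...), and forcats may treat edge cases (ties, a single rare level, weights) differently.

```r
# Hypothetical helper (not part of forcats): merge every level whose
# relative frequency is at or below `prop` into one `other` level.
lump_rare <- function(x, prop = 0.02, other = "Other") {
  f <- factor(x)
  freq <- prop.table(table(f))        # relative frequency of each level
  rare <- names(freq)[freq <= prop]   # levels that fall at or below the threshold
  lv <- levels(f)
  lv[lv %in% rare] <- other           # rename the rare levels ...
  levels(f) <- lv                     # ... duplicated level names are merged by R
  f
}

# Small illustrative vector (made up for this sketch, not the data above)
x <- c(rep("a", 50), rep("b", 48), "C", "D")
table(lump_rare(x, prop = 0.02))
```

Assigning duplicated names through levels<- is what merges the rare levels into a single one, so no explicit recoding loop is needed.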
The function offers even more possibilities, e.g., it can keep the 5 factor levels with the lowest frequencies and lump together all the other levels:
forcats::fct_lump(a, n = -5)
 [1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D     Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B     Other Other Other Other Other Other Other Other Other E     Other
[49] Other Other C     Other Other Other Other Other Other Other A     Other
Levels: A B C D E Other