Elegant way to drop rare factor levels from data frame

Posted by 泄露秘密 on 2019-12-04 19:07:20

Question


I want to subset a data frame by a factor, retaining only the factor levels above a certain frequency.

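# stringsAsFactors = TRUE keeps the column a factor in R >= 4.0, where the default is FALSE
df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12), stringsAsFactors = TRUE)

This code creates the following data frame: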

   factor    variable
1       a -1.55902013
2       a  0.22355431
3       a -1.52195456
4       a -0.32842689
5       a  0.85650212
6       b  0.00962240
7       b -0.06621508
8       b -1.41347823
9       b  0.08969098
10      b  1.31565582
11      c -1.26141417
12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for-loop, and it works:

df.new <- df                     # start from the full data so every rare level gets removed
counts <- table(df$factor)       # number of rows per level
for (i in seq_along(counts)){
  if(counts[i] < 5){
    df.new <- df.new[df.new$factor != names(counts)[i],]
  }
}

But is there a quicker and prettier solution?


Answer 1:


What about

df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),]
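A minimal sketch of why this one-liner works, and of one follow-up step it may need. It assumes `df$factor` really is a factor (hence `stringsAsFactors = TRUE`, which is not in the original post), because the integer codes returned by `as.numeric()` are matched against positions in `table()`. The name `rare` is introduced here only for illustration. Note that subsetting leaves the emptied level in `levels()`; `droplevels()` removes it.

```r
# Rebuild the example data; the column must be a factor for as.numeric() to give level codes.
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12), stringsAsFactors = TRUE)

# Positions (in level order) of the levels that occur fewer than 5 times.
rare <- which(table(df$factor) < 5)

# Keep the rows whose level code is not one of the rare positions.
df.new <- df[!(as.numeric(df$factor) %in% rare), ]

levels(df.new$factor)          # the emptied level is still listed here
df.new <- droplevels(df.new)   # remove levels that no longer occur in the data
levels(df.new$factor)
```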



Answer 2:


require(dplyr)

df %>% group_by(factor) %>% filter(n() >= 5)
#factor   variable
#1       a  2.0769363
#2       a  0.6187513
#3       a  0.2426108
#4       a -0.4279296
#5       a  0.2270024
#6       b -0.6839748
#7       b -0.3285610
#8       b  0.2625743
#9       b -0.9532957
#10      b  1.4526317



Answer 3:


library(data.table)
setDT(df)[, variable[.N >= 5], by = factor]

##    factor         V1
## 1:      a -0.8204684
## 2:      a  0.4874291
## 3:      a  0.7383247
## 4:      a  0.5757814
## 5:      a -0.3053884
## 6:      b  1.5117812
## 7:      b  0.3898432
## 8:      b -0.6212406
## 9:      b -2.2146999
## 10:     b  1.1249309



Answer 4:


Maybe join with a filtered count of the factors:

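library(dplyr)
# %.% has been removed from dplyr; %>% is its replacement
common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
# keep only the rows of df whose factor value appears in common.factors
df.1 <- semi_join(df, common.factors, by = "factor")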



Answer 5:


Try this with base functions...

lvl = as.data.frame(table(df$factor))
colnames(lvl) = c('factor','count')
lvl
  factor count
1      a     5
2      b     5
3      c     2

df[df$factor %in% lvl[lvl$count>=5,]$factor,]
   factor    variable
1       a -0.01619026
2       a  0.94383621
3       a  0.82122120
4       a  0.59390132
5       a  0.91897737
6       b  0.78213630
7       b  0.07456498
8       b -1.98935170
9       b  0.61982575
10      b -0.05612874



Answer 6:


This worked for me:

df <- df[df$factor %in% names(table(df$factor))[table(df$factor) >= 5], ]
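If the same filter is needed in several places, the idea can be wrapped in a small helper. This is only a sketch: the function name `keep_common_levels` and its arguments `column` and `min_count` are hypothetical, not part of base R or of the answers here. It also calls `droplevels()` so that levels left without rows disappear from the result.

```r
# Hypothetical helper: keep the rows whose value in `column` occurs at least `min_count` times.
keep_common_levels <- function(data, column, min_count = 5) {
  counts <- table(data[[column]])               # rows per level (NA values are not counted)
  common <- names(counts)[counts >= min_count]  # names of the levels that are frequent enough
  kept <- data[data[[column]] %in% common, , drop = FALSE]
  droplevels(kept)                              # drop factor levels that no longer occur
}

# Usage on the example data from the question.
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12), stringsAsFactors = TRUE)
df.new <- keep_common_levels(df, "factor", min_count = 5)
table(df.new$factor)
```

Because the comparison goes through level names rather than integer codes, the helper also works when the column is a character vector.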


Source: https://stackoverflow.com/questions/24259194/elegant-way-to-drop-rare-factor-levels-from-data-frame
