subset based on frequency level [duplicate]

Submitted by 情到浓时终转凉″ on 2019-12-06 07:59:32

Question


I want to build a data frame containing only the rows of df1 whose "ID" value occurs more often than a threshold stored in a variable called cutoff. For this example, I set cutoff to 9, meaning that I want to select the rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column (the two IDs with 11 and 13 rows). Can someone explain what my last line of code is actually doing and suggest a different approach?

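library(plyr)  # assumed source of count(): its result has the freq column used below
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
sub1 <- rnorm(45)
sub2 <- rnorm(45)
df1 <- data.frame(ID, sub1, sub2)
IDfreq <- count(df1, "ID")  # one row per ID, with its row count in freq
cutoff <- 9
# The line in question: it does not return the expected 24 rows
df2 <- subset(df1, subset = (IDfreq$freq > cutoff))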

Answer 1:


df1[df1$ID %in% names(table(df1$ID))[table(df1$ID) > 9], ]

This tests, for each row, whether its df1$ID value belongs to a category with more than 9 rows. If it does, the corresponding element of the resulting logical vector is TRUE. Passing that vector as the "i" argument makes the [ function return the matching rows, and because the "j" argument is left empty, every column of those rows is returned.
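The one-liner above can be unpacked into named steps. This is only a sketch of the same idea; the intermediate names counts, keep_ids and keep_rows are introduced here for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))
cutoff <- 9

counts <- table(df1$ID)                     # number of rows for each ID
keep_ids <- names(counts)[counts > cutoff]  # IDs above the cutoff, as character strings
keep_rows <- df1$ID %in% keep_ids           # one TRUE/FALSE per row of df1
df2 <- df1[keep_rows, ]                     # empty "j" keeps every column

nrow(df2)        # 24
unique(df2$ID)   # 4 5
```

Note that names() returns character strings, so %in% compares the numeric IDs against "4" and "5" after coercion; that works here but is worth keeping in mind for IDs that do not print cleanly.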

See:

?`[`
?'%in%'



Answer 2:


Maybe closer to what you had in mind is to create a vector of frequencies using ave:

subset(df1, ave(ID, ID, FUN = length) > cutoff)
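To see why the original attempt misbehaves while the ave version works, it may help to compare the lengths of the two logical vectors. The sketch below uses table() as a stand-in for IDfreq$freq (an assumption made to avoid extra packages); per_id, per_row, bad and good are illustrative names.

```r
# Rebuild the example data from the question
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))
cutoff <- 9

# One value per distinct ID (stand-in for IDfreq$freq > cutoff)
per_id <- as.vector(table(df1$ID)) > cutoff
length(per_id)   # 5

# A length-5 condition is recycled over the 45 rows, so rows 4, 5, 9, 10, ...
# are kept regardless of their ID
bad <- subset(df1, per_id)
nrow(bad)        # 18

# ave() repeats each ID's count on every row of that ID: one value per row
per_row <- ave(df1$ID, df1$ID, FUN = length) > cutoff
length(per_row)  # 45

good <- df1[per_row, ]
nrow(good)       # 24
```

The condition given to subset (or to [) needs one element per row; a shorter vector is silently recycled rather than matched by ID.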



Answer 3:


Using dplyr:

library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > cutoff)


Source: https://stackoverflow.com/questions/24835233/subset-based-on-frequency-level
