subset based on frequency level [duplicate]

Question

I want to generate a data frame containing only the rows whose "ID" value occurs more often than a threshold stored in a variable called cutoff. For this example I set the cutoff to 9, meaning that I want to select the rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a data frame that I don't understand. The correct result would have 24 rows, all with either a 4 or a 5 in the ID column (those IDs have 11 and 13 rows respectively). Can someone explain what my last line of code is actually doing and suggest a different approach?

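library(plyr)  # count(df1, "ID") with a freq column is plyr's count()
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
sub1 <- rnorm(45)
sub2 <- rnorm(45)
df1 <- data.frame(ID, sub1, sub2)
IDfreq <- count(df1, "ID")  # one row per distinct ID, with its frequency
cutoff <- 9
df2 <- subset(df1, subset = (IDfreq$freq > cutoff))  # the line in question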

Answer 1:

df1[df1$ID %in% names(table(df1$ID))[table(df1$ID) > 9], ]

This tests whether each df1$ID value belongs to a category with more than 9 rows. Where it does, the corresponding element of the resulting logical vector is TRUE. Passed as the "i" argument, that vector makes the `[` function return the entire row, because the "j" argument is left empty.

See:

?`[`
?'%in%'
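
As for what the last line of the question's code is actually doing, the following is a minimal sketch in base R. It uses table() instead of the question's count() purely to stay self-contained, and the names keep_short, recycled and keep_rows are illustrative, not from the original code. The frequency vector has one element per distinct ID (5 values), not one per row (45 values), so the logical index is recycled down the rows of df1.

```r
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))
cutoff <- 9

# One frequency per distinct ID: 5 7 9 11 13
freq <- as.vector(table(df1$ID))
keep_short <- freq > cutoff  # FALSE FALSE FALSE TRUE TRUE

# A logical index shorter than nrow(df1) is recycled, so rows
# 4, 5, 9, 10, 14, 15, ... are kept regardless of their ID value.
recycled <- df1[keep_short, ]
nrow(recycled)  # 18

# Expanding the test to one logical value per row gives the intended subset.
keep_rows <- df1$ID %in% names(table(df1$ID))[table(df1$ID) > cutoff]
nrow(df1[keep_rows, ])  # 24
```

The 18 rows returned by the recycled index are simply every 4th and 5th row of each block of five, which is why the result looks unrelated to the ID frequencies.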

Answer 2:

Maybe closer to what you had in mind is to create a vector of frequencies using ave:

subset(df1, ave(ID, ID, FUN = length) > cutoff)
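
To see why this works, it may help to inspect what ave returns on its own. The sketch below rebuilds only the ID column from the question; per_row is an illustrative name. The result has one group size per row, so it lines up with the rows of df1 and no recycling takes place.

```r
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
cutoff <- 9

# ave() applies FUN within each ID group and returns a vector as long as ID,
# so every row carries the size of the group it belongs to.
per_row <- ave(ID, ID, FUN = length)

length(per_row)        # 45, one value per row
unique(per_row)        # 5 7 9 11 13
sum(per_row > cutoff)  # 24 rows pass the cutoff
```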

Answer 3:

Using dplyr:

library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > cutoff)

Source: https://stackoverflow.com/questions/24835233/subset-based-on-frequency-level

Tags

subset

frequency