问题
I want to generate a df that selects rows associated with an "ID" that in turn is associated with a variable called cutoff. For this example, I set the cutoff to 9, meaning that I want to select rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 3 or a 4 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
set.seed(123)
ID<-rep(c(1,2,3,4,5),times=c(5,7,9,11,13))
sub1<-rnorm(45)
sub2<-rnorm(45)
df1<-data.frame(ID,sub1,sub2)
IDfreq<-count(df1,"ID")
cutoff<-9
df2<-subset(df1,subset=(IDfreq$freq>cutoff))
回答1:
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
This will test to see if the df1$ID value is in a category with more than 9 values. If it is, then the logical element for the returned vector will be TRUE and in turn that as the "i" argument will cause the [
-function to return the entire row since the "j" item is empty.
See:
?`[`
?'%in%'
回答2:
Maybe closer to what you had in mind is to create a vector of frequencies using ave
:
subset(df1, ave(ID, ID, FUN = length) > cutoff)
回答3:
Using dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n()>cutoff)
来源:https://stackoverflow.com/questions/24835233/subset-based-on-frequency-level