How do you subset a data frame in R based on a minimum sample size?

Question

Let's say you have a data frame with two levels of factors that looks like this:

Factor1    Factor2    Value
A          1          0.75
A          1          0.34
A          2          1.21   
A          2          0.75 
A          2          0.53
B          1          0.42
B          2          0.21  
B          2          0.18
B          2          1.42

etc.

How do I subset this data frame ("df", if you will) based on the condition that the combination of Factor1 and Factor2 (Fact1*Fact2) has more than, say, 2 observations? Can you use the length argument in subset to do this?

Answer 1:

Assuming your data.frame is called mydf, you can use ave to create a logical vector to help subset:

mydf[with(mydf, as.logical(ave(Factor1, Factor1, Factor2, 
                           FUN = function(x) length(x) > 2))), ]
#   Factor1 Factor2 Value
# 3       A       2  1.21
# 4       A       2  0.75
# 5       A       2  0.53
# 7       B       2  0.21
# 8       B       2  0.18
# 9       B       2  1.42

Here's ave counting up your combinations. Notice that ave returns an object the same length as the number of rows in your data.frame (this makes it convenient for subsetting).

> with(mydf, ave(Factor1, Factor1, Factor2, FUN = length))
[1] "2" "2" "3" "3" "3" "1" "3" "3" "3"

The next step is to compare that length to your threshold. For that we need an anonymous function for our FUN argument.

> with(mydf, ave(Factor1, Factor1, Factor2, FUN = function(x) length(x) > 2))
[1] "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"  "FALSE" "TRUE"  "TRUE"  "TRUE"

Almost there... but since the first argument to ave was a character vector, the output is also a character vector. Wrap it in as.logical so it can be used directly for subsetting.

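If Factor1 is a factor rather than a character vector, the ave call above does not work as written, because the results cannot be stored in a factor. In that case, convert the first argument to character: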

mydf[with(mydf, as.logical(ave(as.character(Factor1), Factor1, Factor2, 
                               FUN = function(x) length(x) > 2))),]
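As a sketch of a variation on the approach above (not part of the original answer): if the first argument to ave is the numeric Value column, the counts come back as numbers, so neither as.logical nor as.character is needed, and it makes no difference whether the grouping columns are factors. The data frame below is rebuilt from the rows shown in the question; the names n_per_group and result are only illustrative.

```r
# Sample data rebuilt from the rows shown in the question.
mydf <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = TRUE  # Factor1 is deliberately a factor here
)

# ave() returns one value per row: the size of that row's
# Factor1/Factor2 group. Because Value is numeric, so is the result.
n_per_group <- ave(mydf$Value, mydf$Factor1, mydf$Factor2, FUN = length)

# Keep only the rows whose group has more than 2 observations.
result <- mydf[n_per_group > 2, ]
print(result)
```

The same condition can also be passed to subset, which is close to what the question asks for: subset(mydf, ave(Value, Factor1, Factor2, FUN = length) > 2).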

Answer 2:

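library(data.table)

# Convert the data frame (here called your_df) to a data.table
dt = data.table(your_df)

# For each Factor1/Factor2 group, return its rows (.SD) only if the group has more than 2 rows (.N)
dt[, if (.N > 2) .SD, by = list(Factor1, Factor2)]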
#   Factor1 Factor2 Value
#1:       A       2  1.21
#2:       A       2  0.75
#3:       A       2  0.53
#4:       B       2  0.21
#5:       B       2  0.18
#6:       B       2  1.42
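To make the data.table one-liner easier to follow, here is a minimal self-contained sketch that first shows the per-group counts and then applies the same filter. It assumes the data.table package is installed; the data is rebuilt from the rows in the question, and the names counts and filtered are only illustrative.

```r
library(data.table)

# Sample data rebuilt from the rows shown in the question.
dt <- data.table(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

# .N is the number of rows in the current group, so this gives
# one row per Factor1/Factor2 combination with its count in column N.
counts <- dt[, .N, by = .(Factor1, Factor2)]
print(counts)

# .SD is the current group's rows (minus the grouping columns).
# Groups where the condition is FALSE return NULL and are dropped.
filtered <- dt[, if (.N > 2) .SD, by = .(Factor1, Factor2)]
print(filtered)
```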

Answer 3:

You can use interaction and table to see the number of observations for each combination of the two factors (mydata is your data), and then use %in% to subset the data.

mydata$inter <- with(mydata, interaction(Factor1, Factor2))
table(mydata$inter)
A.1 B.1 A.2 B.2 
  2   1   3   3 

mydata[!mydata$inter %in% c("A.1","B.1"), ]
  Factor1 Factor2 Value inter
3       A       2  1.21   A.2
4       A       2  0.75   A.2
5       A       2  0.53   A.2
7       B       2  0.21   B.2
8       B       2  0.18   B.2
9       B       2  1.42   B.2

Updated as per @Ananda's comment: after creating the interaction variable, you can do the subsetting in one line, without typing the group names by hand:

mydata[mydata$inter %in% names(which(table(mydata$inter) > 2)), ]
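The steps above can also be sketched without adding an inter column to the data, by keeping the interaction in a separate vector. This is a variation under that assumption, not the original answer's code; grp, counts, keep and result are illustrative names, and the data is rebuilt from the rows in the question.

```r
# Sample data rebuilt from the rows shown in the question.
mydata <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

# One label per row, e.g. "A.1"; drop = TRUE omits combinations
# that never occur in the data.
grp <- interaction(mydata$Factor1, mydata$Factor2, drop = TRUE)

# Count the rows per combination and keep the labels above the threshold.
counts <- table(grp)
keep <- names(counts)[as.vector(counts) > 2]

# Subset using the labels; mydata itself gains no extra column.
result <- mydata[grp %in% keep, ]
print(result)
```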

Source: https://stackoverflow.com/questions/18257961/how-do-you-subset-a-data-frame-in-r-based-on-a-minimum-sample-size

Tags

subset