How do you subset a data frame in R based on a minimum sample size

空扰寡人 submitted on 2019-12-10 21:36:27

Question


Let's say you have a data frame with two factors that looks like this:

Factor1    Factor2    Value
A          1          0.75
A          1          0.34
A          2          1.21   
A          2          0.75 
A          2          0.53
B          1          0.42
B          2          0.21  
B          2          0.18
B          2          1.42

etc.

How do I subset this data frame ("df", if you will) so that it keeps only the rows whose combination of Factor1 and Factor2 (Factor1*Factor2) has more than, say, 2 observations? Can you use length inside subset to do this?


Answer 1:


Assuming your data.frame is called mydf, you can use ave to create a logical vector to help subset:

mydf[with(mydf, as.logical(ave(Factor1, Factor1, Factor2, 
                           FUN = function(x) length(x) > 2))), ]
#   Factor1 Factor2 Value
# 3       A       2  1.21
# 4       A       2  0.75
# 5       A       2  0.53
# 7       B       2  0.21
# 8       B       2  0.18
# 9       B       2  1.42

Here's ave counting up your combinations. Notice that ave returns an object the same length as the number of rows in your data.frame (this makes it convenient for subsetting).

> with(mydf, ave(Factor1, Factor1, Factor2, FUN = length))
[1] "2" "2" "3" "3" "3" "1" "3" "3" "3"

The next step is to compare that length to your threshold. For that we need an anonymous function for our FUN argument.

> with(mydf, ave(Factor1, Factor1, Factor2, FUN = function(x) length(x) > 2))
[1] "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"  "FALSE" "TRUE"  "TRUE"  "TRUE" 

Almost there... but since the first argument to ave was a character vector, the output is also a character vector. Wrap it in as.logical so it can be used directly for subsetting.
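As a minimal sketch of that last step, the snippet below rebuilds the example data from the question and splits the one-liner into its parts. The intermediate names keep_chr, keep and result are introduced here only for illustration, and Factor1 is assumed to be stored as character.

```r
# Rebuild the example data from the question (Factor1 kept as character).
mydf <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = FALSE
)

# ave() hands back the type of its first argument, so the group test
# comes out as the strings "TRUE" / "FALSE" rather than as logicals.
keep_chr <- with(mydf, ave(Factor1, Factor1, Factor2,
                           FUN = function(x) length(x) > 2))

# Convert the strings to a logical vector, then use it to pick rows.
keep <- as.logical(keep_chr)
result <- mydf[keep, ]
print(result)
```

Naming the logical vector separately also makes it easy to inspect which rows are dropped before subsetting.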


This approach does not work if Factor1 is of class factor. In that case, convert it to character first, something like:

mydf[with(mydf, as.logical(ave(as.character(Factor1), Factor1, Factor2, 
                               FUN = function(x) length(x) > 2))),]
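A variation that may sidestep the type issue altogether (not part of the answer above, just a sketch) is to count on a numeric vector such as the row index, so that ave returns numbers whatever the class of Factor1. The name n_per_row is introduced here for illustration.

```r
# Example data with Factor1 stored as a factor this time.
mydf <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42),
  stringsAsFactors = TRUE
)

# The first argument is an integer row index, so ave() returns group sizes
# as numbers; Factor1 and Factor2 are used only as grouping variables.
n_per_row <- with(mydf, ave(seq_along(Factor1), Factor1, Factor2,
                            FUN = length))

# Compare outside ave(): the result is already a logical vector.
result <- mydf[n_per_row > 2, ]
print(result)
```

Because the comparison happens outside ave, no as.logical or as.character conversion is needed.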



Answer 2:


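library(data.table)

# Convert the data.frame (called your_df here) to a data.table
dt = data.table(your_df)

# For each Factor1/Factor2 group, return its rows (.SD) only if the group has more than 2 rows (.N)
dt[, if (.N > 2) .SD, by = list(Factor1, Factor2)]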
#   Factor1 Factor2 Value
#1:       A       2  1.21
#2:       A       2  0.75
#3:       A       2  0.53
#4:       B       2  0.21
#5:       B       2  0.18
#6:       B       2  1.42



Answer 3:


You can use interaction and table to see the number of observations for each combination of the two factors (mydata is your data), and then use %in% to subset the data.

mydata$inter <- with(mydata, interaction(Factor1, Factor2))
table(mydata$inter)
A.1 B.1 A.2 B.2 
  2   1   3   3 

mydata[!mydata$inter %in% c("A.1","B.1"), ]
  Factor1 Factor2 Value inter
3       A       2  1.21   A.2
4       A       2  0.75   A.2
5       A       2  0.53   A.2
7       B       2  0.21   B.2
8       B       2  0.18   B.2
9       B       2  1.42   B.2

Updated as per @Ananda's comment: after creating the interaction variable, you can use the following one-liner instead of typing the level names by hand.

mydata[mydata$inter %in% names(which(table(mydata$inter) > 2)), ]
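If the same filter is needed for different grouping columns or thresholds, the one-liner can be wrapped in a small helper. The function below is a hypothetical sketch: the name subset_min_n and its arguments are assumptions made here, not something from the answer.

```r
# Hypothetical helper: keep rows whose combination of grouping columns
# occurs more than `threshold` times.
subset_min_n <- function(df, group_cols, threshold) {
  # Combine the grouping columns into one factor; drop = TRUE omits
  # combinations that never occur in the data.
  grp <- interaction(df[group_cols], drop = TRUE)
  counts <- table(grp)
  big_enough <- names(counts)[counts > threshold]
  df[grp %in% big_enough, , drop = FALSE]
}

# Example data from the question.
mydata <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42)
)

result <- subset_min_n(mydata, c("Factor1", "Factor2"), 2)
print(result)
```

Unlike the version above, this sketch does not add an inter column to the data frame.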


Source: https://stackoverflow.com/questions/18257961/how-do-you-subset-a-data-frame-in-r-based-on-a-minimum-sample-size
