问题
Maybe it's something basic, but I couldn't find the answer.
I have
Id Year V1
1 2009 33
1 2010 67
1 2011 38
2 2009 45
3 2009 65
3 2010 74
4 2009 47
4 2010 51
4 2011 14
I need to select only the rows that have the same Id but it´s in the three years 2009, 2010 and 2011.
Id Year V1
1 2009 33
1 2010 67
1 2011 38
4 2009 47
4 2010 51
4 2011 14
I try
d1_3 <- subset(d1, Year==2009 |Year==2010 |Year==2011 )
but it doesn't work.
Can anyone provide some suggestions that how I can do this in R?
回答1:
I think ave
could be useful here. I call your original data frame 'df'. For each Id, check if 2009-2011 is present in Year (2009:2011 %in% x
). This gives a logical vector, which can be sum
med. Test if the sum equals 3 (if all Years are present, the sum is 3), which results in a new logical vector, which is used to subset rows of the data frame.
df[ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3, ]
# Id Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
回答2:
Another way of using ave
DF
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 4 2 2009 45
## 5 3 2009 65
## 6 3 2010 74
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
DF[ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x)) == 1, ]
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
回答3:
This should do the job :)
library(plyr)
ds<-ddply(ds,.(Id),mutate,Nobs=length(Year))
ds[ds$Nobs == 3 & ds$Year %in% 2009:2011,]
回答4:
I think an approach using ave
is reasonable. But there are lots of ways to solve this problem. I show a few other ways using base R. Then in the last 2 examples I'll introduce the package data.table
.
Again, just throwing this out there to provide some options to use different aspects of the language.
d1 <- data.frame(ID=c(1,1,1,2,3,3,4,4,4), Year=c(2009,2010,2011, 2009,2009, 2010, 2009, 2010, 2011), V1=c(33, 67, 38, 45, 65, 74, 47, 51, 14))
# long way
use_years <- as.character(2009:2011)
cnts <- table(d1[,c("ID","Year")])[,use_years]
use_id <- rownames(cnts)[rowSums(cnts)==length(use_years)]
d1[d1[,"ID"]%in%use_id,]
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
# another longish way
ind1 <- d1[,"Year"]%in%2009:2011
d1_ind <- d1[ind1,"ID"]
ind2 <- d1_ind %in% unique(d1_ind)[tabulate(d1_ind)==3]
d1[ind1,][ind2,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
OK, let's try out a couple methods using data.table. One of my favorite packages of all time. Can be a little tricky at first though, so make sure your boots are on tight (Oh, yeah, it's fast!) :)
# medium way
library(data.table)
d2 <- as.data.table(d1)
d2[ID%in%d2[Year%in%2009:2011, list(logic=nrow(.SD)==3),by="ID"][(logic),ID]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
# short way
d2[Year%in%2009:2011][ID%in%unique(ID)[table(ID)==3]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
来源:https://stackoverflow.com/questions/22475189/subset-by-multiple-conditions