问题
I would like to subset an unbalanced panel data set by group. For each group, I would like to keep the two observations in the first and the last years.
How do I best do this in R? For example:
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)),
year=c(2001:2003,2000,2002,2000:2001,2003))
> dt
name year
1 A 2001
2 A 2002
3 A 2003
4 B 2000
5 B 2002
6 C 2000
7 C 2001
8 C 2003
What I would like to have:
name year
1 A 2001
3 A 2003
4 B 2000
5 B 2002
6 C 2000
8 C 2003
回答1:
dplyr should help. check out first() & last() to get the values you are looking for and then filter based on those values.
dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), year=c(2001:2003,2000,2002,2000:2001,2003))
library(dplyr)
dt %>%
group_by(name) %>%
mutate(first = first(year)
,last = last(year)) %>%
filter(year == first | year == last) %>%
select(name, year)
name year
1 A 2001
2 A 2003
3 B 2000
4 B 2002
5 C 2000
6 C 2003
*your example to didn't mention any specific order but it that case, arrange() will help
回答2:
Here's a quick possible data.table
solution
library(data.table)
setDT(dt)[, .SD[c(1L, .N)], by = name]
# name year
# 1: A 2001
# 2: A 2003
# 3: B 2000
# 4: B 2002
# 5: C 2000
# 6: C 2003
Or if you only have two columns
dt[, year[c(1L, .N)], by = name]
回答3:
This is pretty simple using by
to split the data.frame by group and then return the head and tail of each group.
> do.call(rbind, by(dt, dt$name, function(x) rbind(head(x,1),tail(x,1))))
name year
A.1 A 2001
A.3 A 2003
B.4 B 2000
B.5 B 2002
C.6 C 2000
C.8 C 2003
head
and tail
are convenient, but slow, so a slightly different alternative would probably be faster on a large data.frame:
do.call(rbind, by(dt, dt$name, function(x) x[c(1,nrow(x)),]))
来源:https://stackoverflow.com/questions/29376672/subset-panel-data-by-group