R - Keep first observation per group identified by multiple variables (Stata equivalent “bys var1 var2 : keep if _n == 1”)


I currently face a problem in R that I know exactly how to deal with in Stata, but I have wasted over two hours trying to accomplish it in R.

Using the data.frame below, the


3 Answers

鱼传尺愫

2020-12-15 00:54

I would first order the data.frame, at which point you can use by() to take the first row of each group:

mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]

do.call(rbind, by(mydata, list(mydata$id, mydata$day), 
                  FUN=function(x) head(x, 1)))

Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:

library(data.table)

DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
#     id day value
#  1:  1   1    10
#  2:  1   2    15
#  3:  1   3    20
#  4:  2   1    40
#  5:  2   2    30
#  6:  3   2    22
#  7:  3   3    24
#  8:  4   1    11
#  9:  4   2    11
# 10:  4   3    12

Or, starting from scratch, you can use data.table in the following way:

DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]

And, by extension, in base R:

Ranks <- with(mydata, ave(value, id, day, FUN = function(x) 
  rank(x, ties.method="first")))
mydata[Ranks == 1, ]
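
Another base R option, not taken from the answer above, is `duplicated()`. The following is a minimal sketch that assumes a small made-up `mydata` (hypothetical values, not the asker's data.frame) and assumes that "first" means the smallest `value` within each `id`/`day` pair, as in the ordering used above:

```r
# Hypothetical sample data, not the asker's original data.frame
mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  day   = c(1, 1, 2, 1, 1, 1),
  value = c(12, 10, 15, 40, 45, 41)
)

# Sort so that the row to keep comes first within each (id, day) group
sorted <- mydata[order(mydata$id, mydata$day, mydata$value), ]

# duplicated() flags every row after the first for each (id, day) pair,
# so negating it keeps exactly one row per group
first_rows <- sorted[!duplicated(sorted[c("id", "day")]), ]
rownames(first_rows) <- NULL
first_rows
```

This is close in spirit to Stata's `bys id day : keep if _n == 1`: the sort decides which observation counts as the first one, and the filter only drops the rest.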


遥遥无期

2020-12-15 01:03
The dplyr package makes this kind of thing easier.
```
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
```
This command requires more memory in R than in Stata: rows are not dropped in place; instead, a new copy of the dataset is created.
被撕碎了的回忆

2020-12-15 01:09
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
```
library(data.table)
mydata <- data.table(mydata)
mydata <- mydata[, .SD[1], by = .(id, day)]
```
Using dplyr with magrittr pipes:
```
library(dplyr)
mydata <- mydata %>%
  group_by(id, day) %>%
  slice(1) %>%
  ungroup()
```
If you don't add ungroup() to the end, dplyr's grouping structure will still be present and might interfere with some of your subsequent functions.
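
To see what the leftover grouping does in practice, here is a small sketch. It assumes a made-up `mydata` (hypothetical values) and only runs the dplyr part if the package is installed. The names `grouped`, `ungrouped`, `n_grouped` and `n_ungrouped` are illustrative only.

```r
# Hypothetical sample data, not the asker's original data.frame
mydata <- data.frame(
  id    = c(1, 1, 2, 2),
  day   = c(1, 1, 1, 1),
  value = c(10, 12, 40, 45)
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # slice(1) keeps the first row per group but leaves the grouping attached
  grouped   <- mydata %>% group_by(id, day) %>% slice(1)
  ungrouped <- grouped %>% ungroup()

  group_vars(grouped)    # still grouped by id and day
  group_vars(ungrouped)  # no grouping variables left

  # The same summarise() call behaves differently on the two objects:
  # one row per remaining group versus a single overall row
  n_grouped   <- nrow(summarise(grouped,   total = sum(value)))
  n_ungrouped <- nrow(summarise(ungrouped, total = sum(value)))
}
```

With the grouping still attached, later verbs such as `summarise()` or `mutate()` keep operating per `id`/`day` group, which is usually not what you want once the duplicates are gone.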