R - Keep first observation per group identified by multiple variables (Stata equivalent “bys var1 var2 : keep if _n == 1”)

孤独总比滥情好 2020-12-15 00:27

I am facing a problem in R that I know exactly how to solve in Stata, but I have wasted over two hours trying to accomplish it in R.

Using the data.frame below, the

3 Answers
  • 2020-12-15 00:54

    I would first order the data.frame; once it is sorted, you can use by() to take the first row of each group:

    mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
    
    do.call(rbind, by(mydata, list(mydata$id, mydata$day), 
                      FUN=function(x) head(x, 1)))
    

    Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:

    library(data.table)
    
    DT <- data.table(mydata, key = "id,day")
    DT[, head(.SD, 1), by = key(DT)]
    #     id day value
    #  1:  1   1    10
    #  2:  1   2    15
    #  3:  1   3    20
    #  4:  2   1    40
    #  5:  2   2    30
    #  6:  3   2    22
    #  7:  3   3    24
    #  8:  4   1    11
    #  9:  4   2    11
    # 10:  4   3    12
    

    Or, starting from scratch, you can use data.table in the following way:

    DT <- data.table(id, day, value, key = "id,day")
    DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
    

    And, by extension, in base R:

    Ranks <- with(mydata, ave(value, id, day, FUN = function(x) 
      rank(x, ties.method="first")))
    mydata[Ranks == 1, ]
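
    To make the base R version above easier to try, here is a minimal, self-contained sketch. The question's data.frame is not shown in full, so the values below are made up for illustration; only the column names id, day and value come from the answer.

```r
# Hypothetical data: these values are an assumption, not the asker's data.
mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  day   = c(1, 1, 2, 1, 1, 1),
  value = c(12, 10, 15, 40, 45, 50)
)

# Rank value within each (id, day) group. ties.method = "first" ensures
# that exactly one row per group receives rank 1, even when values tie.
Ranks <- with(mydata, ave(value, id, day,
                          FUN = function(x) rank(x, ties.method = "first")))

# Keep only the rank-1 row of every group.
first_rows <- mydata[Ranks == 1, ]
print(first_rows)
```

    Note that this keeps the row with the smallest value in each group rather than the row that happens to come first in the current row order, so it does not require sorting beforehand.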
    
  • 2020-12-15 01:03

    The dplyr package makes this kind of thing easier.

    library(dplyr)
    mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
    

    Note that this requires more memory in R than in Stata: rows are not dropped in place; instead, a new copy of the dataset is created.

  • 2020-12-15 01:09

    Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:

    library(data.table)
    mydata <- data.table(mydata)
    mydata <- mydata[, .SD[1], by = .(id, day)]
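
    Because .SD[1] simply takes whichever row comes first in each group, the result depends on the sort order. The sketch below, using made-up data (an assumption, since the original data is not shown), sorts with data.table's setorder() before taking the first row per group:

```r
library(data.table)

# Hypothetical data, deliberately unsorted within each (id, day) group.
DT <- data.table(
  id    = c(1, 1, 2, 2),
  day   = c(1, 1, 1, 1),
  value = c(12, 10, 45, 40)
)

# Sort by reference so that the first row of each group is the one
# with the smallest value.
setorder(DT, id, day, value)

# Take the first row of every (id, day) group.
first_rows <- DT[, .SD[1], by = .(id, day)]
print(first_rows)
```

    Choosing a different column order in setorder() would change which row counts as "first", which is the analogue of the sort variables in Stata's bys prefix.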
    

    Using dplyr with magrittr pipes:

    library(dplyr)
    mydata <- mydata %>%
      group_by(id, day) %>%
      slice(1) %>%
      ungroup()
    

    If you don't add ungroup() at the end, dplyr's grouping structure will still be attached to the result and may change the behavior of subsequent functions.
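
    A small sketch of why the leftover grouping matters, again with made-up data (the values and the object names kept_grouped and kept_flat are assumptions for illustration):

```r
library(dplyr)

# Hypothetical data; only the column names come from the answers above.
mydata <- data.frame(
  id    = c(1, 1, 2, 2),
  day   = c(1, 1, 1, 2),
  value = c(10, 12, 40, 30)
)

# Without ungroup(), the result is still grouped by id and day.
kept_grouped <- mydata %>% group_by(id, day) %>% slice(1)

# With ungroup(), the result is a plain, ungrouped tibble.
kept_flat <- kept_grouped %>% ungroup()

is_grouped_df(kept_grouped)
is_grouped_df(kept_flat)

# The same summary behaves differently: per group versus over all rows.
per_group <- summarise(kept_grouped, total = sum(value))
overall   <- summarise(kept_flat, total = sum(value))
```

    Here per_group has one row for each (id, day) combination, while overall has a single row for the whole table.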
