How to extract first n rows per group and calculate function using that subset?

问题

My question is very similar to this one: How to extract the first n rows per group?

dt
         date age     name       val
1: 2000-01-01   3   Andrew  93.73546
2: 2000-01-01   4      Ben 101.83643
3: 2000-01-01   5  Charlie  91.64371
4: 2000-01-02   6     Adam 115.95281
5: 2000-01-02   7      Bob 103.29508
6: 2000-01-02   8 Campbell  91.79532

We have a dt and I've added an extra column named val. First, we want to extract the first n rows within each group. The solutions from the link provided are:

dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed

My question is how do I apply a function to the first n rows within each group if that function depends on the subsetted information. I am trying to apply something like this:

  # uses other columns for results/ is dependent on subsetted rows
  # but keep it simple for replication
do_something <- function(dt){
  res <- ifelse(cumsum(dt$val) > 200, 1, 0)  
  return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]

          date age   name       val funcVal
1: 2000-01-01   3 Andrew  93.73546       0
2: 2000-01-01   4    Ben 101.83643       1
3: 2000-01-02   6   Adam 115.95281       0
4: 2000-01-02   7    Bob 103.29508       1

Am I going about this wrong? Is there a more efficient way to do this? I cannot seem to figure out how to apply the "for speed" solution to this. Is there a way to do this without saving the subset-ed results first and applying the function to the first 2 rows by date right away?

Any help is appreciated and below is the code to produce the data above:

date <- c("2000-01-01","2000-01-01","2000-01-01",
          "2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
val <- val <- rnorm(6,100,10)
dt <- data.table(date, age, name,val)

回答1:

In case there's more than one grouping column, it might be more efficient to collapse to one:

m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]

This is just an extension to @eddi's approach (of keeping row numbers .I, seen in @akrun's answer) to also keep group counter .GRP.

Re OP's comment that they're more concerned about the function, well, borrowing from @akrun, there's ...

ff = function(x) as.integer(cumsum(x[[1]]) > 200)

Assuming all values are nonnegative, you could probably handle this in C more efficiently, since the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, that will hardly matter, though.

My impression is that this is a dummy function so there's no point going there. Many efficiency improvements that I usually think of are contingent on the function and data.

回答2:

We can use as.integer on the cumsum to coerce the logical to binary. Extract the row index, specify it as i, grouped by 'date', apply the function on the 'val' column

f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1
dt[i1, newcol := f1(val), date]

来源：https://stackoverflow.com/questions/52932682/how-to-extract-first-n-rows-per-group-and-calculate-function-using-that-subset

标签

data.table