R data.table grouping for lagged regression

故里飘歌 2020-12-05 11:59

I have a table with data (it's a data.table object) that looks like the following:

      date         stock_id logret
   1: 2011-01-01        1  0.001
   2: 2011-01-0         


        
3 answers
  •  慢半拍i (original poster)
    2020-12-05 12:00

    Some additional notes, prompted by Alex's comment. The code is hard to follow because a lot happens in a single line, so it helps to break it down step by step.

    What do we actually want? We want a new column, lagret. The data.table syntax for adding a new column is:

    DT[, lagret := xxx]
    

    where xxx is replaced by whatever should go into the column lagret. For example, if we just wanted a column of row numbers, we could call

    DT[, lagret := seq(from=1, to=nrow(DT))]
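
    A minimal sketch of adding a column by reference with :=. The toy table below (2 stocks, 6 consecutive days) and the column names row_id and row_id2 are assumptions made for illustration; they are not the asker's data.

```r
library(data.table)

# Hypothetical toy data: 2 stocks x 6 consecutive days (an assumption,
# shaped like the table in the question).
DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, times = 2),
  stock_id = rep(1:2, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 12)
)

# := modifies DT in place, so no `DT <- ...` reassignment is needed.
DT[, row_id := seq_len(nrow(DT))]

# .I is data.table's built-in row-number symbol; it yields the same values.
DT[, row_id2 := .I]
```

    Because := works by reference, DT itself gains the new columns; the call's return value can be ignored.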
    

    Here we want the lagged value of logret, but we have to take into account that the table contains many stocks. That's why we do a self-join: we join the data.table DT with itself on the columns stock_id and date, and since we want each stock's value from the previous day, we look up date-1. Note that the keys have to be set before such a join:

    setkeyv(DT,c('stock_id','date'))
    DT[list(stock_id,date-1)]
        stock_id       date logret
     1:        1 2010-12-31     NA
     2:        1 2011-01-01  0.001
     3:        1 2011-01-02  0.003
     4:        1 2011-01-03  0.005
     5:        1 2011-01-04  0.007
     6:        1 2011-01-05  0.009
    ...
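
    The same self-join can be sketched with an explicit lookup table and the on= argument, which makes it visible which values are being matched. The toy data and the name prev are assumptions for illustration only.

```r
library(data.table)

# Hypothetical toy data: 2 stocks x 6 consecutive days.
DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, times = 2),
  stock_id = rep(1:2, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 12)
)
setkeyv(DT, c("stock_id", "date"))

# One lookup row per row of DT: same stock, previous calendar day.
prev <- DT[, .(stock_id, date = date - 1)]

# Join DT to the lookup rows. Rows of prev without a match in DT
# (the day before each stock's first date) get logret = NA.
joined <- DT[prev, on = .(stock_id, date)]
```

    The result has one row per row of prev, in the same order, so row k of joined holds the previous day's logret for row k of DT.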
    

    As you can see, we now have what we want: logret is lagged by one period. But we want it as a new column lagret in DT, so we extract that column with [[3L]] (which simply means "give me the third column") and assign it to lagret:

    DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
              date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
     4: 2011-01-04        1  0.007  0.005
     5: 2011-01-05        1  0.009  0.007
    ...
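
    A hedged note for newer installations: in recent data.table releases, as far as I can tell, DT[i, logret] returns a plain vector rather than a three-column table, so [[3L]] would pick a single element instead of a column. The sketch below assigns the joined vector directly; the toy data and the names prev and lag_vals are assumptions.

```r
library(data.table)

# Hypothetical toy data: 2 stocks x 6 consecutive days.
DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, times = 2),
  stock_id = rep(1:2, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 12)
)
setkeyv(DT, c("stock_id", "date"))

# Lookup rows: same stock, previous calendar day.
prev <- DT[, .(stock_id, date = date - 1)]

# Selecting the single column logret in j yields a vector with one
# element per row of prev (NA where there is no match).
lag_vals <- DT[prev, logret, on = .(stock_id, date)]

# Attach it as the new column, by reference.
DT[, lagret := lag_vals]
```

    With this toy data, lagret is NA on each stock's first day and equals the previous day's logret everywhere else.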
    

    This is already the correct solution. In this simple case we do not need roll=TRUE because there are no gaps in the dates. In a more realistic example (as mentioned above, for instance when weekends are missing) there can be gaps. So let's create such an example by deleting two days for the first stock from DT:

    DT <- DT[-c(4, 5)]
    setkeyv(DT,c('stock_id','date'))
    DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
              date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
     4: 2011-01-06        1  0.011     NA
     5: 2011-01-01        2  0.013     NA
    ...
    

    As you can see, the problem is that lagret is now NA for the 6th of January, because there is no row for the 5th. That's why we use roll=TRUE:

    DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
              date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
     4: 2011-01-06        1  0.011  0.005
     5: 2011-01-01        2  0.013     NA
    ...
    

    Have a look at the documentation to see exactly how roll=TRUE works. In a nutshell: if it can't find the previous value (here logret for the 5th of January), it takes the last available one (here from the 3rd of January).
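
    To make the difference concrete, here is a small sketch comparing the exact join with the rolling join on data that has a gap. The toy data and the names prev, exact, rolled and shifted are assumptions. As a cross-check it also uses shift() by group, which is a different technique (previous row rather than previous date) but should agree with roll=TRUE when each stock's dates are sorted and unique.

```r
library(data.table)

# Hypothetical toy data: 2 stocks x 6 consecutive days.
DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, times = 2),
  stock_id = rep(1:2, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 12)
)

# Create a gap: drop 2011-01-04 and 2011-01-05 for stock 1.
DT <- DT[-c(4, 5)]
setkeyv(DT, c("stock_id", "date"))

# Lookup rows: same stock, previous calendar day.
prev <- DT[, .(stock_id, date = date - 1)]

# Exact match only: the row for 2011-01-06 finds nothing at 2011-01-05.
exact <- DT[prev, logret, on = .(stock_id, date)]

# Rolling join: stock_id must match exactly, while date falls back to the
# most recent earlier observation of the same stock.
rolled <- DT[prev, logret, on = .(stock_id, date), roll = TRUE]

# Cross-check with the previous row within each stock.
shifted <- DT[, shift(logret), by = stock_id]$V1

DT[, lagret := rolled]
```

    In this sketch exact[4] is NA while rolled[4] is 0.005, the value from the 3rd of January. The first row of each stock stays NA in both cases because the roll never crosses from one stock_id to another.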
