Retain and lag functions in R, as in SAS

礼貌的吻别 2021-01-05 10:05

I am looking for a function in R similar to lag1, lag2 and retain functions in SAS which I can use with data.tables.

I know th

4 answers

情歌与酒 (original poster)

2021-01-05 10:38

I would say the closest equivalent to retain, lag1, and lag2 would be the Lag function in the quantmod package.

It's very easy to use with data.tables. E.g.:

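library(data.table)
library(quantmod)  # provides Lag() and Next()
d <- data.table(v1 = c(rep('a', 10), rep('b', 10)), v2 = 1:20)
setkeyv(d, 'v1')
# Lag/lead within each v1 group; results are padded with NA
d[, new_var := Lag(v2, 1), by = 'v1']        # previous value
d[, new_var2 := v2 - Lag(v2, 3), by = 'v1']  # difference from three rows back
d[, new_var3 := Next(v2, 2), by = 'v1']      # value two rows ahead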

This yields the following:

print(d)
    v1 v2 new_var new_var2 new_var3
 1:  a  1      NA       NA        3
 2:  a  2       1       NA        4
 3:  a  3       2       NA        5
 4:  a  4       3        3        6
 5:  a  5       4        3        7
 6:  a  6       5        3        8
 7:  a  7       6        3        9
 8:  a  8       7        3       10
 9:  a  9       8        3       NA
10:  a 10       9        3       NA
11:  b 11      NA       NA       13
12:  b 12      11       NA       14
13:  b 13      12       NA       15
14:  b 14      13        3       16
15:  b 15      14        3       17
16:  b 16      15        3       18
17:  b 17      16        3       19
18:  b 18      17        3       20
19:  b 19      18        3       NA
20:  b 20      19        3       NA

As you can see, Lag lets you look back and Next lets you look forward. Both functions are nice because they pad the result with NAs such that it has the same length as the input.
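If you'd rather not depend on quantmod, the same three columns can probably be built with data.table's own shift(). This is only a minimal sketch, assuming a data.table version that ships shift() (1.9.6 or later, as far as I know); the column names lag1, diff3 and lead2 are made up for illustration.

```r
library(data.table)

# Same toy data as above: two groups of ten rows each.
d <- data.table(v1 = rep(c("a", "b"), each = 10), v2 = 1:20)

# shift() pads with NA by default, so each result keeps the length of its group.
d[, lag1  := shift(v2, n = 1, type = "lag"), by = v1]       # previous value
d[, diff3 := v2 - shift(v2, n = 3, type = "lag"), by = v1]  # difference from three rows back
d[, lead2 := shift(v2, n = 2, type = "lead"), by = v1]      # value two rows ahead

print(d)
```

Like Lag and Next, shift() works per group when combined with by, so lags never leak across the boundary between 'a' and 'b'.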

If you want to get even fancier, and higher-performance, you can look into rolling joins with data.table objects. This is a little bit different than what you are asking for, but is conceptually related, and so powerful and awesome I have to share.

Start with a data.table:

library(data.table)
library(quantmod)
set.seed(42)
d1 <- data.table(
    id=c(rep('a', 10), rep('b', 10)), 
    time=rep(1:10,2), 
    value=runif(20))
setkeyv(d1, c('id', 'time'))
print(d1)

    id time     value
 1:  a    1 0.9148060
 2:  a    2 0.9370754
 3:  a    3 0.2861395
 4:  a    4 0.8304476
 5:  a    5 0.6417455
 6:  a    6 0.5190959
 7:  a    7 0.7365883
 8:  a    8 0.1346666
 9:  a    9 0.6569923
10:  a   10 0.7050648
11:  b    1 0.4577418
12:  b    2 0.7191123
13:  b    3 0.9346722
14:  b    4 0.2554288
15:  b    5 0.4622928
16:  b    6 0.9400145
17:  b    7 0.9782264
18:  b    8 0.1174874
19:  b    9 0.4749971
20:  b   10 0.5603327

You have another data.table you want to join, but not all time indexes are present in the second table:

d2 <- data.table(
        id=sample(c('a', 'b'), 5, replace=TRUE), 
        time=sample(1:10, 5), 
        value2=runif(5))
setkeyv(d2, c('id', 'time'))
print(d2)
   id time      value2
1:  a    4 0.811055141
2:  a   10 0.003948339
3:  b    6 0.737595618
4:  b    8 0.388108283
5:  b    9 0.685169729

A regular merge yields lots of missing values:

d2[d1,,roll=FALSE]
    id time      value2     value
 1:  a    1          NA 0.9148060
 2:  a    2          NA 0.9370754
 3:  a    3          NA 0.2861395
 4:  a    4 0.811055141 0.8304476
 5:  a    5          NA 0.6417455
 6:  a    6          NA 0.5190959
 7:  a    7          NA 0.7365883
 8:  a    8          NA 0.1346666
 9:  a    9          NA 0.6569923
10:  a   10 0.003948339 0.7050648
11:  b    1          NA 0.4577418
12:  b    2          NA 0.7191123
13:  b    3          NA 0.9346722
14:  b    4          NA 0.2554288
15:  b    5          NA 0.4622928
16:  b    6 0.737595618 0.9400145
17:  b    7          NA 0.9782264
18:  b    8 0.388108283 0.1174874
19:  b    9 0.685169729 0.4749971
20:  b   10          NA 0.5603327
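For comparison, here is a small sketch of the same kind of exact-match join written two other ways: an ad-hoc join with on= (no keys needed) and a plain merge(). The tables are tiny and hand-typed, not the random data above, and the names j1 and j2 are just placeholders.

```r
library(data.table)

# Small hand-typed tables (hypothetical values, not the random data above).
d1 <- data.table(id = rep(c("a", "b"), each = 4),
                 time = rep(1:4, 2),
                 value = 1:8 / 10)
d2 <- data.table(id = c("a", "b"), time = c(2L, 3L), value2 = c(100, 200))

# Ad-hoc join with on=: for every row of d1, look up the exact match in d2.
j1 <- d2[d1, on = .(id, time)]

# The same left join written with merge().
j2 <- merge(d1, d2, by = c("id", "time"), all.x = TRUE)

print(j1)
```

Either way, rows of d1 without an exact (id, time) match in d2 get NA for value2.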

However, data.table lets you roll the secondary index (time) forward within the primary index (id):

d2[d1,,roll=TRUE]
    id time      value2     value
 1:  a    1          NA 0.9148060
 2:  a    2          NA 0.9370754
 3:  a    3          NA 0.2861395
 4:  a    4 0.811055141 0.8304476
 5:  a    5 0.811055141 0.6417455
 6:  a    6 0.811055141 0.5190959
 7:  a    7 0.811055141 0.7365883
 8:  a    8 0.811055141 0.1346666
 9:  a    9 0.811055141 0.6569923
10:  a   10 0.003948339 0.7050648
11:  b    1          NA 0.4577418
12:  b    2          NA 0.7191123
13:  b    3          NA 0.9346722
14:  b    4          NA 0.2554288
15:  b    5          NA 0.4622928
16:  b    6 0.737595618 0.9400145
17:  b    7 0.737595618 0.9782264
18:  b    8 0.388108283 0.1174874
19:  b    9 0.685169729 0.4749971
20:  b   10 0.685169729 0.5603327

This is pretty damn cool: old observations are rolled forward in time until they are replaced by new ones. If you want to replace the NA values at the beginning of the series, you can do so by rolling the first observation backwards:

d2[d1,,roll=TRUE, rollends=c(TRUE, TRUE)]
    id time      value2     value
 1:  a    1 0.811055141 0.9148060
 2:  a    2 0.811055141 0.9370754
 3:  a    3 0.811055141 0.2861395
 4:  a    4 0.811055141 0.8304476
 5:  a    5 0.811055141 0.6417455
 6:  a    6 0.811055141 0.5190959
 7:  a    7 0.811055141 0.7365883
 8:  a    8 0.811055141 0.1346666
 9:  a    9 0.811055141 0.6569923
10:  a   10 0.003948339 0.7050648
11:  b    1 0.737595618 0.4577418
12:  b    2 0.737595618 0.7191123
13:  b    3 0.737595618 0.9346722
14:  b    4 0.737595618 0.2554288
15:  b    5 0.737595618 0.4622928
16:  b    6 0.737595618 0.9400145
17:  b    7 0.737595618 0.9782264
18:  b    8 0.388108283 0.1174874
19:  b    9 0.685169729 0.4749971
20:  b   10 0.685169729 0.5603327
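One more knob worth knowing: as far as I understand the docs, roll also accepts a positive number that limits how far an old observation is carried forward. A minimal sketch on hand-typed data (the tables and the names unlimited and limited are made up for illustration; on= is used instead of keys):

```r
library(data.table)

# Hypothetical data: d2 has observations only at time 2 and time 5.
d1 <- data.table(id = "a", time = 1:5)
d2 <- data.table(id = "a", time = c(2L, 5L), value2 = c(10, 20))

# roll = TRUE carries each d2 value forward without limit.
unlimited <- d2[d1, on = .(id, time), roll = TRUE]

# roll = 1 carries a value forward by at most one time unit,
# so time 4 (two units after the observation at time 2) stays NA.
limited <- d2[d1, on = .(id, time), roll = 1]

print(unlimited)
print(limited)
```

A limit like this is handy when a stale observation should stop counting after a while instead of being carried forward forever.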

These rolling joins are absolutely incredible, and I've never seen them implemented in any other open source package (see ?data.table for more info). It will take a little while to turn off your "SAS brain" and turn on your "R brain", but once you get over that initial hump you'll find that the language is much more expressive.
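For the retain side of the question, two patterns seem to cover most of what retain gets used for in SAS: carrying the last non-missing value forward, and keeping a running total. Here is a rough sketch, assuming a data.table version that provides nafill() and fifelse() (1.12.4 or later, I believe); the column names x_locf and x_sum are hypothetical.

```r
library(data.table)

# Hypothetical data with gaps in x.
d <- data.table(id = rep(c("a", "b"), each = 4),
                x  = c(NA, 1, NA, NA, 5, NA, 7, NA))

# Last observation carried forward within each id; leading NAs stay NA.
d[, x_locf := nafill(x, type = "locf"), by = id]

# Running total within each id, treating missing values as 0.
d[, x_sum := cumsum(fifelse(is.na(x), 0, x)), by = id]

print(d)
```

Note that nafill() is meant for numeric columns, so a character column would need a different approach.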
