Rolling Mean/standard deviation with conditions

Question

I have a question about computing a rolling mean and rolling standard deviation by group. It is mostly a syntax question, but since I think it is slowing down my code quite a bit, I thought I should ask here to find out what is going on. I have some finance data with columns such as Stock Name, Midquotes, etc., and I would like to compute the rolling mean and rolling standard deviation for each stock.

Right now I want to compute the volatility of each stock, taken as the rolling standard deviation of the previous 20 midquotes. After searching Stack Overflow, I found the following line, which uses the data.table package:

DT[, volatility:=( roll_sd(DT$Midquotes, 20, fill=0, align = "right") ), by = Stock]

Here DT is the data.table that contains all my data.

This is quite slow, especially compared with a plain rolling standard deviation computed without any grouping:

DT$volatility <- roll_sd(DT$Midquotes, 20, fill=0, align = "right")

But when I try the same assignment style with the grouped rolling standard deviation, R will not let me:

DT$volatility <- DT[, ( roll_sd(DT$Midquotes, 20, fill=0, align = "right") ), by = Stock]

This line fails with an error:

Error: cannot allocate vector of size 10.9 Gb

So I was just wondering, why is this line: DT[, volatility:=( roll_sd(DT$Midquotes, 20, fill=0, align = "right") ), by = Stock] so slow? Is it perhaps making a copy of the entire data.table each time the rolling standard deviation is computed for each different stock?

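Answer 1:

I think the problem is how you combine := with by while referring to DT$Midquotes inside the square brackets: DT$Midquotes is always the full column, not the current group's rows, so roll_sd runs over the entire column once per stock. I assume your setup is something like: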

> library(data.table)
> set.seed(83385668)
> DT <- data.table(
+   x     = rnorm(5 * 3), 
+   stock = c(sapply(letters[1:3], rep, times = 5)),
+   time  = c(replicate(3, 1:5)))
> DT
              x stock time
 1:  0.25073356     a    1
 2: -0.24408170     a    2
 3: -0.87475856     a    3
 4:  0.50843761     a    4
 5: -1.91331773     a    5
 6:  0.07850094     b    1
 7: -0.15922989     b    2
 8:  1.09806870     b    3
 9:  0.27995610     b    4
10:  0.45090842     b    5
11:  0.03400554     c    1
12: -0.34918734     c    2
13:  2.16602740     c    3
14: -0.04758261     c    4
15:  1.24869663     c    5
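
To illustrate why the DT$ prefix matters, here is a minimal sketch on a hypothetical toy table (the name toy and its values are made up for illustration). Inside the brackets, a bare column name refers to the current group's rows, whereas toy$x always refers to the whole column:

```r
library(data.table)

# Hypothetical toy table: 3 stocks with 5 observations each
toy <- data.table(stock = rep(c("a", "b", "c"), each = 5), x = 1:15)

# A bare `x` inside j is the current group's subset: 5 values per group
n_bare <- toy[, .(n = length(x)), by = stock]

# `toy$x` bypasses the grouping: the full 15-value column for every group
n_full <- toy[, .(n = length(toy$x)), by = stock]

n_bare
n_full
```

With many stocks and a long table, a rolling function applied to the full column is therefore evaluated over all rows once per group, which would be consistent with the slowdown described in the question.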

I am not sure where the roll_sd function is from. However, you can compute e.g. a rolling mean with the zoo library as follows:

> library(zoo)
> setkey(DT, stock, time) # make sure data is sorted by time
> DT[, rollmean := rollmean(x, k = 3, fill = 0, align = "right"), 
+    by = .(stock)]
> DT
              x stock time   rollmean
 1:  0.25073356     a    1  0.0000000
 2: -0.24408170     a    2  0.0000000
 3: -0.87475856     a    3 -0.2893689
 4:  0.50843761     a    4 -0.2034676
 5: -1.91331773     a    5 -0.7598796
 6:  0.07850094     b    1  0.0000000
 7: -0.15922989     b    2  0.0000000
 8:  1.09806870     b    3  0.3391132
 9:  0.27995610     b    4  0.4062650
10:  0.45090842     b    5  0.6096444
11:  0.03400554     c    1  0.0000000
12: -0.34918734     c    2  0.0000000
13:  2.16602740     c    3  0.6169485
14: -0.04758261     c    4  0.5897525
15:  1.24869663     c    5  1.1223805

or equivalently

> DT[, `:=`(rollmean = rollmean(x, k = 3, fill = 0, align = "right")), 
+    by = .(stock)]
> DT
              x stock time   rollmean
 1:  0.25073356     a    1  0.0000000
 2: -0.24408170     a    2  0.0000000
 3: -0.87475856     a    3 -0.2893689
 4:  0.50843761     a    4 -0.2034676
 5: -1.91331773     a    5 -0.7598796
 6:  0.07850094     b    1  0.0000000
 7: -0.15922989     b    2  0.0000000
 8:  1.09806870     b    3  0.3391132
 9:  0.27995610     b    4  0.4062650
10:  0.45090842     b    5  0.6096444
11:  0.03400554     c    1  0.0000000
12: -0.34918734     c    2  0.0000000
13:  2.16602740     c    3  0.6169485
14: -0.04758261     c    4  0.5897525
15:  1.24869663     c    5  1.1223805
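
The same pattern should carry over to a rolling standard deviation. Below is a minimal sketch, assuming zoo's generic rollapply with sd as the window function and a hypothetical toy table. It uses fill = NA instead of 0 so that incomplete windows are not mistaken for zero volatility:

```r
library(data.table)
library(zoo)

# Hypothetical toy data: 2 stocks, 5 ordered observations each
toy <- data.table(
  stock = rep(c("a", "b"), each = 5),
  time  = rep(1:5, times = 2),
  x     = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)
setkey(toy, stock, time)  # sort by stock, then time

# Rolling sd over the last 3 observations, computed within each stock
toy[, rollsd := rollapply(x, width = 3, FUN = sd, fill = NA, align = "right"),
    by = .(stock)]
toy
```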

Answer 2:

There is now also a rolling mean function within data.table itself (frollmean); please see the GitHub discussion for details. Using it is straightforward.

DT[, rollmean := data.table::frollmean(x, n = 3, fill = 0, align = "right"), 
by = .(stock)]

A quick benchmark of the two shows that the data.table version is a bit quicker (most of the time).

library(microbenchmark)

microbenchmark(
  a = DT[, rollmean := data.table::frollmean(x, n = 3, fill = 0, align = "right"),
         by = .(stock)],
  b = DT[, rollmean := zoo::rollmean(x, k = 3, fill = 0, align = "right"),
         by = .(stock)],
  times = 100L
)

Unit: milliseconds
 expr    min      lq     mean  median     uq     max neval cld
   a 1.5695 1.66605 2.329675 1.79340 2.1980 39.3750   100  a 
   b 2.6711 2.82105 3.660617 2.99725 4.3577 20.3178   100   b
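
The question is about a rolling standard deviation, which frollmean does not provide directly. One possible sketch uses data.table's frollapply with sd as the window function, assuming a data.table version that ships frollapply. The toy table is hypothetical:

```r
library(data.table)

# Hypothetical toy data: 2 stocks, 5 observations each
toy <- data.table(
  stock = rep(c("a", "b"), each = 5),
  x     = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)

# Window of 3, right-aligned; incomplete windows are filled with NA by default
toy[, rollsd := frollapply(x, 3, sd, align = "right"), by = .(stock)]
toy
```

Since frollapply calls an R function for every window, it is not expected to be as fast as the specialised frollmean.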

Answer 3:

I ran into the same problem when calculating a rolling standard deviation in my own data processing, which is how I found this question. I think your problem is using DT$Midquotes rather than .SD$Midquotes. .SD is a data.table containing the Subset of x’s Data for each group. The roll_sd function is from the package "RcppRoll". You can try it this way:

DT[, (sd = roll_sd(.SD$Midquotes, 20, fill=0, align = "right")), by = .(Stock)]
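
The line above returns a new data.table rather than adding a column to DT. Below is a minimal sketch of a by-reference version, assuming RcppRoll is installed and using a hypothetical toy table with the question's column names. It uses a window of 3 instead of 20 to suit the small example. A bare column name refers to the same per-group values as .SD$Midquotes:

```r
library(data.table)
library(RcppRoll)

# Hypothetical toy data using the question's column names
DT <- data.table(
  Stock     = rep(c("A", "B"), each = 5),
  Midquotes = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)

# := adds the column in place; Midquotes is the current group's subset
DT[, volatility := roll_sd(Midquotes, n = 3, fill = NA, align = "right"),
   by = Stock]
DT
```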

Source: https://stackoverflow.com/questions/46438975/rolling-mean-standard-deviation-with-conditions

Tags

data.table

moving-average