I have a question about computing a rolling mean/standard deviation by group. It is mostly a syntax question, but since I think it is slowing down my code considerably, I thought I should ask here to find out what is going on. I have some finance data with columns such as Stock Name, Midquotes, etc., and I would like to compute the rolling mean and rolling standard deviation for each stock.
Right now I wish to compute the volatility of each stock, which I do by taking the rolling standard deviation of the previous 20 midquotes. After searching Stack Overflow, I found a line using the data.table package as follows:
DT[, volatility:=( roll_sd(DT$Midquotes, 20, fill=0, align = "right") ), by = Stock]
where DT is the data.table that contains all my data.
Now, this is quite computationally slow, especially when I compare it to a typical rolling standard deviation calculation without any conditions as given here:
DT$volatility <- roll_sd(DT$Midquotes, 20, fill=0, align = "right")
But when I try to do something similar with the rolling standard deviation with a condition, R will not let me do this:
DT$volatility <- DT[, ( roll_sd(DT$Midquotes, 20, fill=0, align = "right") ), by = Stock]
This line comes up with an error:
Error: cannot allocate vector of size 10.9 Gb
So I was just wondering, why is this line: DT[, volatility:=( roll_sd(DT$Midquotes, 20, fill=0, align = "right") ), by = Stock] so slow? Is it perhaps making a copy of the entire data.table each time the rolling standard deviation is computed for each different stock?
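I think your problem is the combination of the := function with a reference to DT inside the square brackets. DT$Midquotes always refers to the entire column, not just the rows of the current stock, so roll_sd is run over the whole table once per stock. I assume your setup is something like: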
> library(data.table)
> set.seed(83385668)
> DT <- data.table(
+ x = rnorm(5 * 3),
+ stock = c(sapply(letters[1:3], rep, times = 5)),
+ time = c(replicate(3, 1:5)))
> DT
              x stock time
 1:  0.25073356     a    1
 2: -0.24408170     a    2
 3: -0.87475856     a    3
 4:  0.50843761     a    4
 5: -1.91331773     a    5
 6:  0.07850094     b    1
 7: -0.15922989     b    2
 8:  1.09806870     b    3
 9:  0.27995610     b    4
10:  0.45090842     b    5
11:  0.03400554     c    1
12: -0.34918734     c    2
13:  2.16602740     c    3
14: -0.04758261     c    4
15:  1.24869663     c    5
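To see why referring to DT inside the brackets matters, here is a minimal sketch on hypothetical toy data (the values and the n_seen column name are made up for illustration). It only counts how many values each expression sees per group; a rolling function applied to DT$x would likewise be handed the full column once per stock.

```r
library(data.table)

# Hypothetical toy data: 3 stocks with 5 observations each.
DT <- data.table(
  x     = as.numeric(1:15),
  stock = rep(c("a", "b", "c"), each = 5)
)

# Inside j, a bare column name is the current group's subset ...
per_group <- DT[, .(n_seen = length(x)), by = stock]

# ... whereas DT$x bypasses the grouping and is always the full column.
whole_col <- DT[, .(n_seen = length(DT$x)), by = stock]

print(per_group)  # n_seen is 5 for every stock
print(whole_col)  # n_seen is 15 for every stock
```

With many stocks and many rows, repeating a whole-column computation for every group is likely what makes the original line slow.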
I am not sure where the roll_sd function is from. However, you can compute e.g. a rolling mean with the zoo library as follows:
> library(zoo)
> setkey(DT, stock, time) # make sure data is sorted by time
> DT[, rollmean := rollmean(x, k = 3, fill = 0, align = "right"),
+ by = .(stock)]
> DT
              x stock time   rollmean
 1:  0.25073356     a    1  0.0000000
 2: -0.24408170     a    2  0.0000000
 3: -0.87475856     a    3 -0.2893689
 4:  0.50843761     a    4 -0.2034676
 5: -1.91331773     a    5 -0.7598796
 6:  0.07850094     b    1  0.0000000
 7: -0.15922989     b    2  0.0000000
 8:  1.09806870     b    3  0.3391132
 9:  0.27995610     b    4  0.4062650
10:  0.45090842     b    5  0.6096444
11:  0.03400554     c    1  0.0000000
12: -0.34918734     c    2  0.0000000
13:  2.16602740     c    3  0.6169485
14: -0.04758261     c    4  0.5897525
15:  1.24869663     c    5  1.1223805
or equivalently
> DT[, `:=`(rollmean = rollmean(x, k = 3, fill = 0, align = "right")),
+ by = .(stock)]
> DT
              x stock time   rollmean
 1:  0.25073356     a    1  0.0000000
 2: -0.24408170     a    2  0.0000000
 3: -0.87475856     a    3 -0.2893689
 4:  0.50843761     a    4 -0.2034676
 5: -1.91331773     a    5 -0.7598796
 6:  0.07850094     b    1  0.0000000
 7: -0.15922989     b    2  0.0000000
 8:  1.09806870     b    3  0.3391132
 9:  0.27995610     b    4  0.4062650
10:  0.45090842     b    5  0.6096444
11:  0.03400554     c    1  0.0000000
12: -0.34918734     c    2  0.0000000
13:  2.16602740     c    3  0.6169485
14: -0.04758261     c    4  0.5897525
15:  1.24869663     c    5  1.1223805
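Since the question asks for a rolling standard deviation rather than a mean, the same pattern can be sketched with zoo::rollapplyr, which applies an arbitrary function (here sd) over a right-aligned window. The data values and the rollsd column name below are hypothetical, and the window of 3 is only chosen to fit the toy data.

```r
library(data.table)
library(zoo)

# Hypothetical toy data: two stocks, five time points each.
DT <- data.table(
  x     = c(1, 2, 4, 7, 11, 3, 3, 3, 3, 3),
  stock = rep(c("a", "b"), each = 5),
  time  = rep(1:5, times = 2)
)
setkey(DT, stock, time)  # make sure data is sorted by time within stock

# Rolling sd over the current and two previous observations, per stock.
# The first two rows of each stock have no full window and receive fill.
DT[, rollsd := rollapplyr(x, width = 3, FUN = sd, fill = 0), by = stock]
print(DT)
```

Using fill = NA instead of fill = 0 may be worth considering, since a zero could be mistaken for a genuinely zero volatility.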
There is now also a rolling mean function within data.table itself (frollmean); please see the GitHub discussion for details. Using it is straightforward:
DT[, rollmean := data.table::frollmean(x, n = 3, fill = 0, align = "right"),
by = .(stock)]
A quick benchmark of the two shows that the data.table version is a bit quicker (most of the time).
library(microbenchmark)
microbenchmark(
  a = DT[, rollmean := data.table::frollmean(x, n = 3, fill = 0, align = "right"),
         by = .(stock)],
  b = DT[, rollmean := zoo::rollmean(x, k = 3, fill = 0, align = "right"),
         by = .(stock)],
  times = 100L
)
Unit: milliseconds
 expr    min      lq     mean  median     uq     max neval cld
    a 1.5695 1.66605 2.329675 1.79340 2.1980 39.3750   100  a
    b 2.6711 2.82105 3.660617 2.99725 4.3577 20.3178   100   b
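For the rolling standard deviation asked about in the question, data.table also offers frollapply, which applies a user-supplied function over a rolling window. The sketch below assumes a data.table version that includes frollapply (1.12.4 or later, as far as I know) and uses hypothetical toy data and a hypothetical rollsd column name.

```r
library(data.table)

# Hypothetical toy data: two stocks, five observations each, already in time order.
DT <- data.table(
  x     = c(1, 2, 4, 7, 11, 3, 3, 3, 3, 3),
  stock = rep(c("a", "b"), each = 5)
)

# Rolling sd over a right-aligned window of 3, computed separately per stock.
# Rows without a full window receive the fill value.
DT[, rollsd := frollapply(x, 3, sd, fill = 0, align = "right"), by = stock]
print(DT)
```

Because frollapply calls an R function for every window, it is probably slower than the specialised frollmean, so it is worth benchmarking on real data.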
I ran into the same problem when calculating a rolling standard deviation in my own data processing, which is how I found this question. I think your problem is using DT$Midquotes rather than .SD$Midquotes. .SD is a data.table containing the Subset of x's Data for each group. The roll_sd function is from the RcppRoll package. You can try it this way:
DT[, (sd = roll_sd(.SD$Midquotes, 20, fill=0, align = "right")), by = .(Stock)]
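Note that the line above returns a new table rather than adding a column to DT. A minimal sketch of the in-place variant follows, assuming RcppRoll is installed; the toy values are hypothetical, the column names mirror those in the question, and the window is 3 instead of 20 only so that it fits the toy data. Inside j, the bare column name Midquotes already refers to the current group's values, so .SD$ is not strictly needed.

```r
library(data.table)
library(RcppRoll)

# Hypothetical toy data mirroring the question's column names.
DT <- data.table(
  Midquotes = c(1, 2, 4, 7, 11, 3, 3, 3, 3, 3),
  Stock     = rep(c("a", "b"), each = 5)
)

# Add the column by reference; Midquotes is the per-stock subset here,
# not the whole column as DT$Midquotes would be.
DT[, volatility := roll_sd(Midquotes, n = 3, fill = 0, align = "right"), by = Stock]
print(DT)
```

For the real data, n = 20 would replace n = 3, and the rows should be sorted by time within each stock beforehand.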
Source: https://stackoverflow.com/questions/46438975/rolling-mean-standard-deviation-with-conditions