Question
I have several variables at annual frequency in R that I would like to include in a regression analysis with other variables available at quarterly frequency. Additionally, I would like to be able to convert the quarterly data back to annual frequency in a way that reproduces the original annual data.
My current approach when converting from low frequency to high frequency time series data is to use the na.spline function in the zoo package. However, I don’t see how to constrain the quarterly data to match the corresponding annual average. As a result, when I convert the data back from quarterly to annual frequency, I get annual values that differ from the original series.
Reproducible example:
library(zoo)
# create annual example series
a <- as.numeric(c("100", "110", "111"))
b <- as.Date(c("2000-01-01", "2001-01-01", "2002-01-01"))
z_a <- zoo(a, b); z_a
# current approach using na.spline in zoo package
end_z <- as.Date(as.yearqtr(end(z_a))+ 3/4)
z_q <- na.spline(z_a, xout = seq(start(z_a), end_z, by = "quarter"), method = "hyman")
# result, with first quarter equal to annual value
c <- merge(z_a, z_q); c
# convert back to annual using aggregate in zoo package
# At this point I would want both series to be equal, but they aren't.
d <- aggregate(c, as.integer(format(index(c),"%Y")), mean, na.rm=TRUE); d
Storing the original annual data is one solution, or I could convert back by taking the first-quarter value as the annual value. But either approach adds complexity because I would need to keep track of which of my quarterly series had originally been converted from annual data.
I would prefer a solution within the zoo or xts packages, but alternative suggestions are also welcome.
Edited to include Approach #1 Proposed by G. Grothendieck
# Approach 1
yr <- format(time(c), "%Y")
c$z_q_adj <- ave(coredata(c$z_q), yr, FUN = function(x) x - mean(x) + x[1]); c
# simple plot (needs %>% and gather() from tidyr, plus ggplot2)
library(tidyr)
library(ggplot2)
dat <- c%>%
data.frame(date=time(.), .) %>%
gather(variable, value, -date)
ggplot(data=dat, aes(x=date, y=value, group=variable, color=variable)) +
geom_line() +
geom_point() +
theme(legend.position=c(.7, .4)) +
geom_point(data = subset(dat,variable == "z_a"), colour="red", shape=1, size=7)
This is a clean, effective suggestion. However, the initial challenge I have with Approach 1 is that it has the potential to result in jump-offs between Q4 and Q1 (e.g. 2001Q1 relative to the prior quarter as shown in the plot). These would imply fast growth in a single quarter. Part of the solution may be to convert from annual to monthly, using the annual value for June, then spline, then apply Approach 1 as proposed by G. Grothendieck, and then convert to quarterly.
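To put a number on the jump-off concern, here is a minimal sketch that compares quarter-on-quarter changes before and after the adjustment. The values are copied from the Approach 1 output shown in the answer below; the names d_raw, d_adj and boundary are only illustrative.

```r
# Values copied from the Approach 1 output in this thread
z_q     <- c(100.0000, 103.4434, 106.4080, 108.6844,
             110.0000, 110.5723, 110.8719, 110.9840,
             111.0000, 111.0150, 111.1219, 111.4184)
z_q_adj <- c( 95.36604,  98.80946, 101.77405, 104.05046,
             109.39295, 109.96527, 110.26484, 110.37694,
             110.86116, 110.87615, 110.98311, 111.27958)

# Quarter-on-quarter changes of the raw spline and the adjusted series
d_raw <- diff(z_q)
d_adj <- diff(z_q_adj)

# Elements 4 and 8 of the differenced series are the Q4 -> Q1 transitions
boundary <- c(4, 8)
jumps <- data.frame(transition = c("2000Q4 to 2001Q1", "2001Q4 to 2002Q1"),
                    spline     = d_raw[boundary],
                    adjusted   = d_adj[boundary])
print(jumps)
```

Because the adjustment shifts every quarter of a year by the same constant, the within-year changes are unchanged and the whole correction lands on the Q4 to Q1 transitions.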
Other research:
- I've reviewed the documentation for zoo and searched extensively through frequency conversion discussions in R. Maybe there is an argument in na.approx or na.spline that I'm overlooking?
- I've looked at the cobs package ("COnstrained B-Splines"). Maybe it would work, but the option to constrain values to average to a particular series is not readily apparent to me. I'm willing to invest more time to learn how to use it, if it's the best approach.
- Related questions include:
- https://stackoverflow.com/questions/26888433/spline-constraint
- https://stackoverflow.com/questions/32577348/interpolating-annual-data-to-quarterly-with-tidyr
- I am familiar with EViews, the econometric software, which offers such low-to-high frequency conversion with a “Quadratic-match average” setting that accomplishes the desired result.
Answer 1:
A bit late here, but the tempdisagg package does what you want. It ensures that either the sum, the average, the first or the last value of the resulting high frequency series is consistent with the low frequency series.
It also allows you to use external indicator series, e.g., via the Chow-Lin technique. If you don't have an indicator series, the Denton-Cholette method produces a better result than the method in EViews.
Here's your example:
# need ts object as input
z_a <- ts(c(100, 110, 111), start = 2000)
library(tempdisagg)
z_q <- predict(td(z_a ~ 1, method = "denton-cholette", conversion = "average"))
z_q
#           Qtr1      Qtr2      Qtr3      Qtr4
# 2000  97.65795  98.59477 100.46841 103.27887
# 2001 107.02614 109.71460 111.34423 111.91503
# 2002 111.42702 111.06100 110.81699 110.69499
# which has the same means as your original series:
tapply(z_q, floor(time(z_q)), mean)
# 2000 2001 2002
#  100  110  111
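The other consistency options mentioned above (sum, first, last) can be checked the same way. The following is a rough sketch, assuming tempdisagg is installed and that the conversion argument accepts "sum", "average", "first" and "last" as described in this answer; the round trip back to annual uses base R's aggregate() for ts objects, and agg_fun and check are illustrative names.

```r
library(tempdisagg)

z_a <- ts(c(100, 110, 111), start = 2000)

# Aggregation function matching each conversion type (illustrative helper list)
agg_fun <- list(
  sum     = sum,
  average = mean,
  first   = function(x) x[1],
  last    = function(x) x[length(x)]
)

# For each conversion: disaggregate to quarterly, aggregate back, compare
check <- sapply(names(agg_fun), function(conv) {
  z_q  <- predict(td(z_a ~ 1, method = "denton-cholette", conversion = conv))
  back <- aggregate(z_q, nfrequency = 1, FUN = agg_fun[[conv]])
  isTRUE(all.equal(as.numeric(back), as.numeric(z_a), tolerance = 1e-6))
})
check
```

If the assumptions hold, every element of check should be TRUE, meaning each quarterly series reproduces the annual input under its own aggregation rule.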
Answer 2:
We could manipulate the output of na.spline to ensure that it averages to the annual values, either by shifting all 4 quarters' values or by shifting only the last 3 quarters' values. In the first case we subtract the mean of the 4 quarters from each quarter and then add the annual value to each quarter. In the second case we subtract the mean of the last 3 quarters from the last 3 quarters and add the annual value.
In each case, averaging the z_q_adj values over the four quarters of a year will recover the original annual value.
Here are the two approaches mentioned:
# 1
yr <- format(time(c), "%Y")
c$z_q_adj <- ave(coredata(c$z_q), yr, FUN = function(x) x - mean(x) + x[1])
giving:
> c
           z_a      z_q   z_q_adj
2000-01-01 100 100.0000  95.36604
2000-04-01  NA 103.4434  98.80946
2000-07-01  NA 106.4080 101.77405
2000-10-01  NA 108.6844 104.05046
2001-01-01 110 110.0000 109.39295
2001-04-01  NA 110.5723 109.96527
2001-07-01  NA 110.8719 110.26484
2001-10-01  NA 110.9840 110.37694
2002-01-01 111 111.0000 110.86116
2002-04-01  NA 111.0150 110.87615
2002-07-01  NA 111.1219 110.98311
2002-10-01  NA 111.4184 111.27958
# 2
c$z_q_adj <- ave(coredata(c$z_q), yr, FUN = function(x) c(x[1], x[-1] - mean(x[-1]) +x[1]))
giving:
> c
           z_a      z_q  z_q_adj
2000-01-01 100 100.0000 100.0000
2000-04-01  NA 103.4434  97.2648
2000-07-01  NA 106.4080 100.2294
2000-10-01  NA 108.6844 102.5058
2001-01-01 110 110.0000 110.0000
2001-04-01  NA 110.5723 109.7629
2001-07-01  NA 110.8719 110.0625
2001-10-01  NA 110.9840 110.1746
2002-01-01 111 111.0000 111.0000
2002-04-01  NA 111.0150 110.8299
2002-07-01  NA 111.1219 110.9368
2002-10-01  NA 111.4184 111.2333
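As a quick check of why both adjustments recover the annual value, here is a small base R sketch applied to the four spline values for 2000. The function names shift_all and shift_last3 are made up for illustration; they wrap the same expressions used in #1 and #2.

```r
# Variant 1: shift all four quarters so their mean equals the Q1 (annual) value.
# mean(x - mean(x) + x[1]) = mean(x) - mean(x) + x[1] = x[1]
shift_all <- function(x) x - mean(x) + x[1]

# Variant 2: keep Q1, shift the last three quarters so their mean equals x[1].
# The four-quarter mean is then (x[1] + 3 * x[1]) / 4 = x[1]
shift_last3 <- function(x) c(x[1], x[-1] - mean(x[-1]) + x[1])

# Spline values for 2000, copied from the z_q column above
x <- c(100.0000, 103.4434, 106.4080, 108.6844)

adj1 <- shift_all(x)
adj2 <- shift_last3(x)

mean(adj1)  # equals x[1], the annual value
mean(adj2)  # equals x[1], the annual value
```

The trade-off is visible in the output tables: the first variant moves Q1 away from the annual value, while the second keeps Q1 fixed but pushes Q2 below it (about 97.26 after 100 in 2000).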
ADDED: If you want to know whether a series was interpolated or not, some approaches are:
- add a comment to the series, e.g. comment(c) <- "Originally annual", or
- use a naming convention, e.g. add _a to the series name if it was originally annual: c_a <- c, or
- if it's OK to retain both the z_q and z_q_adj columns, then for series that originated from quarterly data the two columns should be the same, and otherwise not, or
- keep a column for both the original data and the quarterly data
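The comment-based option could look roughly like this. It is only a sketch: was_annual is a hypothetical helper, the two series are made-up examples, and attributes such as comments are not guaranteed to survive every subsetting or merging operation, so it is worth re-checking after transformations.

```r
library(zoo)

qtrs <- as.yearqtr(2000) + 0:3 / 4

# A quarterly series that was interpolated from annual data (made-up values)
z_interp <- zoo(c(95.4, 98.8, 101.8, 104.1), qtrs)
comment(z_interp) <- "Originally annual"

# A series that was quarterly from the start, with no comment attached
z_native <- zoo(c(1, 2, 3, 4), qtrs)

# Hypothetical helper: TRUE only if the series carries the marker comment
was_annual <- function(s) identical(comment(s), "Originally annual")

was_annual(z_interp)
was_annual(z_native)
```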
Answer 3:
Perhaps I'm missing something here, but assuming the annual value always comes from the first quarter, couldn't you just replace mean in your aggregate call with min?
> d <- aggregate(c, as.integer(format(index(c),"%Y")), min, na.rm=TRUE)
> d
     z_a z_q
2000 100 100
2001 110 110
2002 111 111
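Note that min only returns the first-quarter value when the series does not fall below it later in the year, as happens to be the case in the example. A variation that picks the first observation of each year explicitly might look like the sketch below; the series z is made up so that it declines during 2000.

```r
library(zoo)

# Made-up quarterly series that falls during 2000 and rises during 2001
idx <- seq(as.Date("2000-01-01"), by = "quarter", length.out = 8)
z   <- zoo(c(100, 98, 96, 94, 90, 91, 92, 93), idx)
yr  <- as.integer(format(index(z), "%Y"))

# min picks the lowest quarter, which is Q4 in 2000 rather than Q1
by_min   <- aggregate(z, yr, min)

# Taking the first element of each year always returns the Q1 value
by_first <- aggregate(z, yr, function(x) x[1])

merge(by_min, by_first)
```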
Source: https://stackoverflow.com/questions/32764847/convert-from-annual-to-quarterly-data-constrained-to-annual-average