Question
I have a data set that has a list of IDs, year, and income. I am trying to interpolate the yearly values to quarterly values.
id = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
year = c(2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2002)
income = c(20, 24, 26, 30, 34, 36, 40, 46, 48, 53, 56)
df = data.frame(id, year, income)
For example, I want interpolated income values for each year-quarter: 2000Q1, 2000Q2, 2000Q3, 2000Q4, 2001Q1, ..., 2001Q4. The resulting data frame would have the columns id, year-quarter, and income, where income holds the interpolated values.
I realize that with linear interpolation, the trend must be computed separately within each ID. Any suggestions on how to do this interpolation in R?
Answer 1:
Here's an example using dplyr:
library(dplyr)

annual_data <- data.frame(
  person = c(1, 1, 1, 2, 2),
  year = c(2010, 2011, 2012, 2010, 2012),
  y = c(1, 2, 3, 1, 3)
)

# Build a full year-by-quarter grid for one person and attach the annual
# values to quarter 1 of their year; all other quarters get NA.
expand_data <- function(x) {
  years <- min(x$year):max(x$year)
  quarters <- 1:4
  grid <- expand.grid(quarter = quarters, year = years)
  x$quarter <- 1
  merged <- grid %>% left_join(x, by = c('year', 'quarter'))
  merged$person <- x$person[1]
  return(merged)
}

# Linearly interpolate y over the row index, using only the non-missing points.
interpolate_data <- function(data) {
  xout <- 1:nrow(data)
  y <- data$y
  interpolation <- approx(x = xout[!is.na(y)], y = y[!is.na(y)], xout = xout)
  data$yhat <- interpolation$y
  return(data)
}

expand_and_interpolate <- function(x) interpolate_data(expand_data(x))

# Apply the expansion and interpolation separately to each person.
quarterly_data <- annual_data %>% group_by(person) %>% do(expand_and_interpolate(.))
print(as.data.frame(quarterly_data))
The output from this approach is:
   quarter year person  y yhat
1        1 2010      1  1 1.00
2        2 2010      1 NA 1.25
3        3 2010      1 NA 1.50
4        4 2010      1 NA 1.75
5        1 2011      1  2 2.00
6        2 2011      1 NA 2.25
7        3 2011      1 NA 2.50
8        4 2011      1 NA 2.75
9        1 2012      1  3 3.00
10       2 2012      1 NA   NA
11       3 2012      1 NA   NA
12       4 2012      1 NA   NA
13       1 2010      2  1 1.00
14       2 2010      2 NA 1.25
15       3 2010      2 NA 1.50
16       4 2010      2 NA 1.75
17       1 2011      2 NA 2.00
18       2 2011      2 NA 2.25
19       3 2011      2 NA 2.50
20       4 2011      2 NA 2.75
21       1 2012      2  3 3.00
22       2 2012      2 NA   NA
23       3 2012      2 NA   NA
24       4 2012      2 NA   NA
There are probably a bunch of ways to clean this up. The key functions being used are expand.grid, approx, and dplyr::group_by. The approx function is a little tricky. Looking at the implementation of zoo::na.approx.default was quite helpful in figuring out how to work with approx.
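One part of approx that can be surprising is how it treats points outside the range of the known values, which is why the last three quarters of 2012 come back as NA in the output above. The sketch below is not part of the original answer; it uses a made-up vector with the same shape as one person's data, and the names pos, known, fit_default, and fit_carry are only illustrative. It compares the default rule = 1 with rule = 2, which repeats the nearest known value.

```r
# Annual values sit at positions 1, 5 and 9 (Q1 of three successive years);
# the remaining quarters are unknown.
y <- c(1, NA, NA, NA, 2, NA, NA, NA, 3, NA, NA, NA)
pos <- seq_along(y)
known <- !is.na(y)

# Default (rule = 1): positions beyond the last known point are returned as NA.
fit_default <- approx(x = pos[known], y = y[known], xout = pos)$y

# rule = 2: positions beyond the last known point take the nearest known value.
fit_carry <- approx(x = pos[known], y = y[known], xout = pos, rule = 2)$y

print(fit_default)
print(fit_carry)
```

Whether carrying the last value forward is appropriate depends on the data; leaving those quarters as NA may be the more honest choice when there is no later observation to interpolate toward.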
Answer 2:
I like to use this convention to split a dataframe into subsets (unique values of 'id' in your case), apply a function to each subset, then put the data frame back together.
df2 <- do.call("rbind", lapply(split(df, df$id), function(df_subset) {
  # The operations inside these braces are applied to one subset data frame at a time,
  # equivalent to 'subset(df, id == x)' where x is each unique value of id.
  return(df_subset)  # returns df_subset unchanged; alter it here in any way you need
}))
There are a few ways to do linear interpolation, but I personally default to using na.approx() from the 'zoo' package. You'll need to add rows representing each quarter to your dataframe, with NA for their income value. Then na.approx will fill them in with an interpolated value, as in df_subset$income_interpolated <- na.approx(df_subset$income).
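Putting the two steps together for the data in the question might look like the sketch below. It assumes the zoo package is installed, and it makes one modelling assumption the question does not settle: each annual figure is placed in Q1 of its year. The grid, out, and year_quarter names are illustrative only. Note the na.rm = FALSE argument: by default na.approx drops leading and trailing NAs, which would make the result shorter than the data frame.

```r
library(zoo)  # assumed to be installed; provides na.approx()

df <- data.frame(
  id     = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5),
  year   = c(2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2002),
  income = c(20, 24, 26, 30, 34, 36, 40, 46, 48, 53, 56)
)

df2 <- do.call("rbind", lapply(split(df, df$id), function(df_subset) {
  # One row per quarter between the first and last observed year for this id
  grid <- expand.grid(quarter = 1:4,
                      year = min(df_subset$year):max(df_subset$year))
  grid$id <- df_subset$id[1]
  # Assumption: the annual value belongs to Q1 of its year
  df_subset$quarter <- 1
  out <- merge(grid, df_subset, by = c("id", "year", "quarter"), all.x = TRUE)
  out <- out[order(out$year, out$quarter), ]
  # na.rm = FALSE keeps trailing NAs so there is exactly one value per row
  out$income_interpolated <- na.approx(out$income, na.rm = FALSE)
  # Label in the form the question asks for, e.g. "2000Q1"
  out$year_quarter <- paste0(out$year, "Q", out$quarter)
  out
}))
rownames(df2) <- NULL

print(df2[df2$id == 5, c("id", "year_quarter", "income_interpolated")])
```

For id 5, which has no 2001 observation, the interpolation runs straight from 2000Q1 to 2002Q1. The quarters after each id's last observation stay NA because there is no later point to interpolate toward.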
Source: https://stackoverflow.com/questions/32320727/interpolating-in-r-yearly-time-series-data-with-quarterly-values