Question
Suppose I have the following dataframe:
df <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4))
I'm trying to create a lag for unemployment within each unique state-county combination. I want to end up with this:
df2 <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4),"unemp_lag"=c(NA,4.0,3.6,NA,3.7,6.5))
Now imagine this situation with thousands of different county-state combinations spanning several years. I tried the lag function and the zoo.lag function, but I couldn't make either take the state-county codes into account. One possibility is a giant for loop, but I think there is too much data for that (R does not handle for loops well), and I am looking for a cleaner way to do it. Any ideas? Thanks!
Answer 1:
With data.table:
library(data.table)
setDT(df)[,`:=`(unemp_lag1=shift(unemp,n=1L,fill=NA, type="lag")),by=.(state, county)][]
yearmonth state county unemp unemp_lag1
1: 2005-01 1 3 4.0 NA
2: 2005-02 1 3 3.6 4.0
3: 2005-03 1 3 1.4 3.6
4: 2005-01 2 3 3.7 NA
5: 2005-02 2 3 6.5 3.7
6: 2005-03 2 3 5.4 6.5
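If more than one lag is needed, shift also accepts a vector for n and then returns one column per lag. The following is only a sketch on the question's sample data; the column names unemp_lag1 and unemp_lag2 are chosen here for illustration, and it assumes the rows are already in chronological order within each state-county group.

```r
library(data.table)

# Sample data from the question, already sorted by yearmonth within each group
dt <- data.table(
  yearmonth = c("2005-01", "2005-02", "2005-03", "2005-01", "2005-02", "2005-03"),
  state     = c(1, 1, 1, 2, 2, 2),
  county    = c(3, 3, 3, 3, 3, 3),
  unemp     = c(4.0, 3.6, 1.4, 3.7, 6.5, 5.4)
)

# shift() with n = 1:2 returns a list of two vectors (lag 1 and lag 2),
# which := assigns to the two illustrative column names by reference
dt[, c("unemp_lag1", "unemp_lag2") := shift(unemp, n = 1:2, type = "lag"),
   by = .(state, county)]

print(dt)
```

Because the lag is positional, unsorted data should be ordered first, for example with setorder(dt, state, county, yearmonth).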
Answer 2:
Just an old-style base R approach:
dsp <- split(df, list(df$state, df$county) )
dsp <- lapply(dsp, function(x) transform(x, unemp_lag =lag(unemp)))
dsp <- unsplit(dsp, list(df$state, df$county))
dsp
yearmonth state county unemp unemp_lag
1 2005-01 1 3 4.0 NA
2 2005-02 1 3 3.6 4.0
3 2005-03 1 3 1.4 3.6
4 2005-01 2 3 3.7 NA
5 2005-02 2 3 6.5 3.7
6 2005-03 2 3 5.4 6.5
Edit: the lag function I used in my solution is actually the lag of dplyr (even though I didn't realize it until BlondedDust's comment), so here is a truly pure base R solution (I hope):
dsp <- split(df, list(df$state, df$county) )
dsp <- lapply(dsp, function(x) transform(x, unemp_lag = c(NA, head(unemp, -1))))
dsp <- unsplit(dsp, list(df$state, df$county))
dsp
yearmonth state county unemp unemp_lag
1 2005-01 1 3 4.0 NA
2 2005-02 1 3 3.6 4.0
3 2005-03 1 3 1.4 3.6
4 2005-01 2 3 3.7 NA
5 2005-02 2 3 6.5 3.7
6 2005-03 2 3 5.4 6.5
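The split/lapply/unsplit steps above can also be sketched in a single call with base R's ave, which applies a function within each combination of the grouping variables and returns a vector aligned with the original rows. This is a minimal sketch, not a drop-in replacement: lag1 is a helper name made up here, and the rows are assumed to be sorted by yearmonth within each group.

```r
# Sample data from the question
df <- data.frame(
  yearmonth = c("2005-01", "2005-02", "2005-03", "2005-01", "2005-02", "2005-03"),
  state     = c(1, 1, 1, 2, 2, 2),
  county    = c(3, 3, 3, 3, 3, 3),
  unemp     = c(4.0, 3.6, 1.4, 3.7, 6.5, 5.4)
)

# Hypothetical helper: shift a vector down by one, padding the front with NA
lag1 <- function(x) c(NA, head(x, -1))

# ave() applies lag1 separately within each state-county combination
df$unemp_lag <- ave(df$unemp, df$state, df$county, FUN = lag1)

print(df)
```

Note that calling lag on a plain numeric vector without dplyr loaded dispatches to stats::lag, which only changes the time-series attributes and leaves the values unshifted, which is why a helper like the one above is needed in pure base R.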
Answer 3:
With dplyr:
> library(dplyr)
> df %>% group_by(state, county) %>% mutate(unemp_lag=lag(unemp))
Source: local data frame [6 x 5]
Groups: state, county
yearmonth state county unemp unemp_lag
1 2005-01 1 3 4.0 NA
2 2005-02 1 3 3.6 4.0
3 2005-03 1 3 1.4 3.6
4 2005-01 2 3 3.7 NA
5 2005-02 2 3 6.5 3.7
6 2005-03 2 3 5.4 6.5
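The dplyr call above lags by row position, so it relies on each group already being in chronological order. If that is not guaranteed, one option is the order_by argument of dplyr's lag. The sketch below deliberately shuffles the question's sample rows to show the effect; the shuffled data frame df_unsorted is made up for illustration.

```r
library(dplyr)

# Same values as the question's data, but with rows out of chronological order
df_unsorted <- data.frame(
  yearmonth = c("2005-03", "2005-01", "2005-02", "2005-02", "2005-03", "2005-01"),
  state     = c(1, 1, 1, 2, 2, 2),
  county    = c(3, 3, 3, 3, 3, 3),
  unemp     = c(1.4, 4.0, 3.6, 6.5, 5.4, 3.7),
  stringsAsFactors = FALSE
)

# order_by makes lag() use the previous month within each group,
# while the rows themselves stay in their original order
out <- df_unsorted %>%
  group_by(state, county) %>%
  mutate(unemp_lag = lag(unemp, order_by = yearmonth)) %>%
  ungroup()

print(out)
```

Sorting with arrange(state, county, yearmonth) before the mutate would be an equally reasonable choice when the output should also be in chronological order.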
And with data.table:
> df2 <- as.data.table(df)
> df2[, unemp_lag := c(NA, unemp[-.N]), by=list(state, county)][]
yearmonth state county unemp unemp_lag
1: 2005-01 1 3 4.0 NA
2: 2005-02 1 3 3.6 4.0
3: 2005-03 1 3 1.4 3.6
4: 2005-01 2 3 3.7 NA
5: 2005-02 2 3 6.5 3.7
6: 2005-03 2 3 5.4 6.5
Source: https://stackoverflow.com/questions/31479671/lags-in-r-within-specific-subsets