Lags in R within specific subsets?

丶灬走出姿态 提交于 2019-12-08 08:12:08

问题


Suppose I have the following dataframe:

df <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4))

I'm trying to create a lag for unemployment within each unique state-county combination. I want to end up with this:

df2 <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4),"unemp_lag"=c(NA,4.0,3.6,NA,3.7,6.5))

Now, imagine this situation except with thousands of different county-state combinations and over several years. I tried using the lag function, the zoo.lag function, but I couldn't make it take into account the state-county codes. One possibility is to make a giant for loop, but I think this is too much data (R does not handle for loops well) and I am looking for a cleaner way to do it. Any ideas? Thanks!


回答1:


With data.table:

library(data.table)
setDT(df)[,`:=`(unemp_lag1=shift(unemp,n=1L,fill=NA, type="lag")),by=.(state, county)][]

   yearmonth state county unemp unemp_lag1
1:   2005-01     1      3   4.0         NA
2:   2005-02     1      3   3.6        4.0
3:   2005-03     1      3   1.4        3.6
4:   2005-01     2      3   3.7         NA
5:   2005-02     2      3   6.5        3.7
6:   2005-03     2      3   5.4        6.5



回答2:


Just an old style base R approach:

dsp <- split(df, list(df$state, df$county) )
dsp <- lapply(dsp, function(x) transform(x, unemp_lag =lag(unemp)))
dsp <- unsplit(dsp, list(df$state, df$county))
dsp
yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5

Edit

the lag function I used in my solution is the lag of dplyr (even though I didn't realized it until the BlondedDust comment) and here is a true and real pure base R solution (I hope):

dsp <- split(df, list(df$state, df$county) )
dsp <- lapply(dsp, function(x) transform(x, unemp_lag = c(NA, unemp[1:length(unemp)-1]) ) )
dsp <- unsplit(dsp, list(df$state, df$county))
dsp
  yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5



回答3:


With dplyr:

> library(dplyr)
> df %>% group_by(state, county) %>% mutate(unemp_lag=lag(unemp))
Source: local data frame [6 x 5]
Groups: state, county

   yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5

And with data.table:

> df2 <- as.data.table(df)
> df2[, unemp_lag := c(NA , unemp[-.N]), by=list(state, county)]

   yearmonth state county unemp unemp_lag
1:   2005-01     1      3   4.0        NA
2:   2005-02     1      3   3.6       4.0
3:   2005-03     1      3   1.4       3.6
4:   2005-01     2      3   3.7        NA
5:   2005-02     2      3   6.5       3.7
6:   2005-03     2      3   5.4       6.5


来源:https://stackoverflow.com/questions/31479671/lags-in-r-within-specific-subsets

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!