Simulating a timeseries in dplyr instead of using a for loop

问题

So, while lag and lead in dplyr are great, I want to simulate a timeseries of something like population growth. My old school code would look something like:

tdf <- data.frame(time=1:5, pop=50)
for(i in 2:5){
  tdf$pop[i] = 1.1*tdf$pop[i-1]
}

which produces

  time    pop
1    1 50.000
2    2 55.000
3    3 60.500
4    4 66.550
5    5 73.205

I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).

But, something like

tdf <- data.frame(time=1:5, pop=50) %>%
  mutate(pop = 1.1*lag(pop))

which would have been my first guess just produces

I feel like I'm missing something obvious.... what is it?

Note - this is a trivial example - my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so, the tidyverse is proving to be a powerful tool in bringing my simulations together.

回答1:

Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:

data.frame(time = 1:5, pop = 50) %>%
    mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))

##   time    pop
## 1    1 50.000
## 2    2 55.000
## 3    3 60.500
## 4    4 66.550
## 5    5 73.205

or with purrr,

data.frame(time = 1:5, pop = 50) %>%
    mutate(pop = accumulate(pop, ~.x * 1.1))

##   time    pop
## 1    1 50.000
## 2    2 55.000
## 3    3 60.500
## 4    4 66.550
## 5    5 73.205

回答2:

If the starting value of pop is, say, 50, then pop = 50 * 1.1^(0:4) will give you the next four values. With your code, you could do:

data.frame(time=1:5, pop=50) %>%
  mutate(pop = pop * 1.1^(1:n() - 1))

Or,

base = 50

data.frame(time=1:5) %>%
  mutate(pop = base * 1.1^(1:n()-1))

回答3:

Purrr's accumulate function can handle time-varying indices, if you pass them to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick here is that accumulate() can work on list as well as vector columns. You can use the tidyr function nest() to group columns into a list vector containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo, simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this to simulate multiple replicates for a given model using dpylr+purrr+tidyr.

library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)

# Declare the population growth function. Note: the first two arguments
# have to be .x (the prior vector of populations and parameters) and .y,
# the current parameter value and population vector. 
# This example function is a Ricker population growth model. 
logistic_growth = function(.x, .y, growth, comp) {
  pop = .x$pop[1]
  growth = .y$growth[1]
  comp  = .y$comp[1]
  # Note: this uses the state from .x, and the parameter values from .y.
  # The first observation will use the first entry in the vector for .x and .y
  new_pop = pop*exp(growth - pop*comp)
  .y$pop[1] = new_pop
  return(.y)
}

# Starting parameters the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps  = 100
pop_init = 1
growth = 0.5
comp = 0.05

#First test: fixed growth rates
test1 = data_frame(time = 1:n_steps,pop = pop_init, 
                   growth=growth,comp =comp)


# here, the combination of nest() and group_by() split the data into individual 
# time points and then groups all parameters into a new vector called state.
# ungroup() removes the grouping structure, then accumulate runs the function
#on the vector of states. Finally unnest transforms it all back to a
#data frame
out1 = test1 %>%
  group_by(time)%>%
  nest(pop, growth, comp,.key = state)%>%
  ungroup()%>%
  mutate(
    state = accumulate(state,logistic_growth))%>%
  unnest()

# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = data_frame(time = 1:n_steps,pop = pop_init, 
                   growth=rnorm(n_steps, growth,0.1),comp=comp)

out2 = test2 %>%
  group_by(time)%>%
  nest(pop, growth, comp,.key = state)%>%
  ungroup()%>%
  mutate(
    state = accumulate(state,logistic_growth))%>%
  unnest()

# This demostrates how to use this approach to simulate replicates using dplyr
# Note the crossing function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps,pop = pop_init, comp=comp) %>%
  mutate(growth=rnorm(n_steps*10, growth,0.1))

out3 = test3 %>%
  group_by(rep)%>%
  group_by(rep,time)%>%
  nest(pop, growth, comp,.key = state)%>%
  group_by(rep)%>%
  mutate(
    state = accumulate(state,logistic_growth))%>%
  unnest()

print(qplot(time, pop, data=out1)+
        geom_line() +
        geom_point(data= out2, col="red")+
        geom_line(data=out2, col="red")+
        geom_point(data=out3, col="red", alpha=0.1)+
        geom_line(data=out3, col="red", alpha=0.1,aes(group=rep)))

回答4:

What about the map functions, i.e.

tdf <- data_frame(time=1:5)
tdf %>% mutate(pop = map_dbl(.x = tdf$time, .f = (function(x) 50*1.1^x)))

回答5:

The problem here is that dplyr is running this as a set of vector operations rather than evaluating the term one at a time. Here, 1.1*lag(pop) is being interpreted as "calculate the lagged values for all of pop, then multiple them all by 1.1". Since you set pop=50 lagged values for all the steps were 50.

dplyr does have some helper functions for sequential evaluation; the standard function cumsum, cumprod, etc. work, and a few new ones (see ?cummean) all work within dplyr. In your example, you could simulate the model with:

tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1,times=4)) %>%
    mutate(pop = pop*cumprod(growth_rate))


time    pop     growth_rate
1       50.000  1.0
2       55.000  1.1
3       60.500  1.1
4       66.550  1.1
5       73.205  1.1

Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:

tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
    mutate(pop = pop*cumprod(lead(growth_rate,default=1))

This makes it explicit that the growth rate column refers to the rate of growth in the current time step from the previous one.

There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.

来源：https://stackoverflow.com/questions/40095463/simulating-a-timeseries-in-dplyr-instead-of-using-a-for-loop

标签

dplyr

tidyverse