I am trying to predict a time series in python statsmodels ARIMA package with the inclusion of an exogenous variable, but cannot figure out the correct way to insert the exogenous variable in the predict step. See here for docs.
import numpy as np from scipy import stats import pandas as pd import statsmodels.api as sm vals = np.random.rand(13) ts = pd.TimeSeries(vals) df = pd.DataFrame(ts, columns=["test"]) df.index = pd.Index(pd.date_range("2011/01/01", periods = len(vals), freq = 'Q')) fit1 = sm.tsa.ARIMA(df, (1,0,0)).fit() #this works fine: pred1 = fit1.predict(start=12, end = 16) print(pred1) Out[32]: 2014-03-31 0.589121 2014-06-30 0.747575 2014-09-30 0.631322 2014-12-31 0.654858 2015-03-31 0.650093 Freq: Q-DEC, dtype: float64
now add in a trend exogenous variable
exogx = np.array(range(1,14)) #to make this easy, let's look at the ols of the trend (arima(0,0,0)) fit2 = sm.tsa.ARIMA(df, (0,0,0),exog = exogx).fit() print(fit2.params) const 0.555226 x1 0.013132 dtype: float64 print(fit2.fittedvalues) 2011-03-31 0.568358 2011-06-30 0.581490 2011-09-30 0.594622 2011-12-31 0.607754 2012-03-31 0.620886 2012-06-30 0.634018 2012-09-30 0.647150 2012-12-31 0.660282 2013-03-31 0.673414 2013-06-30 0.686546 2013-09-30 0.699678 2013-12-31 0.712810 2014-03-31 0.725942 Freq: Q-DEC, dtype: float64
Notice, as we would expect, this is a trend line, increasing 0.013132 with each increase tick in time (of course this is random data, so if you run it the values will be different, but the positive or negative trend story will be the same). So, the next value (for time = 14) should be 0.555226 + 0.013132*14 = 0.739074.
#out of sample exog should be (14,15,16) pred2 = fit2.predict(start = 12, end = 16, exog = np.array(range(13,17))) print(pred2) 2014-03-31 0.725942 2014-06-30 0.568358 2014-09-30 0.581490 2014-12-31 0.594622 2015-03-31 0.765338 Freq: Q-DEC, dtype: float64
So, 2014-03-31 predicts (the last insample) correctly, but 2014-06-30 starts back at the beginning (t = 1), but notice 2015-03-31 (actually, always the last observation of the forecast, regardless of horizon) picks up t = 16 (that is, (value - intercept)/beta = (0.765338 - 0.555226)/0.013132).
To make this more clear, notice what happens when I inflate the values of of the x mat
fit2.predict(start = 12, end = 16, exog = np.array(range(13,17))*10000) Out[41]: 2014-03-31 0.725942 2014-06-30 0.568358 2014-09-30 0.581490 2014-12-31 0.594622 2015-03-31 2101.680532 Freq: Q-DEC, dtype: float64
See that 2015-03-31 exploded, but none of the other xmat values were considered? What am I doing wrong here???
I have tried playing around with every way that I know how for passing in the exog variable (changing dimension, making the exog a matrix, making the exog as long as input plus the horizon, etc, etc, etc). Any suggestions would be really appreciated.
I am using 2.7 from Anaconda2.1 numpy 1.8.1 scipy 0.14.0 pandas 0.14.0 statsmodels 0.5.0
and have verified the issue on windows 7 64 bit, and centos 64 bit.
Also, a few things. I am using ARIMA for the ARIMA functionality and the above is just for illustration (that is, I cannot "just use OLS...", as I imagine will be suggested). I also cannot "just use R" due to the restrictions of the project (and more generally, the lack of support of R in base Spark).
Here are the interesting parts of the code all together in case you want to try it yourself
import numpy as np from scipy import stats import pandas as pd import statsmodels.api as sm vals = np.random.rand(13) ts = pd.TimeSeries(vals) df = pd.DataFrame(ts, columns=["test"]) df.index = pd.Index(pd.date_range("2011/01/01", periods = len(vals), freq = 'Q')) exogx = np.array(range(1,14)) fit2 = sm.tsa.ARIMA(df, (0,0,0),exog = exogx).fit() print(fit2.fittedvalues) pred2 = fit2.predict(start = 12, end = 16, exog = np.array(range(13,17))*10000) print(pred2)