Using Linear Regression for Yearly distributed Time Series Data to get predictions after -N- years

Question

I am stuck on an unusual problem. I have time series data covering the years 2009 to 2018, and I have to answer an odd question using it.

The data sheet contains the energy generation statistics of each Australian state/territory in GWh (gigawatt hours) for the years 2009 to 2018.

It has the following fields:


State: The name of the Australian state or territory.
Fuel_Type: The type of fuel that is consumed.
Category: Whether the fuel is considered renewable or non-renewable.
Years: The years for which energy consumption was recorded.

Problem:

How can I use a linear regression model to predict what percentage of a state X's (say, Victoria's) energy generation will come from a source Y (say, renewable energy sources) in a year Z (say, 2100)?

How am I supposed to use a linear regression model to solve this? The problem is beyond me.

Data is from this link

Answer 1:

I think you first need to decide what your model should look like in the end. You probably want something that relates the dependent variable y (the fraction of renewable energy) to your input features. One of those features should probably be the year, since you are interested in predicting how y changes as the year varies. So a very basic linear model could be y = beta1 * x + beta0, with x being the year, beta1 and beta0 the parameters you want to fit, and y the fraction of renewable energy. This of course ignores the state component, but I think a simple start could be to fit such a model to the state you are interested in. The code for such an approach could look like this:

import matplotlib
matplotlib.use("agg")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import linregress
import numpy as np

def fracRenewable(df):
    # share of the group's total amount that comes from renewable fuels
    renewable = df.loc[df["Category"] == "Renewable fuels", "amount"].sum()
    return renewable / df["amount"].sum()


# load in data

data = pd.read_csv("./energy_data.csv")

# convert data to tidy format and rename columns
molten = (
    pd.melt(data, id_vars=["State", "Fuel_Type", "Category"])
    .rename(columns={"variable": "year", "value": "amount"})
)

# calculate fraction of renewable fuel per year
# (all states are pooled here; filter `molten` on "State" first to fit a single state)
grouped = (
    molten.groupby(["year"])
    .apply(fracRenewable)
    .reset_index()
    .rename(columns={0: "amount"})
)
grouped["year"] = grouped["year"].astype(int)

# >>> grouped
#    year    amount
# 0  2009  0.029338
# 1  2010  0.029207
# 2  2011  0.032219
# 3  2012  0.053738
# 4  2013  0.061332
# 5  2014  0.066198
# 6  2015  0.069404
# 7  2016  0.066531
# 8  2017  0.074625
# 9  2018  0.077445

# fit linear model
slope, intercept, r_value, p_value, std_err = linregress(grouped["year"], grouped["amount"])

# plot result
f, ax = plt.subplots()
sbn.scatterplot(x="year", y="amount", ax=ax, data=grouped)
ax.plot(range(2009, 2030), [i*slope + intercept for i in range(2009, 2030)], color="red")
ax.set_title("Renewable fuels (simple prediction)")
ax.set(ylabel="Fraction renewable fuel")
f.savefig("test11.png", bbox_inches="tight")

This gives you a (very simple) model to predict the fraction of renewable fuels at a given year.
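To get a number for one target year, the fitted line can simply be evaluated at that year. The following is a minimal sketch, assuming the ten yearly fractions printed in the `grouped` table above; it uses `np.polyfit` instead of `linregress` so that it only depends on NumPy, and the helper name `predict_fraction` and the clipping to [0, 1] are my own additions, not part of the code above.

```python
import numpy as np

# Yearly renewable fractions copied from the `grouped` table above
years = np.arange(2009, 2019)
fractions = np.array([0.029338, 0.029207, 0.032219, 0.053738, 0.061332,
                      0.066198, 0.069404, 0.066531, 0.074625, 0.077445])

# Degree-1 least-squares fit; centring the years keeps the fit well conditioned
year_mean = years.mean()
slope, offset = np.polyfit(years - year_mean, fractions, deg=1)


def predict_fraction(year):
    """Extrapolate the fitted line to `year`, clipped to the valid range [0, 1]."""
    raw = slope * (year - year_mean) + offset
    return float(np.clip(raw, 0.0, 1.0))


print(f"2030: {predict_fraction(2030):.1%}")
print(f"2100: {predict_fraction(2100):.1%}")
```

A straight line is unbounded, so far enough into the future it leaves the range 0 to 1, which is why the sketch clips the result. Extrapolating ten data points out to 2100 is a very long reach, so the number is better read as "what the current trend implies" than as a forecast.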

If you want to refine the model further, I think a good start could be to group states together based on how similar they are (either based on prior knowledge or a clustering approach) and then do the predictions on those groups.
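As a starting point for working per state (or per group of states), the fraction can be computed for every state and year and a separate line fitted to each state. This is a rough sketch on made-up numbers in the same tidy layout (State, Category, year, amount); the values, the "Non-renewable fuels" label, and the function names are hypothetical and only there to make the example runnable.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only; NOT taken from the real data set
toy = pd.DataFrame({
    "State": ["Victoria"] * 6 + ["Tasmania"] * 6,
    "Category": ["Renewable fuels", "Non-renewable fuels"] * 6,
    "year": [2016, 2016, 2017, 2017, 2018, 2018] * 2,
    "amount": [10.0, 90.0, 15.0, 85.0, 20.0, 80.0,
               80.0, 20.0, 85.0, 15.0, 90.0, 10.0],
})


def renewable_fraction_by_state(df):
    """Return one row per (State, year) with the renewable share of the total."""
    total = df.groupby(["State", "year"])["amount"].sum()
    renewable = (
        df[df["Category"] == "Renewable fuels"]
        .groupby(["State", "year"])["amount"]
        .sum()
    )
    # State/year pairs without any renewable rows get a share of 0
    share = renewable.reindex(total.index, fill_value=0.0) / total
    return share.rename("fraction").reset_index()


def fit_per_state(fractions):
    """Fit fraction = slope * year + intercept separately for every state."""
    fits = {}
    for state, grp in fractions.groupby("State"):
        slope, intercept = np.polyfit(
            grp["year"].to_numpy(dtype=float), grp["fraction"].to_numpy(), deg=1
        )
        fits[state] = (slope, intercept)
    return fits


fractions = renewable_fraction_by_state(toy)
fits = fit_per_state(fractions)
for state, (slope, intercept) in fits.items():
    print(state, f"slope per year: {slope:.3f}")
```

States with similar slopes and starting levels would be natural candidates for grouping together before fitting a shared model.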

Answer 2:

Yes, you can use linear regression for forecasting. There are several ways to do it. You can

1. fit a line to the training data and extrapolate that fitted line into the future (this is sometimes also called the drift method);
2. reduce the problem to a tabular regression problem by splitting the time series into fixed-length windows, stacking them on top of each other, and then applying linear regression;
3. use other common trend methods.

Here's what (1) and (2) look like with sktime (disclaimer: I'm one of the developers):

# Written against the sktime release that was current when this answer was posted;
# module paths and class names may differ in newer releases.
import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import LinearRegression

y = load_airline()  # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)  

# here I forecast all observations of the test series, 
# in your case you could only select the years you're interested in
fh = np.arange(1, len(y_test) + 1)  

# option 1
forecaster = PolynomialTrendForecaster(degree=1)
forecaster.fit(y_train)
y_pred_1 = forecaster.predict(fh)

# option 2
forecaster = ReducedRegressionForecaster(LinearRegression(), window_length=10)
forecaster.fit(y_train)
y_pred_2 = forecaster.predict(fh)
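To make the windowing idea behind option 2 more concrete, here is a rough NumPy-only sketch of the reduction: sliding windows become the rows of a feature table, the value right after each window is the target, and forecasts are produced one step at a time by feeding predictions back in. It only illustrates the idea and is not a description of how sktime implements it; the function names and the toy series are made up.

```python
import numpy as np


def make_windows(y, window_length):
    """Stack sliding windows as rows of X; the target is the value after each window."""
    y = np.asarray(y, dtype=float)
    n_rows = len(y) - window_length
    X = np.array([y[i:i + window_length] for i in range(n_rows)])
    target = y[window_length:]
    return X, target


def fit_linear(X, target):
    """Ordinary least squares with an intercept (stored as the last coefficient)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef


def forecast(y, coef, steps):
    """Recursive forecast: each prediction is appended and reused as an input."""
    window_length = len(coef) - 1
    history = [float(v) for v in y]
    preds = []
    for _ in range(steps):
        window = np.array(history[-window_length:])
        nxt = float(window @ coef[:-1] + coef[-1])
        preds.append(nxt)
        history.append(nxt)
    return np.array(preds)


# Toy series (the squares 0, 1, 4, ..., 49), used only to exercise the code
series = np.arange(8.0) ** 2
X, target = make_windows(series, window_length=2)
coef = fit_linear(X, target)
preds = forecast(series, coef, steps=3)
print(preds)
```

With only ten yearly observations, as in the question, the window has to be short, because every extra lag removes one training row.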

Source: https://stackoverflow.com/questions/62304927/using-linear-regression-for-yearly-distributed-time-series-data-to-get-predictio

Tags

python

machine-learning

time-series

linear-regression