Question
I am stuck on an unusual problem. I have time series data covering the years 2009 to 2018, and I need to answer an unusual question using it.
The data sheet contains the energy generation statistics of each Australian state/territory in GWh (gigawatt hours) for the years 2009 to 2018.
It has the following fields:
State: Names of different Australian states.
Fuel_Type: The type of fuel which is consumed.
Category: Indicates whether a fuel is considered renewable or non-renewable.
Years: The years for which the energy figures are recorded.
Problem:
How can I use a linear regression model to predict what percentage of a state X's (say, Victoria's) energy generation will come from source Y (say, renewable energy sources) in year Z (suppose 2100)?
How am I supposed to use a linear regression model to solve this problem? It is beyond my reach.
Data is from this link
Answer 1:
I think you first need to decide what your model should look like in the end: you probably want something that relates the dependent variable y (the fraction of renewable energy) to your input features. One of those features should be the year, since you are interested in predicting how y changes as this quantity varies. So a very basic linear model could be y = beta1 * x + beta0, with x being the year, beta1 and beta0 being the parameters you want to fit, and y being the fraction of renewable energy. This of course ignores the state component, but I think a simple start could be to fit such a model to the state you are interested in. The code for such an approach could look like this:
import matplotlib
matplotlib.use("agg")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import linregress
import numpy as np
def fracRenewable(df):
    # share of the total amount that comes from renewable fuels
    return np.sum(df.loc[df["Category"] == "Renewable fuels", "amount"]) / np.sum(df["amount"])
# load in data
data = pd.read_csv("./energy_data.csv")
# convert data to tidy format and rename columns
molten = (
    pd.melt(data, id_vars=["State", "Fuel_Type", "Category"])
    .rename(columns={"variable": "year", "value": "amount"})
)
# calculate fraction of renewable fuel per year
# (this aggregates over all states; filter molten by "State" first to model a single state)
grouped = (
    molten.groupby(["year"]).apply(fracRenewable)
    .reset_index()
    .rename(columns={0: "amount"})
)
grouped["year"] = grouped["year"].astype(int)
# >>> grouped
# year amount
# 0 2009 0.029338
# 1 2010 0.029207
# 2 2011 0.032219
# 3 2012 0.053738
# 4 2013 0.061332
# 5 2014 0.066198
# 6 2015 0.069404
# 7 2016 0.066531
# 8 2017 0.074625
# 9 2018 0.077445
# fit linear model
slope, intercept, r_value, p_value, std_err = linregress(grouped["year"], grouped["amount"])
# plot result
f, ax = plt.subplots()
sbn.scatterplot(x="year", y="amount", ax=ax, data=grouped)
ax.plot(range(2009, 2030), [i*slope + intercept for i in range(2009, 2030)], color="red")
ax.set_title("Renewable fuels (simple prediction)")
ax.set(ylabel="Fraction renewable fuel")
f.savefig("test11.png", bbox_inches="tight")
This gives you a (very simple) model to predict the fraction of renewable fuels at a given year.
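To make the extrapolation step concrete, here is a minimal sketch that refits the line to the yearly fractions printed in the code above and evaluates it at the year 2100. It uses a closed-form least-squares fit in plain Python rather than linregress. The helper names fit_line and predict_fraction are made up for this illustration, and clipping the result to [0, 1] is an added assumption, since a straight line will eventually leave the valid range for a fraction.

```python
# Yearly renewable fractions copied from the `grouped` output printed above.
years = list(range(2009, 2019))
fractions = [0.029338, 0.029207, 0.032219, 0.053738, 0.061332,
             0.066198, 0.069404, 0.066531, 0.074625, 0.077445]


def fit_line(x, y):
    """Ordinary least squares for y = beta1 * x + beta0 (closed form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta1, beta0


def predict_fraction(year, beta1, beta0):
    """Evaluate the fitted line and clip to [0, 1] (assumption: y is a fraction)."""
    return min(1.0, max(0.0, beta1 * year + beta0))


beta1, beta0 = fit_line(years, fractions)
prediction_2100 = predict_fraction(2100, beta1, beta0)
print(f"slope per year: {beta1:.5f}")
print(f"predicted renewable fraction in 2100: {prediction_2100:.3f}")
```

With only ten data points, a value extrapolated more than 80 years ahead is better read as "where the current trend would lead" than as a forecast.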
If you want to refine the model further, I think a good start could be to group states together based on how similar they are (either based on prior knowledge or a clustering approach) and then do the predictions on those groups.
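Before grouping similar states, the same line can be fitted separately for each state, which is closer to the "state X, source Y, year Z" form of the question. The following is only a sketch under stated assumptions: the table is a tiny hypothetical stand-in for the real data (the GWh values and the "Non-renewable fuels" label are invented), and predict_state is a made-up helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format table with the same columns as the question's data;
# the GWh values are invented purely for illustration.
data = pd.DataFrame({
    "State": ["VIC", "VIC", "NSW", "NSW"],
    "Fuel_Type": ["Wind", "Brown coal", "Solar PV", "Black coal"],
    "Category": ["Renewable fuels", "Non-renewable fuels"] * 2,
    "2016": [10.0, 90.0, 5.0, 95.0],
    "2017": [12.0, 88.0, 6.0, 94.0],
    "2018": [14.0, 86.0, 7.0, 93.0],
})

# Tidy format: one row per (state, fuel type, year).
molten = (
    data.melt(id_vars=["State", "Fuel_Type", "Category"],
              var_name="year", value_name="amount")
    .assign(year=lambda d: d["year"].astype(int))
)

# Total and renewable generation per state and year, then their ratio.
total = molten.groupby(["State", "year"])["amount"].sum()
renewable = (
    molten[molten["Category"] == "Renewable fuels"]
    .groupby(["State", "year"])["amount"].sum()
)
fraction = (renewable / total).fillna(0.0).rename("fraction").reset_index()


def predict_state(fraction, state, target_year):
    """Fit fraction = beta1 * year + beta0 for one state and extrapolate."""
    sub = fraction[fraction["State"] == state]
    beta1, beta0 = np.polyfit(sub["year"], sub["fraction"], deg=1)
    # Clip because a fraction cannot leave [0, 1] (an added assumption).
    return float(np.clip(beta1 * target_year + beta0, 0.0, 1.0))


print(predict_state(fraction, "VIC", 2020))
```

Swapping the equality filter in predict_state for a membership test on a list of states would give the grouped variant described above.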
Answer 2:
Yes, you can use linear regression for forecasting, and there are different ways to do so. You can
1. fit a line to the training data and extrapolate that fitted line into the future (this is sometimes also called the drift method);
2. reduce the problem to a tabular regression problem by splitting the time series into fixed-length windows, stacking them on top of each other, and then applying linear regression;
3. use other common trend methods.
Here's what (1) and (2) look like with sktime (disclaimer: I'm one of the developers):
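# note: some names imported below (e.g. smape_loss, plot_ys, ReducedRegressionForecaster)
# come from an older sktime release and may have been renamed or removed in newer versions
import numpy as np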
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.performance_metrics.forecasting import smape_loss
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.utils.plotting.forecasting import plot_ys
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import LinearRegression
y = load_airline() # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)
# here I forecast all observations of the test series,
# in your case you could only select the years you're interested in
fh = np.arange(1, len(y_test) + 1)
# option 1
forecaster = PolynomialTrendForecaster(degree=1)
forecaster.fit(y_train)
y_pred_1 = forecaster.predict(fh)
# option 2
forecaster = ReducedRegressionForecaster(LinearRegression(), window_length=10)
forecaster.fit(y_train)
y_pred_2 = forecaster.predict(fh)
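To show what the windowing reduction in option 2 does, here is a hand-rolled sketch in plain NumPy. It is not how sktime implements it; the function names make_windows, fit_reduced and forecast are made up for this example, and the toy series y_t = t^2 is hypothetical, chosen only because it satisfies y_t = 2*y_(t-1) - y_(t-2) + 2 exactly, so a window of length 2 can reproduce it.

```python
import numpy as np


def make_windows(y, window_length):
    """Stack sliding windows as rows of X; the value after each window is the target."""
    X = np.array([y[i:i + window_length] for i in range(len(y) - window_length)])
    target = y[window_length:]
    return X, target


def fit_reduced(y, window_length):
    """Least-squares fit of the next value on the previous window plus an intercept."""
    X, target = make_windows(y, window_length)
    A = np.column_stack([X, np.ones(len(X))])  # last column is the intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef


def forecast(y, coef, steps):
    """Recursive forecast: each prediction is appended and fed back as input."""
    window_length = len(coef) - 1
    history = list(y[-window_length:])
    preds = []
    for _ in range(steps):
        nxt = float(np.dot(coef[:-1], history[-window_length:]) + coef[-1])
        preds.append(nxt)
        history.append(nxt)
    return preds


# Toy series (hypothetical): y_t = t^2 for t = 0..11.
y = np.arange(12, dtype=float) ** 2
coef = fit_reduced(y, window_length=2)
print(coef)
print(forecast(y, coef, steps=3))
```

Unlike option 1, this model is linear in the lagged values rather than in time, which is why it can follow a curved series like the toy one here.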
Source: https://stackoverflow.com/questions/62304927/using-linear-regression-for-yearly-distributed-time-series-data-to-get-predictio