How to predict correctly in sklearn RandomForestRegressor?

Submitted by 天大地大妈咪最大 on 2020-01-06 04:54:06

Question


I'm working on a big data project for school. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv

I'm trying to predict the next values of "LandAverageTemperature".

First, I imported the csv into pandas as a DataFrame named "df1".

After getting errors on my first attempts with sklearn, I converted the "dt" column from string to datetime64, then added a column named "year" that holds only the year of each date (this is probably wrong):

df1["year"] = pd.DatetimeIndex(df1['dt']).year

After all of that, I prepared my data for regression and called RandomForestRegressor:

landAvg = df1[["LandAverageTemperature"]]
year = df1[["year"]]

from sklearn.ensemble import RandomForestRegressor

rf_reg=RandomForestRegressor(n_estimators=10,random_state=0)
rf_reg.fit(year,landAvg.values.ravel())
print("Random forest:",rf_reg.predict(landAvg))

I ran the code and got this result:

Random forest: [9.26558115 9.26558115 9.26558115 ... 9.26558115 9.26558115 9.26558115]

I'm not getting any errors, but I don't think the results are correct (they are all the same, as you can see). Besides, I don't know how to get predictions for the next 10 years; I just get 1 result with this code. Can you help me improve my code and get the right results? Thanks in advance for your help.


Answer 1:


It's not enough to use only the year to predict temperature. You need to use the month too. Here is a working example for starters:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Load only the two columns we need and parse 'dt' as datetime
df = pd.read_csv('https://raw.githubusercontent.com/gindeleo/climate/master/GlobalTemperatures.csv', usecols=['dt','LandAverageTemperature'], parse_dates=['dt'])
# Drop rows where the temperature is missing
df = df.dropna()
# Derive numeric features from the date
df["year"] = df['dt'].dt.year
df["month"] = df['dt'].dt.month
X = df[["month", "year"]]
y = df["LandAverageTemperature"]
rf_reg=RandomForestRegressor(n_estimators=10,random_state=0)
rf_reg.fit(X, y)
# Predict from the features X (not from the target column).
# These are in-sample predictions on the training data.
y_pred = rf_reg.predict(X)
df_result = pd.DataFrame({'year': X['year'], 'month': X['month'], 'true': y, 'pred': y_pred})
print('True values and predictions')
print(df_result)
print('Feature importances', list(zip(X.columns, rf_reg.feature_importances_)))

And here is the output:

True values and predictions
      year  month    true     pred
0     1750      1   3.034   2.2944
1     1750      2   3.083   2.4222
2     1750      3   5.626   5.6434
3     1750      4   8.490   8.3419
4     1750      5  11.573  11.7569
...    ...    ...     ...      ...
3187  2015      8  14.755  14.8004
3188  2015      9  12.999  13.0392
3189  2015     10  10.801  10.7068
3190  2015     11   7.433   7.1173
3191  2015     12   5.518   5.1634

[3180 rows x 4 columns]
Feature importances [('month', 0.9543059863177156), ('year', 0.045694013682284394)]
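
Two points from the question are worth a quick sketch: why the original code printed the same number everywhere, and how to ask the model about the next 10 years. The snippet below uses a small synthetic dataset instead of GlobalTemperatures.csv, so the data and the names year_only, temps_as_years and future are all assumptions made for illustration. In the original code, predict() was given the temperature column, so the model read values like 3.0 or 14.7 as years; every one of them is far below the earliest training year, so each tree sends them to the same leaf. For future dates, build a frame with the same feature columns used in fit().

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real CSV (an assumption, not the actual data):
# a seasonal cycle, a slow warming trend, and a little noise.
rng = np.random.default_rng(0)
dates = pd.date_range("1950-01-01", "2015-12-01", freq="MS")
df = pd.DataFrame({"year": dates.year, "month": dates.month})
df["LandAverageTemperature"] = (
    8.5
    - 6.0 * np.cos(2 * np.pi * (df["month"] - 1) / 12)  # coldest in January
    + 0.01 * (df["year"] - 1950)                        # slow trend
    + rng.normal(0.0, 0.3, len(df))                     # noise
)

# 1) Reproduce the problem: fit on year only, then predict from temperatures.
year_only = RandomForestRegressor(n_estimators=10, random_state=0)
year_only.fit(df[["year"]].to_numpy(), df["LandAverageTemperature"])
temps_as_years = df[["LandAverageTemperature"]].to_numpy()  # wrong input on purpose
constant_pred = year_only.predict(temps_as_years)
print("distinct predictions:", np.unique(constant_pred).size)

# 2) Fit on month and year, then predict the 120 months after the data ends.
X = df[["month", "year"]]
y = df["LandAverageTemperature"]
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

future_dates = pd.date_range("2016-01-01", periods=120, freq="MS")
future = pd.DataFrame({"month": future_dates.month, "year": future_dates.year})
future["pred"] = model.predict(future[["month", "year"]])
print(future.head(12))
```

One caveat: a random forest cannot extrapolate beyond the range of its training features. Every year after the last training year follows the same path through each tree as the last training year does, so the 10-year forecast is just the last year's seasonal pattern repeated, and any trend is lost. If the trend matters, a model that can extrapolate (for example, a linear term for the year, or a dedicated time series model) is probably a better fit.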


Source: https://stackoverflow.com/questions/59460378/how-to-predict-correctly-in-sklearn-randomforestregressor
