Is this random forest overfitted?


Question


I am training a RandomForestRegressor from the scikit-learn library on temporal data and want the forest to predict the trend (the next 4 points) given date and time as features.

I predict the data in small intervals (4 data points at a time) and reconstruct the whole-day trend by slicing the dataset, so I can compare it to the actual values and calculate the MSE.
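Roughly, the evaluation loop looks like the following sketch. The feature matrix, target series and window handling here are simplified placeholders, not my actual pipeline:

```python
# Simplified sketch of the sliding-window evaluation (placeholder data, not the real pipeline).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

WINDOW = 4                                   # predict 4 points at a time
rng = np.random.default_rng(0)
X = rng.random((96, 5))                      # e.g. 96 timestamps with 5 date/time features
y = np.sin(np.linspace(0, 6 * np.pi, 96))    # placeholder target series

reconstructed = np.empty_like(y)
for start in range(0, len(y), WINDOW):
    stop = start + WINDOW
    train_idx = np.r_[0:start, stop:len(y)]  # train on everything outside the current window
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    reconstructed[start:stop] = model.predict(X[start:stop])  # predict the held-out window

# Compare the reconstructed whole-day trend against the actual values
print("MSE over the reconstructed day:", mean_squared_error(y, reconstructed))
```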

As you can see in the first graph below, the predicted line has some patches that are very similar to the actual data line. The only problem is that those similar patches are ahead in time compared to the actual line (marked with black circles on the graph).

Does this mean that the model has memorized the training data and is just spitting out the last values it remembered? I have not done any model tuning, only data collection and evaluation of the results so far.

I added a graph without the black markings so it is easier to see the lines.

EDIT: I have redone the prediction, as I was afraid there was a bug in the code that produced the previous graphs.

As suggested by @vpekar in the comments, I have a) compared the MSEs from out-of-sample and in-sample evaluations. The median MSE over the ten out-of-sample evaluations is 4.14e-08, while the median MSE over the ten in-sample evaluations is 5.30e-08. Figure 3 shows roughly what both results looked like. Figure 3

All of these evaluations were done with a standard, non-tuned model:

`RandomForestRegressor(n_estimators=10000, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=5, max_leaf_nodes=None, bootstrap=False, oob_score=False, n_jobs=1, verbose=0, warm_start=False)`
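For reference, the in-sample vs out-of-sample comparison was done along these lines. This is only a sketch with placeholder data and a reduced number of trees to keep it fast; the real features and target come from my dataset:

```python
# Sketch of the in-sample vs out-of-sample comparison (placeholder data, fewer trees).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 5))                    # placeholder date/time features
y = np.sin(np.linspace(0, 20, 500))         # placeholder target

in_sample_mses, out_of_sample_mses = [], []
for seed in range(10):                      # ten repeated evaluations
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestRegressor(n_estimators=1000, max_features=5,
                                  bootstrap=False, n_jobs=1, random_state=seed)
    model.fit(X_tr, y_tr)
    in_sample_mses.append(mean_squared_error(y_tr, model.predict(X_tr)))      # error on training data
    out_of_sample_mses.append(mean_squared_error(y_te, model.predict(X_te)))  # error on held-out data

print("median in-sample MSE:    ", np.median(in_sample_mses))
print("median out-of-sample MSE:", np.median(out_of_sample_mses))
```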

After that, I b) ran a random search over the model parameters. The best out-of-sample result (Figure 4) was an MSE of 6.3e-06 (about 100 times worse than the MSE of the default model), obtained with the following parameters:

`bootstrap=False, criterion='mse', max_depth=35, max_features=1, max_leaf_nodes=60, min_impurity_decrease=0, min_impurity_split=None, min_samples_leaf=74, min_samples_split=64, min_weight_fraction_leaf=0, n_estimators=10000, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False`

Figure 4
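The random search itself was along these lines. This is a sketch: the parameter ranges, placeholder data and the reduced tree count are illustrative, not the exact setup I used:

```python
# Sketch of the random parameter search (illustrative parameter ranges, placeholder data).
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.random((500, 5))                  # placeholder date/time features
y = np.sin(np.linspace(0, 20, 500))       # placeholder target

param_distributions = {
    "max_depth": randint(2, 50),
    "max_features": randint(1, 6),
    "max_leaf_nodes": randint(10, 200),
    "min_samples_leaf": randint(1, 100),
    "min_samples_split": randint(2, 100),
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),  # fewer trees to keep the search fast
    param_distributions=param_distributions,
    n_iter=50,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV MSE:    ", -search.best_score_)
```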

Question: Does this mean that the default RandomForestRegressor parameters lead to an overfitted model in the case of my data? (Figure 3)

Source: https://stackoverflow.com/questions/54389360/is-this-random-forest-overfitted
