Question
I am training a `RandomForestRegressor` from the scikit-learn library on temporal data and want the forest to predict the trend (the next 4 points) given date and time as features.
I predict the data in small intervals (4 data points at a time) and reconstruct the whole-day trend by slicing the dataset, so that I can compare it to the actual values and calculate the MSE.
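For reference, here is a minimal sketch of that interval-wise evaluation. The synthetic minute-of-day feature and sine-shaped target below are only stand-ins for my real date/time features and data, which are not shown:

```python
# Minimal sketch of the interval-wise evaluation; the synthetic minute-of-day
# feature and target are stand-ins for the real date/time features and data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
minute_of_day = np.tile(np.arange(1440), 2).reshape(-1, 1)  # two days of minute-of-day
target = np.sin(minute_of_day.ravel() / 1440 * 2 * np.pi) + rng.normal(0, 0.05, 2 * 1440)

X_train, y_train = minute_of_day[:1440], target[:1440]  # day 1: training
X_day, y_day = minute_of_day[1440:], target[1440:]      # day 2: evaluation

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict the held-out day in 4-point slices, then stitch the slices back together.
step = 4
pieces = [model.predict(X_day[i:i + step]) for i in range(0, len(X_day), step)]
y_pred = np.concatenate(pieces)

print(f"Reconstructed-day MSE: {mean_squared_error(y_day, y_pred):.2e}")
```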
As you can see on the graph below (the first one), the predicted line has some patches that are very similar to the actual data line. The only problem is that those similar patches are ahead in time compared to the actual line (marked with black circles on the graph).
Does this mean that the model has learned the training data and just spits out the last values it remembered? I have not done any model tuning, only data collection and evaluation of the results so far.
I have added a graph without the black markings so it is easier to see the lines.
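One rough way to quantify that apparent shift is to find the lag at which the predicted curve best correlates with the actual one. This helper is only an illustration, not something used for the figures above; `y_day` and `y_pred` are assumed to be the reconstructed day from the sketch earlier:

```python
import numpy as np

def best_lag(y_true, y_pred, max_lag=20):
    """Shift (in samples) at which y_pred best matches y_true; a positive
    value means the predicted curve reproduces the actual one with a delay."""
    n = min(len(y_true), len(y_pred))
    a = np.asarray(y_true[:n], dtype=float)
    b = np.asarray(y_pred[:n], dtype=float)
    best_k, best_c = 0, -np.inf
    for k in range(-max_lag, max_lag + 1):
        # Correlate a[t] with b[t + k] over the overlapping range.
        if k >= 0:
            c = np.corrcoef(a[:n - k], b[k:])[0, 1]
        else:
            c = np.corrcoef(a[-k:], b[:n + k])[0, 1]
        if c > best_c:
            best_k, best_c = k, c
    return best_k

# e.g. best_lag(y_day, y_pred) on the reconstructed day from the sketch above
```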
EDIT: I have redone the prediction, as I was afraid there was a bug in the code that produced the previous graphs.
As suggested by @vpekar in the comments, I have a) compared the MSEs from out-of-sample and in-sample evaluations. The median MSE over the ten out-of-sample evaluations is 4.14e-08, while the median MSE over the ten in-sample evaluations is 5.30e-08. Figure 3 shows roughly what both results looked like.
Figure 3
All of this evaluation was done with a standard, non-tuned model:
`RandomForestRegressor(n_estimators=10000, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=5, max_leaf_nodes=None, bootstrap=False, oob_score=False, n_jobs=1, verbose=0, warm_start=False)`
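For context, a simplified sketch of what such an in-sample vs. out-of-sample comparison can look like is below. The use of `TimeSeriesSplit` and the toy data are assumptions for illustration, not the exact procedure behind the numbers above:

```python
# Sketch: compare in-sample vs. out-of-sample MSE over chronological splits.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def compare_mse(X, y, n_splits=10):
    in_sample, out_of_sample = [], []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        # In-sample: error on the data the forest was fitted on.
        in_sample.append(mean_squared_error(y[train_idx], model.predict(X[train_idx])))
        # Out-of-sample: error on the later, held-out data.
        out_of_sample.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    return np.median(in_sample), np.median(out_of_sample)

# Toy usage with a synthetic daily curve standing in for the real features.
rng = np.random.default_rng(0)
X = np.tile(np.arange(1440), 5).reshape(-1, 1)  # five days of minute-of-day
y = np.sin(X.ravel() / 1440 * 2 * np.pi) + rng.normal(0, 0.05, len(X))
print(compare_mse(X, y))
```

An in-sample MSE much lower than the out-of-sample one would be the classic overfitting signal; in my case the two medians are of the same order.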
After that I ran b) a random search over the model parameters. The best out-of-sample result (Figure 4) was an MSE of 6.3e-06 (about 100 times worse than the MSE of the default model), obtained with the following parameters:
`bootstrap=False, criterion='mse', max_depth=35, max_features=1, max_leaf_nodes=60, min_impurity_decrease=0, min_impurity_split=None, min_samples_leaf=74, min_samples_split=64, min_weight_fraction_leaf=0, n_estimators=10000, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False`
Figure 4
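The random search in b) was along these lines. This is only a sketch: the exact search space and CV scheme are not reproduced here, so the parameter ranges and the use of `RandomizedSearchCV` with `TimeSeriesSplit` are assumptions:

```python
# Sketch of a random hyper-parameter search with a chronological CV split;
# the ranges below are illustrative, not the search space actually used.
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

param_distributions = {
    "max_depth": randint(5, 50),
    "max_features": randint(1, 6),
    "max_leaf_nodes": randint(10, 200),
    "min_samples_leaf": randint(1, 100),
    "min_samples_split": randint(2, 100),
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),  # fewer trees to keep the search fast
    param_distributions=param_distributions,
    n_iter=50,
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
    random_state=0,
    n_jobs=-1,
)
# search.fit(X_train, y_train)  # X_train/y_train: the training slice (assumed names)
# print(search.best_params_, -search.best_score_)
```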
Question: Does this mean that the default RandomForestRegressor parameters lead to an overfitted model in the case of my data? (Figure 3)
Source: https://stackoverflow.com/questions/54389360/is-this-random-forest-overfitted