Cross Validation metrics with PySpark
When we do a k-fold cross validation, we are testing how well a model behaves when predicting data it has never seen. If I split my dataset into 90% training and 10% test and analyse the model's performance, there is no guarantee that my test set doesn't contain only the 10% "easiest" or "hardest" points to predict. By doing a 10-fold cross validation, I can be assured that every point is used exactly once for testing and nine times for training. As (in this case) the model will be tested 10 times, we can analyse the metrics from those tests, which gives us a better understanding of how the model is actually performing.
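As a minimal sketch of what this looks like in PySpark, here is a 10-fold cross validation run with CrossValidator. The synthetic dataset, the logistic regression estimator, the column names, and the parameter values are illustrative assumptions, not a specific recommendation:

```python
import random

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cv-metrics-sketch").getOrCreate()

# Synthetic toy data: the label says whether the two features sum past 1.0
rows = []
for _ in range(200):
    x, y = random.random(), random.random()
    rows.append((Vectors.dense([x, y]), float(x + y > 1.0)))
df = spark.createDataFrame(rows, ["features", "label"])

lr = LogisticRegression(featuresCol="features", labelCol="label")

# An empty grid means we cross-validate a single model configuration
grid = ParamGridBuilder().build()

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    numFolds=10,  # the 10-fold setup described above
    seed=42,
)

cv_model = cv.fit(df)

# avgMetrics holds one value per grid entry: the chosen metric
# averaged over the 10 held-out folds
print(cv_model.avgMetrics)
```

Note that `avgMetrics` only exposes the metric averaged across the folds, one value per parameter combination in the grid, so analysing the individual fold results takes some extra work.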