I have a Spark DataFrame that I want to split into train, validation, and test sets in the ratio 0.60, 0.20, 0.20.
I used the following code for the same:
TL;DR If you want to split a DataFrame, use the randomSplit method:
ratings_sdf.randomSplit([0.6, 0.2, 0.2])
Your code is wrong on multiple levels, but two fundamental problems make it broken beyond repair:
Spark transformations can be evaluated an arbitrary number of times, so the functions you use should be referentially transparent and free of side effects. Your code evaluates split_sdf multiple times and uses the stateful RNG data_split, so the results are different each time.
This results in the behavior you describe, where each child sees a different state of the parent RDD.
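To make the first problem concrete, here is a minimal plain-Python sketch (no Spark involved) of what re-evaluation with a stateful RNG does. The names assign_bucket_stateful and assign_bucket_pure are hypothetical, and a lazy generator only stands in for a lazily evaluated transformation; it is not how Spark assigns rows internally.

```python
import random

# Stateful RNG shared by every evaluation. It is seeded only so the demo is
# reproducible; its state still advances each time the pipeline is re-run.
rng = random.Random(0)

def assign_bucket_stateful(rows):
    """Lazily pair each row with a random draw taken from the shared RNG."""
    for row in rows:
        yield row, rng.random()

def assign_bucket_pure(rows, seed=42):
    """Pair each row with a draw that depends only on (seed, row)."""
    for row in rows:
        yield row, random.Random(seed * 1_000_003 + row).random()

def split(assign, rows):
    # Each subset re-evaluates the "parent", much as each child dataset may.
    train = {r for r, u in assign(rows) if u < 0.6}
    validation = {r for r, u in assign(rows) if 0.6 <= u < 0.8}
    test = {r for r, u in assign(rows) if u >= 0.8}
    return train, validation, test

def overlap_and_missing(parts, rows):
    train, validation, test = parts
    overlap = len(train & validation) + len(train & test) + len(validation & test)
    missing = len(set(rows) - (train | validation | test))
    return overlap, missing

rows = list(range(1000))
bad_overlap, bad_missing = overlap_and_missing(split(assign_bucket_stateful, rows), rows)
good_overlap, good_missing = overlap_and_missing(split(assign_bucket_pure, rows), rows)

print("stateful:", bad_overlap, "rows in more than one subset,", bad_missing, "rows in none")
print("pure:    ", good_overlap, "rows in more than one subset,", good_missing, "rows in none")
```

With the stateful version every pass draws new numbers, so a row can land in several subsets or in none. With the pure version each row always gets the same draw, so the three subsets partition the data no matter how many times they are evaluated.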
You don't properly initialize the RNG, and as a consequence the random values you get are not independent.
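One way this can happen, sketched in plain Python under the assumption that a single RNG object is created once and a serialized copy is handed to each task: every copy starts from the same state and replays the same sequence. The helper names below are hypothetical, and pickle is used only to imitate shipping an object to a worker.

```python
import pickle
import random

# RNG created once, standing in for an object captured by a closure.
shared_rng = random.Random(0)
payload = pickle.dumps(shared_rng)

def draws_from_copy(serialized_rng, n=5):
    """Simulate a task that deserializes its own copy of the shared RNG."""
    task_rng = pickle.loads(serialized_rng)
    return [task_rng.random() for _ in range(n)]

# Two "partitions" each receive an identical copy of the RNG state.
partition_0 = draws_from_copy(payload)
partition_1 = draws_from_copy(payload)
identical = partition_0 == partition_1  # same state in, same sequence out

def draws_with_partition_seed(base_seed, partition_index, n=5):
    """Initialize a separate RNG per partition from (base_seed, index)."""
    task_rng = random.Random(base_seed * 1_000_003 + partition_index)
    return [task_rng.random() for _ in range(n)]

seeded_0 = draws_with_partition_seed(0, 0)
seeded_1 = draws_with_partition_seed(0, 1)
distinct = seeded_0 != seeded_1

print("copies replay the same sequence:", identical)
print("per-partition seeds differ:", distinct)
```

In Spark the partition index is available through RDD.mapPartitionsWithIndex, which is one place where a per-partition seed like this could be derived. For splitting a DataFrame, though, randomSplit with an explicit seed already takes care of it.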