Spark Data Frame Random Splitting

悲哀的现实 2020-12-17 10:06

I have a Spark DataFrame that I want to divide into train, validation and test sets in the ratio 0.60, 0.20, 0.20.

I used the following code for the same:



        
1 Answer
  • 2020-12-17 10:34

    TL;DR If you want to split a DataFrame, use the randomSplit method:

    ratings_sdf.randomSplit([0.6, 0.2, 0.2])
    

    Your code is wrong on multiple levels, but two fundamental problems make it broken beyond repair:

    • Spark transformations can be evaluated an arbitrary number of times, so the functions you use should be referentially transparent and free of side effects. Your code evaluates split_sdf multiple times and relies on the stateful RNG data_split, so the results differ on each evaluation.

      This produces the behavior you describe, where each child sees a different state of the parent RDD.

    • You don't properly initialize the RNG, and as a consequence the random values you get are not independent.
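
    The first problem can be reproduced without Spark. The sketch below is plain Python, not Spark code and not how randomSplit is implemented: a list stands in for the DataFrame, and every name in it (rows, split_stateful, label_row_pure, split_pure) is hypothetical. It assumes the parent labelling is recomputed once per child split, which is what lazy re-evaluation amounts to.

```python
import random
import zlib

rows = list(range(1000))  # stand-in for the rows of a DataFrame

# Broken pattern: a stateful RNG inside a step that is re-run for every child.
rng = random.Random(0)

def split_stateful(lo, hi):
    # Re-evaluating the parent draws fresh random labels on every call,
    # so each child split sees a different labelling of the same rows.
    labels = [rng.random() for _ in rows]
    return {r for r, u in zip(rows, labels) if lo <= u < hi}

bounds = [(0.0, 0.6), (0.6, 0.8), (0.8, 1.0)]
bad_parts = [split_stateful(lo, hi) for lo, hi in bounds]
bad_union = set().union(*bad_parts)

# Deterministic alternative: the label is a pure function of the row and a seed,
# so re-evaluation always reproduces the same labelling.
def label_row_pure(row, seed=42):
    return zlib.crc32(f"{seed}:{row}".encode()) / 2**32  # value in [0, 1)

def split_pure(lo, hi):
    return {r for r in rows if lo <= label_row_pure(r) < hi}

good_parts = [split_pure(lo, hi) for lo, hi in bounds]
good_union = set().union(*good_parts)

print("stateful: rows covered =", len(bad_union),
      "| sum of split sizes =", sum(len(p) for p in bad_parts))
print("pure:     rows covered =", len(good_union),
      "| sum of split sizes =", sum(len(p) for p in good_parts))
```

    With the stateful version, some rows land in more than one split and others in none. With the pure version, the three splits always partition the rows exactly, because each row's label never changes between evaluations.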
