Spark Data Frame Random Splitting


悲哀的现实

I have a Spark DataFrame that I want to split into train, validation, and test sets in the ratio 0.60, 0.20, 0.20.

I used the following code for the same:

1 answer

小蘑菇

2020-12-17 10:34
TL;DR: If you want to split a DataFrame, use the randomSplit method:
```
ratings_sdf.randomSplit([0.6, 0.2, 0.2])
```
Your code is wrong on multiple levels, but two fundamental problems make it broken beyond repair:
- Spark transformations can be evaluated an arbitrary number of times, so the functions you use should be referentially transparent and free of side effects. Your code evaluates split_sdf multiple times and uses the stateful RNG data_split, so the results differ on each evaluation.
  
  This produces the behavior you describe, where each child sees a different state of the parent RDD.
- You don't initialize the RNG properly, and as a consequence the random values you get are not independent.
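
The first problem can be reproduced without Spark. The following is a minimal plain-Python sketch, not the original code (which is not shown here) and not how Spark implements anything internally. It mimics a lazily re-evaluated parent by calling an assignment function once per child. All names in it (`assign_stateful`, `assign_pure`, `label_for`, `split`, the row count, and the seeds) are hypothetical and chosen only for illustration.

```python
import random

LABELS = ["train", "validation", "test"]
WEIGHTS = [0.6, 0.2, 0.2]
rows = list(range(1000))  # hypothetical row ids

# Broken: one stateful RNG shared by every evaluation of the parent.
shared_rng = random.Random(0)

def assign_stateful():
    # Each call advances shared_rng, so every evaluation yields new labels.
    return [(row, shared_rng.choices(LABELS, weights=WEIGHTS)[0]) for row in rows]

# Fixed: the label is a pure function of the row and a fixed seed.
def label_for(row, seed=42):
    row_rng = random.Random(f"{seed}:{row}")  # fresh RNG per row, no shared state
    return row_rng.choices(LABELS, weights=WEIGHTS)[0]

def assign_pure():
    return [(row, label_for(row)) for row in rows]

def split(assign):
    # Mimic three child datasets, each re-evaluating the parent on its own.
    return [{row for row, label in assign() if label == name} for name in LABELS]

def overlap_and_missing(parts):
    train, validation, test = parts
    overlap = (train & validation) | (train & test) | (validation & test)
    missing = set(rows) - (train | validation | test)
    return overlap, missing

bad_overlap, bad_missing = overlap_and_missing(split(assign_stateful))
good_overlap, good_missing = overlap_and_missing(split(assign_pure))

print(len(bad_overlap), len(bad_missing))    # rows in several splits, rows in none
print(len(good_overlap), len(good_missing))  # 0 0
```

With the stateful version, some rows land in more than one split and others in none, because each child sees a different labeling of the parent. With the pure version, the three splits are disjoint and cover every row no matter how many times the parent is recomputed. In Spark itself, calling randomSplit (optionally with a fixed seed argument) avoids having to hand-roll any of this.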