Spark train test split


难免孤独 2021-01-01 20:51

I am curious whether there is something similar to sklearn's StratifiedShuffleSplit (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) for apache-spa

4 Answers

渐次进展 (original poster)

2021-01-01 21:16
Although this answer is not specific to Spark, in Apache Beam I do the following to split the data into roughly 66% train and 33% test. This is just an illustrative example: you can customize the partition_fn below to be more sophisticated, for example to accept arguments that specify the number of buckets, to bias the selection towards something, or to ensure that the randomization is fair across dimensions:
```
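import random

import apache_beam as beam

raw_data = p | 'Read Data' >> Read(...)

clean_data = (raw_data
              | "Clean Data" >> beam.ParDo(CleanFieldsFn()))


def partition_fn(element, num_partitions):
    # Beam calls this with the element and the partition count;
    # return a uniformly random bucket index in [0, num_partitions).
    return random.randint(0, num_partitions - 1)

# Three equally likely buckets: two become train (~66%), one eval (~33%).
random_buckets = (clean_data | beam.Partition(partition_fn, 3))

clean_train_data = ((random_buckets[0], random_buckets[1])
                    | beam.Flatten())

clean_eval_data = random_buckets[2]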
```
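As a rough illustration of the "more sophisticated" partition function mentioned above, here is a minimal sketch in plain Python (no Beam required) that takes the train fraction as an argument and assigns buckets by hashing a key instead of calling `random`, so the same record always lands in the same split. The names `make_partition_fn`, `train_fraction` and `key_fn`, and the sample `records`, are hypothetical and not part of the original answer.

```python
import hashlib
from collections import Counter


def make_partition_fn(train_fraction=2 / 3, key_fn=repr):
    """Build a partition function with the (element, num_partitions) shape.

    Hypothetical helper: bucket 0 is train, bucket 1 is eval.
    """
    def partition_fn(element, num_partitions=2):
        # Hash a stable key so the assignment is reproducible across runs.
        digest = hashlib.sha256(key_fn(element).encode("utf-8")).digest()
        # Map the first 8 bytes of the digest to a float in [0, 1).
        u = int.from_bytes(digest[:8], "big") / 2 ** 64
        return 0 if u < train_fraction else 1

    return partition_fn


# Made-up records, only used to check the resulting proportions.
records = [{"id": i, "label": i % 2} for i in range(10_000)]
partition_fn = make_partition_fn(train_fraction=2 / 3,
                                 key_fn=lambda r: str(r["id"]))

counts = Counter(partition_fn(r, 2) for r in records)
train_share = counts[0] / len(records)
print(f"train share: {train_share:.3f}")
```

Because the function has the same `(element, num_partitions)` shape, it could presumably be passed to `beam.Partition(partition_fn, 2)` in place of the random version. Note that this is not a true stratified split: each class ends up with about the requested fraction only in expectation.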