Spark train test split

Backend · Unresolved · 4 answers · 1512 views
难免孤独 2021-01-01 20:51

I am curious whether there is something similar to sklearn's http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html for Apache Spark.

4 Answers
  •  渐次进展
    2021-01-01 21:16

    Although this answer is not specific to Spark, in Apache Beam I split data into 66% train and 33% test like this (an illustrative example; you can make the partition_fn below more sophisticated, e.g. accept arguments to specify the number of buckets, bias selection towards something, or ensure the randomization is fair across dimensions):

    import random

    raw_data = p | 'Read Data' >> Read(...)

    clean_data = (raw_data
                  | 'Clean Data' >> beam.ParDo(CleanFieldsFn()))

    # Beam passes (element, num_partitions) to the partition function.
    def partition_fn(element, num_partitions):
        return random.randint(0, num_partitions - 1)

    random_buckets = clean_data | beam.Partition(partition_fn, 3)

    # Buckets 0 and 1 (~66%) become train; bucket 2 (~33%) is eval.
    clean_train_data = ((random_buckets[0], random_buckets[1])
                        | beam.Flatten())

    clean_eval_data = random_buckets[2]
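    In Spark itself, the closest built-ins are `DataFrame.randomSplit(weights, seed)` for a plain random split and `DataFrame.stat.sampleBy(col, fractions, seed)` for stratified sampling (the test set can then be taken as `df.subtract(train)`). The per-stratum Bernoulli sampling that `sampleBy` performs can be sketched in plain Python; `sample_by` here is an illustrative stand-in, not Spark's implementation:

    ```python
    import random

    def sample_by(rows, key_fn, fractions, seed=42):
        """Per-stratum Bernoulli sampling, the same idea as Spark's
        DataFrame.stat.sampleBy(col, fractions, seed): each row whose
        stratum key appears in `fractions` is kept independently with
        that stratum's probability, so class proportions are preserved
        in expectation. Rows with unlisted keys are dropped."""
        rng = random.Random(seed)
        return [r for r in rows if rng.random() < fractions.get(key_fn(r), 0.0)]

    # Example: keep ~66% of each class as the train set.
    rows = [("a", i) for i in range(1000)] + [("b", i) for i in range(1000)]
    train = sample_by(rows, lambda r: r[0], {"a": 0.66, "b": 0.66}, seed=7)
    test = [r for r in rows if r not in set(train)]
    ```

    Note this gives approximate fractions (each row is an independent coin flip), just like `sampleBy`; it does not guarantee exact counts the way sklearn's StratifiedShuffleSplit does.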
