Stratified sampling with pyspark

误落风尘 2020-12-09 13:12

I have a Spark DataFrame with one column that contains lots of zeros and very few ones (only 0.01% ones).

I'd like to take a random stratified sample that preserves this class ratio.

4 Answers
  •  星月不相逢
    2020-12-09 13:26

    This can be accomplished pretty easily with `randomSplit` and `union` in PySpark.

    # read in data
    df = spark.read.csv(file, header=True)
    # split the dataframe between 0s and 1s
    zeros = df.filter(df["Target"] == 0)
    ones = df.filter(df["Target"] == 1)
    # split each class into training and testing sets
    train0, test0 = zeros.randomSplit([0.8, 0.2], seed=1234)
    train1, test1 = ones.randomSplit([0.8, 0.2], seed=1234)
    # stack the datasets back together
    train = train0.union(train1)
    test = test0.union(test1)
    
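    The idea behind this answer — split each class separately, then recombine — can be illustrated without a Spark cluster. The sketch below uses plain Python on a toy list of rows with a made-up `Target` column (names and sizes are illustrative, not from the question's data):

    ```python
    import random

    # Toy data: 10 ones among 1000 rows, mimicking a rare positive class.
    random.seed(1234)
    rows = [{"Target": 1} for _ in range(10)] + [{"Target": 0} for _ in range(990)]

    def stratified_split(rows, key, train_frac):
        """Split each class separately, then pool the pieces
        (the same idea as randomSplit per class + union)."""
        train, test = [], []
        for c in {r[key] for r in rows}:
            group = [r for r in rows if r[key] == c]
            random.shuffle(group)
            cut = int(len(group) * train_frac)
            train.extend(group[:cut])
            test.extend(group[cut:])
        return train, test

    train, test = stratified_split(rows, "Target", 0.8)
    # Exactly 80% of each class lands in train, so the rare class
    # keeps its ratio: 8 of the 10 ones in train, 2 in test.
    print(len(train), len(test), sum(r["Target"] for r in train))
    ```

    Note that a per-class `randomSplit` only approximates the fractions on each class (rows are assigned independently), whereas PySpark's built-in `DataFrame.sampleBy(col, fractions, seed)` is another option if you only need one stratified sample rather than a train/test pair.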
