Split a time series PySpark DataFrame into train & test sets without using random split

野趣味 2020-12-19 08:36

I have a Spark time series DataFrame. I would like to split it 80-20 (train-test). As this is time series data, I don't want to do a random split.

1 Answer
  • 2020-12-19 08:52

    You can use pyspark.sql.functions.percent_rank() to get the percentile rank of each row in your DataFrame, ordered by the timestamp/date column. Then take all rows with a rank <= 0.8 as your training set and the rest as your test set.

    For example, if you had the following DataFrame:

    df.show(truncate=False)
    #+---------------------+---+
    #|date                 |x  |
    #+---------------------+---+
    #|2018-01-01 00:00:00.0|0  |
    #|2018-01-02 00:00:00.0|1  |
    #|2018-01-03 00:00:00.0|2  |
    #|2018-01-04 00:00:00.0|3  |
    #|2018-01-05 00:00:00.0|4  |
    #+---------------------+---+
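
    For a self-contained reproduction, a DataFrame like this could be built along the following lines (a minimal sketch assuming an active SparkSession; to_timestamp is used so that date is a real timestamp column):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp
    
    # Hypothetical setup, not part of the original answer
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2018-01-01", 0), ("2018-01-02", 1), ("2018-01-03", 2),
         ("2018-01-04", 3), ("2018-01-05", 4)],
        ["date", "x"],
    ).withColumn("date", to_timestamp("date"))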
    

    You'd want the first 4 rows in your training set and the last one in your test set. First, add a rank column:

    from pyspark.sql.functions import percent_rank
    from pyspark.sql import Window
    
    # An empty partitionBy() treats the whole DataFrame as one window. Note that
    # this moves all rows to a single partition, which is fine for small data
    # but can be a bottleneck on very large DataFrames.
    df = df.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("date")))
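
    Since percent_rank() is computed as (rank - 1) / (rows in window - 1), the intermediate DataFrame for this five-row example should look like:

    df.show(truncate=False)
    #+---------------------+---+----+
    #|date                 |x  |rank|
    #+---------------------+---+----+
    #|2018-01-01 00:00:00.0|0  |0.0 |
    #|2018-01-02 00:00:00.0|1  |0.25|
    #|2018-01-03 00:00:00.0|2  |0.5 |
    #|2018-01-04 00:00:00.0|3  |0.75|
    #|2018-01-05 00:00:00.0|4  |1.0 |
    #+---------------------+---+----+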
    

    Now use rank to split your data into train and test:

    train_df = df.where("rank <= .8").drop("rank")
    train_df.show()
    #+---------------------+---+
    #|date                 |x  |
    #+---------------------+---+
    #|2018-01-01 00:00:00.0|0  |
    #|2018-01-02 00:00:00.0|1  |
    #|2018-01-03 00:00:00.0|2  |
    #|2018-01-04 00:00:00.0|3  |
    #+---------------------+---+
    
    test_df = df.where("rank > .8").drop("rank")
    test_df.show()
    #+---------------------+---+
    #|date                 |x  |
    #+---------------------+---+
    #|2018-01-05 00:00:00.0|4  |
    #+---------------------+---+
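
    If you need this split in more than one place, the same idea can be wrapped in a small helper. The function below is only an illustrative sketch (chronological_split and its signature are hypothetical, not a built-in API):

    from pyspark.sql import DataFrame, Window
    from pyspark.sql.functions import percent_rank
    
    def chronological_split(df: DataFrame, time_col: str, train_fraction: float = 0.8):
        # Hypothetical helper: order rows by time_col and put the earliest
        # train_fraction of rows in train, the remainder in test.
        ranked = df.withColumn(
            "__rank", percent_rank().over(Window.partitionBy().orderBy(time_col))
        )
        train = ranked.where(ranked["__rank"] <= train_fraction).drop("__rank")
        test = ranked.where(ranked["__rank"] > train_fraction).drop("__rank")
        return train, test
    
    train_df, test_df = chronological_split(df, "date", 0.8)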
    