SparkSQL on pyspark: how to generate time series?

独厮守ぢ · 2020-12-30 08:53

I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames and then build a query that generates several time series based on a start and a stop date.
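
    For context, a typical way to load a PostgreSQL table into a DataFrame is Spark's JDBC reader; the connection URL, table name, and credentials below are placeholder assumptions, not details from the question:

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder URL
        .option("dbtable", "my_table")                           # placeholder table
        .option("user", "user")                                  # placeholder credentials
        .option("password", "password")
        .option("driver", "org.postgresql.Driver")
        .load()
    )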

5 Answers
  •  执笔经年
    2020-12-30 09:25

@Rakesh's answer is correct, but I would like to share a less verbose solution:

    import datetime
    from pyspark.sql.types import ArrayType, DateType

    # UDF: build the list of dates from start to stop, inclusive
    def generate_date_series(start, stop):
        return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

    # Register the UDF so it can be called from SQL
    spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()))
    
    # mydf is a DataFrame with columns `start` and `stop` of type DateType()
    mydf.createOrReplaceTempView("mydf")
    
    spark.sql("SELECT explode(generate_date_series(start, stop)) FROM mydf").show()
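
    For completeness, a minimal end-to-end sketch reusing the generate_date_series UDF registered above; the SparkSession setup and the sample dates are illustrative assumptions, not part of the original question:

    import datetime
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input: a single row covering a three-day range
    mydf = spark.createDataFrame(
        [(datetime.date(2020, 12, 1), datetime.date(2020, 12, 3))],
        ["start", "stop"],
    )
    mydf.createOrReplaceTempView("mydf")

    # explode() turns the array of dates into one row per date
    spark.sql("SELECT explode(generate_date_series(start, stop)) AS day FROM mydf").show()
    # +----------+
    # |       day|
    # +----------+
    # |2020-12-01|
    # |2020-12-02|
    # |2020-12-03|
    # +----------+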
    
