SparkSQL on pyspark: how to generate time series?

独厮守ぢ 2020-12-30 08:53

I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames and then build a query that generates several time series based on a start and

5 Answers
  •  悲&欢浪女
    2020-12-30 09:18

    Suppose you have a DataFrame df from Spark SQL. Try this:

    import datetime

    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    def timeseriesDF(start, total):
        # Build a list of `total` consecutive dates starting at `start`.
        # Inside a UDF the column values are plain datetime.date objects,
        # so use timedelta arithmetic rather than Column functions.
        return [start + datetime.timedelta(days=i) for i in range(total)]

    timeseries_udf = F.udf(timeseriesDF, T.ArrayType(T.DateType()))

    df.withColumn(
            "t_series",
            # datediff(stop, start) is the gap in days; +1 makes the
            # series inclusive of both endpoints.
            timeseries_udf(df.start, F.datediff(df.stop, df.start) + 1),
        ).select(F.explode("t_series")).show()
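
    The UDF's core date arithmetic can be sanity-checked with plain Python, independent of Spark, since the column values arrive in the UDF as ordinary `datetime.date` objects (a minimal sketch of the same logic):

    ```python
    import datetime

    def timeseriesDF(start, total):
        # Consecutive dates beginning at `start`, inclusive.
        return [start + datetime.timedelta(days=i) for i in range(total)]

    # Three days starting 2020-12-30 cross the year boundary correctly.
    dates = timeseriesDF(datetime.date(2020, 12, 30), 3)
    print(dates)
    ```

    As a side note, on Spark 2.4+ the built-in `sequence(start, stop, interval 1 day)` SQL expression can generate the same date array without a UDF.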
    
