PySpark: how to resample frequencies

Asked by 广开言路 · 2020-12-03 15:48

Imagine a Spark DataFrame consisting of value observations from variables. Each observation has a specific timestamp, and those timestamps are not the same between different variables.

2 Answers
  •  醉梦人生
    2020-12-03 16:20

    I once answered a similar question; it's a bit of a hack, but the idea makes sense in your case. Map every value onto a list, then flatten the list vertically.


    From: Inserting records in a spark dataframe:

    You can generate timestamp ranges, flatten them and select rows

    import pyspark.sql.functions as func
    from pyspark.sql.types import IntegerType, ArrayType

    # `sc` is the SparkContext provided by the Spark shell / session
    a = sc.parallelize([[670098928, 50], [670098930, 53], [670098934, 55]])\
        .toDF(['timestamp', 'price'])

    # Expand each timestamp into the next five timestamps.
    # list() is needed on Python 3, where range() is lazy.
    f = func.udf(lambda x: list(range(x, x + 5)), ArrayType(IntegerType()))

    # Explode the ranges into rows, then collapse overlapping rows
    # by taking the max price per timestamp.
    a.withColumn('timestamp', f(a.timestamp))\
        .withColumn('timestamp', func.explode(func.col('timestamp')))\
        .groupBy('timestamp')\
        .agg(func.max(func.col('price')))\
        .show()
    
    +---------+----------+
    |timestamp|max(price)|
    +---------+----------+
    |670098928|        50|
    |670098929|        50|
    |670098930|        53|
    |670098931|        53|
    |670098932|        53|
    |670098933|        53|
    |670098934|        55|
    |670098935|        55|
    |670098936|        55|
    |670098937|        55|
    |670098938|        55|
    +---------+----------+
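
    On Spark 2.4+ the same expand-and-flatten idea works without a Python UDF, using the built-in sequence function. A minimal sketch, assuming the same DataFrame a and the same 5-step window as above:

    import pyspark.sql.functions as func

    # sequence() builds the per-row timestamp range natively on the JVM,
    # avoiding Python UDF serialization; [t, t+4] inclusive matches range(x, x+5).
    a.withColumn('timestamp',
                 func.explode(func.sequence(func.col('timestamp'),
                                            func.col('timestamp') + 4)))\
        .groupBy('timestamp')\
        .agg(func.max(func.col('price')))\
        .orderBy('timestamp')\
        .show()

    This should produce the same table as above; the hack is unchanged, only the range generation moves from Python into Spark's built-in functions.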
    
