PySpark DataFrame: cast two columns into a new column of tuples based on the value of a third column

Backend · Unresolved · 2 answers · 1025 views

夕颜 2021-01-07 09:55

As the subject describes, I have a PySpark DataFrame in which I need to cast two columns into a new column that is a list of tuples, based on the value of a third column. This cast

2 Answers

    夕颜 (OP)
    2021-01-07 10:15

    Assuming your DataFrame is called df:

    from pyspark.sql.functions import struct, collect_list

    gdf = (
        df.select("product_id", "category", struct("purchase_date", "warranty_days").alias("pd_wd"))
          .groupBy("product_id")
          .pivot("category")
          .agg(collect_list("pd_wd"))
    )
    

    Essentially, you have to combine purchase_date and warranty_days into a single column using struct(). Then you just group by product_id, pivot by category, and aggregate with collect_list().
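    For intuition, the group-pivot-collect step can be mimicked in plain Python on a few hypothetical sample rows (the column names match the snippet above; the data itself is made up for illustration):

    ```python
    from collections import defaultdict

    # Hypothetical sample rows: (product_id, category, purchase_date, warranty_days)
    rows = [
        (1, "electronics", "2020-01-01", 365),
        (1, "electronics", "2020-06-15", 90),
        (1, "appliance",   "2020-03-10", 730),
        (2, "appliance",   "2020-05-20", 365),
    ]

    # Group by product_id, pivot by category, and collect
    # (purchase_date, warranty_days) tuples -- the analogue of
    # groupBy().pivot().agg(collect_list(...)) above.
    pivoted = defaultdict(lambda: defaultdict(list))
    for product_id, category, purchase_date, warranty_days in rows:
        pivoted[product_id][category].append((purchase_date, warranty_days))

    print(dict(pivoted[1]))
    # each product_id maps category -> list of (purchase_date, warranty_days) pairs
    ```

    In the real PySpark result, each category becomes its own column of the pivoted DataFrame, holding an array of structs rather than a Python list of tuples.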
