Creating indices for each group in Spark dataframe


I have a dataframe in Spark with 2 columns, group_id and value, where value is a double. I would like to group the data based on the group_id column and create an index column that numbers the rows within each group, ordered by value.

1 Answer

    You can use Window functions to create a rank column based on value, partitioned by group_id:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import rank, dense_rank
    # Define window
    window = Window.partitionBy(df['group_id']).orderBy(df['value'])
    # Create column
    df.select('*', rank().over(window).alias('index')).show()
    +--------+-----+-----+
    |group_id|value|index|
    +--------+-----+-----+
    |       1| -1.7|    1|
    |       1|  0.0|    2|
    |       1|  1.3|    3|
    |       1|  2.7|    4|
    |       1|  3.4|    5|
    |       2|  0.8|    1|
    |       2|  2.3|    2|
    |       2|  5.9|    3|
    +--------+-----+-----+
    

    Because you first select '*', the code above keeps all the other columns as well. However, your second example shows that you are looking for the function dense_rank(), which gives a rank column with no gaps:

    df.select('*', dense_rank().over(window).alias('index'))
    
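    The two functions only behave differently when a group contains tied values: rank() leaves gaps after ties, while dense_rank() does not. Below is a minimal, self-contained sketch (not part of the original answer, using hypothetical data with one tie) that illustrates the difference:

    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    from pyspark.sql.functions import rank, dense_rank

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: group 2 has two rows with the same value (0.8)
    df = spark.createDataFrame(
        [(1, -1.7), (1, 0.0), (2, 0.8), (2, 0.8), (2, 2.3)],
        ['group_id', 'value'],
    )

    window = Window.partitionBy('group_id').orderBy('value')

    df.select(
        '*',
        rank().over(window).alias('rank'),              # group 2: 1, 1, 3 (gap after the tie)
        dense_rank().over(window).alias('dense_rank'),  # group 2: 1, 1, 2 (no gap)
    ).show()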
