I have a DataFrame in Spark with two columns, group_id and value, where value is a double. I would like to group the data by group_id and, within each group, add an index column that numbers the rows ordered by value.
You can use Window functions to create a rank column based on value, partitioned by group_id.
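If you don't already have the DataFrame in hand, here is a quick sketch for building one like the example below (the column names come from the question and the values from the expected output; everything else is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data: (group_id, value), with value as a double
df = spark.createDataFrame(
    [(1, -1.7), (1, 0.0), (1, 1.3), (1, 2.7), (1, 3.4),
     (2, 0.8), (2, 2.3), (2, 5.9)],
    ['group_id', 'value'])

With a df like that, the window and the rank column are: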
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank
# Define window
window = Window.partitionBy(df['group_id']).orderBy(df['value'])
# Create column
df.select('*', rank().over(window).alias('index')).show()
+--------+-----+-----+
|group_id|value|index|
+--------+-----+-----+
| 1| -1.7| 1|
| 1| 0.0| 2|
| 1| 1.3| 3|
| 1| 2.7| 4|
| 1| 3.4| 5|
| 2| 0.8| 1|
| 2| 2.3| 2|
| 2| 5.9| 3|
+--------+-----+-----+
Because you first select '*', the code above keeps all the other columns as well. However, your second example shows that you are looking for the function dense_rank(), which gives a rank column with no gaps:
df.select('*', dense_rank().over(window).alias('index')).show()
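If it helps to see the difference between the two functions, here is a small, self-contained sketch (the data is made up for illustration) where two rows tie on value: rank() leaves a gap after the tie, while dense_rank() does not:

from pyspark.sql import SparkSession
from pyspark.sql.functions import rank, dense_rank
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a tie inside group 1 (two rows share value 1.3)
df_ties = spark.createDataFrame(
    [(1, -1.7), (1, 1.3), (1, 1.3), (1, 2.7), (2, 0.8), (2, 2.3)],
    ['group_id', 'value'])

w = Window.partitionBy('group_id').orderBy('value')

df_ties.select(
    '*',
    rank().over(w).alias('rank'),              # 1, 2, 2, 4 within group 1
    dense_rank().over(w).alias('dense_rank')   # 1, 2, 2, 3 within group 1
).show()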