I have a dataframe where I want to assign ids within each window partition. For example, I have
id | col
---|----
 1 | a
 2 | a
 3 | b
 4 | c
 5 | c
You can assign a row_number to each distinct col value and then join that result back to the original dataframe.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val data = Seq(
  (1, "a"),
  (2, "a"),
  (3, "b"),
  (4, "c"),
  (5, "c")
).toDF("id", "col")

// Number each distinct col value, then join back so every row gets its group id.
val df2 = data.select("col").distinct()
  .withColumn("group", row_number().over(Window.orderBy("col")))

val result = data.join(df2, Seq("col"), "left")
  .drop("col")
The code is in Scala but can easily be translated to PySpark.
Hope this helps