I have a dataframe where I want to assign an ID to each window partition. For example, I have
id | col
---+----
 1 | a
 2 | a
 3 | b
 4 | c
 5 | c
So I want (based on grouping by column col):
id | group
---+------
 1 | 1
 2 | 1
 3 | 2
 4 | 3
 5 | 3
I want to use a window function, but I cannot find any way to assign an ID to each window. I need something like:
w = Window().partitionBy('col')
df = df.withColumn("group", id().over(w))
Is there any way to achieve something like that? (I cannot simply use col as the group ID because I am interested in creating a window over multiple columns.)
Simply using the built-in dense_rank function over a Window should give you your desired result:
from pyspark.sql import Window
import pyspark.sql.functions as f

# dense_rank assigns the same rank to rows with equal col, with no gaps
df.select('id', f.dense_rank().over(Window.orderBy('col')).alias('group')).show(truncate=False)
which should give you
+---+-----+
|id |group|
+---+-----+
|1 |1 |
|2 |1 |
|3 |2 |
|4 |3 |
|5 |3 |
+---+-----+
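Note that an unpartitioned Window moves all rows to a single partition (Spark logs a warning to that effect), so this is best suited to modestly sized data. Since the question mentions windowing over multiple columns, the same pattern extends by ordering on several columns. A minimal sketch, assuming hypothetical columns col1 and col2:

from pyspark.sql import Window
import pyspark.sql.functions as f

# Rows with identical (col1, col2) values receive the same
# dense_rank, which serves as the group id.
w = Window.orderBy('col1', 'col2')
df = df.withColumn('group', f.dense_rank().over(w))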
Alternatively, you can assign a row_number to the distinct values of col and join the result back to the original dataframe.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._  // needed in an application; spark-shell imports this automatically

val data = Seq(
  (1, "a"),
  (2, "a"),
  (3, "b"),
  (4, "c"),
  (5, "c")
).toDF("id", "col")

// Number the distinct values of col, then join that mapping back
val df2 = data.select("col").distinct()
  .withColumn("group", row_number().over(Window.orderBy("col")))

val result = data.join(df2, Seq("col"), "left")
  .drop("col")
The code is in Scala but can easily be adapted to PySpark.
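For reference, a minimal PySpark sketch of the same idea, assuming an existing SparkSession named spark:

from pyspark.sql import Window
import pyspark.sql.functions as f

# spark is assumed to be an existing SparkSession
data = spark.createDataFrame(
    [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "c")],
    ["id", "col"],
)

# Number the distinct values of col, then join the mapping back
df2 = (data.select("col").distinct()
       .withColumn("group", f.row_number().over(Window.orderBy("col"))))

result = data.join(df2, ["col"], "left").drop("col")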
Hope this helps.
Source: https://stackoverflow.com/questions/50233518/create-a-group-id-over-a-window-in-spark-dataframe