Create a group id over a window in Spark Dataframe

Submitted by 一曲冷凌霜 on 2019-11-30 20:36:14

Question


I have a dataframe where I want to give id's in each Window partition. For example I have

id | col |
1  |  a  |
2  |  a  |
3  |  b  |
4  |  c  |
5  |  c  |

So I want (based on grouping with column col)

id | group |
1  |  1    |
2  |  1    |
3  |  2    |
4  |  3    |
5  |  3    |

I want to use a window function, but I cannot find any way to assign an id to each window. I need something like:

w = Window().partitionBy('col')
df = df.withColumn("group", id().over(w)) 

Is there any way to achieve something like that? (I cannot simply use col as the group id because I am interested in creating a window over multiple columns.)


Answer 1:


Simply using the built-in dense_rank function over a Window should give you your desired result:

from pyspark.sql import window as W
import pyspark.sql.functions as f

# dense_rank gives rows with the same value of col the same rank, with no gaps between ranks
df.select('id', f.dense_rank().over(W.Window.orderBy('col')).alias('group')).show(truncate=False)

which should give you

+---+-----+
|id |group|
+---+-----+
|1  |1    |
|2  |1    |
|3  |2    |
|4  |3    |
|5  |3    |
+---+-----+
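Since the question mentions grouping over multiple columns, the same idea extends by ordering the window over all of them. A minimal sketch, assuming hypothetical columns col1 and col2 together define a group:

from pyspark.sql import Window
import pyspark.sql.functions as f

# rows sharing the same (col1, col2) combination receive the same group id;
# note that an un-partitioned window pulls all rows into a single partition,
# so Spark logs a performance warning on large data
w = Window.orderBy('col1', 'col2')
df = df.withColumn('group', f.dense_rank().over(w))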



Answer 2:


You can assign a row_number to the distinct values of col and then join back to the original dataframe.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val data = Seq(
  (1, "a"),
  (2, "a"),
  (3, "b"),
  (4, "c"),
  (5, "c")
).toDF("id", "col")

// give each distinct value of col its own group id
val df2 = data.select("col").distinct()
  .withColumn("group", row_number().over(Window.orderBy("col")))

// join the group ids back onto the original rows
val result = data.join(df2, Seq("col"), "left")
  .drop("col")

The code is in Scala but can easily be adapted to PySpark.
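A rough PySpark equivalent might look like the following sketch (the SparkSession setup is an assumption, not part of the original answer):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "c")],
    ["id", "col"],
)

# give each distinct value of col its own group id ...
df2 = (data.select("col").distinct()
           .withColumn("group", f.row_number().over(Window.orderBy("col"))))

# ... and join it back onto the original rows
result = data.join(df2, ["col"], "left").drop("col")
result.show()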

Hope this helps



Source: https://stackoverflow.com/questions/50233518/create-a-group-id-over-a-window-in-spark-dataframe
