Create a group id over a window in Spark Dataframe

Submitted by 一曲冷凌霜 on 2019-11-30 20:36:14

Question


I have a dataframe where I want to give id's in each Window partition. For example I have

id | col |
1  |  a  |
2  |  a  |
3  |  b  |
4  |  c  |
5  |  c  |

So I want (based on grouping with column col)

id | group |
1  |  1    |
2  |  1    |
3  |  2    |
4  |  3    |
5  |  3    |

I want to use a window function, but I cannot find any way to assign an id to each window. I need something like:

w = Window().partitionBy('col')
df = df.withColumn("group", id().over(w)) 

Is there any way to achieve something like that? (I cannot simply use col as the group id because I am interested in creating a window over multiple columns.)


Answer 1:


Simply using the built-in dense_rank function over a Window should give you your desired result:

from pyspark.sql import window as W
import pyspark.sql.functions as f

# dense_rank gives rows with the same value of col the same rank, with no gaps between ranks
df.select('id', f.dense_rank().over(W.Window.orderBy('col')).alias('group')).show(truncate=False)

which should give you

+---+-----+
|id |group|
+---+-----+
|1  |1    |
|2  |1    |
|3  |2    |
|4  |3    |
|5  |3    |
+---+-----+
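Since the question mentions grouping over multiple columns, the same idea extends by ordering the window over all of them. A minimal sketch, assuming hypothetical columns col1 and col2 together define a group:

from pyspark.sql import Window
import pyspark.sql.functions as f

# rows sharing the same (col1, col2) combination receive the same group id;
# note that an un-partitioned window pulls all rows into a single partition,
# so Spark logs a performance warning on large data
w = Window.orderBy('col1', 'col2')
df = df.withColumn('group', f.dense_rank().over(w))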



Answer 2:


You can assign a row_number to the distinct values of col and then join back to the original dataframe.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val data = Seq(
  (1, "a"),
  (2, "a"),
  (3, "b"),
  (4, "c"),
  (5, "c")
).toDF("id", "col")

// give each distinct value of col its own group id
val df2 = data.select("col").distinct()
  .withColumn("group", row_number().over(Window.orderBy("col")))

// join the group ids back onto the original rows
val result = data.join(df2, Seq("col"), "left")
  .drop("col")

The code is in Scala but can easily be adapted to PySpark.
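A rough PySpark equivalent might look like the following sketch (the SparkSession setup is an assumption, not part of the original answer):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "c")],
    ["id", "col"],
)

# give each distinct value of col its own group id ...
df2 = (data.select("col").distinct()
           .withColumn("group", f.row_number().over(Window.orderBy("col"))))

# ... and join it back onto the original rows
result = data.join(df2, ["col"], "left").drop("col")
result.show()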

Hope this helps



Source: https://stackoverflow.com/questions/50233518/create-a-group-id-over-a-window-in-spark-dataframe
