How to filter data using window functions in Spark

Submitted by 给你一囗甜甜゛ on 2019-12-21 05:40:48

Question:


I have the following data:

rowid uid time code
   1  1      5    a
   2  1      6    b
   3  1      7    c
   4  2      8    a
   5  2      9    c
   6  2      9    c
   7  2     10    c
   8  2     11    a
   9  2     12    c

Now I want to filter the data so that rows 6 and 7 are removed: for a particular uid, I want to keep just one row with the value 'c' in code per consecutive run.

So the expected data should be:

rowid uid time code
   1  1      5    a
   2  1      6    b
   3  1      7    c
   4  2      8    a
   5  2      9    c
   8  2     11    a
   9  2     12    c

I'm using a window function, something like this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val window = Window.partitionBy("uid").orderBy("time")
val change = (lag("code", 1).over(window) <=> "c").cast("int")

This helps us flag each row whose previous row (within the same uid) has code 'c'. Can I extend this to filter out rows and get the expected data?
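For intuition, here is a minimal plain-Scala sketch (no Spark needed) of what the `change` column computes on the sample data; the `Row` case class is hypothetical, introduced only for this example:

```scala
// Hypothetical Row class mirroring the sample data's columns.
case class Row(rowid: Int, uid: Int, time: Int, code: String)

val data = Seq(
  Row(1, 1, 5, "a"), Row(2, 1, 6, "b"), Row(3, 1, 7, "c"),
  Row(4, 2, 8, "a"), Row(5, 2, 9, "c"), Row(6, 2, 9, "c"),
  Row(7, 2, 10, "c"), Row(8, 2, 11, "a"), Row(9, 2, 12, "c")
)

// lag("code", 1) over (partition by uid order by time), null-safely
// compared with "c" (the <=> operator), then cast to int.
val change: Map[Int, Int] = data
  .groupBy(_.uid)                       // partition by uid
  .values
  .flatMap { part =>
    val sorted = part.sortBy(_.time)    // order by time
    // lag by 1: None for the first row of each partition (Spark's null)
    val lagged = None +: sorted.map(r => Option(r.code)).init
    sorted.zip(lagged).map { case (r, prev) =>
      r.rowid -> (if (prev.contains("c")) 1 else 0)
    }
  }
  .toMap
```

On the sample data this flags exactly rowids 6 and 7, the two rows to be dropped.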


Answer 1:


If you want to remove only the rows where code = "c" (except the first one for each uid), you could try the following:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val window = Window.partitionBy("uid", "code").orderBy("time")
val result = df
  .withColumn("rank", row_number().over(window))
  .where(
    (col("code") =!= "c") ||  // keep every non-"c" row
    col("rank") === 1         // and only the first "c" row per uid
  )
  .drop("rank")
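The rank-based filter can be mimicked on the sample data with plain Scala collections (no Spark needed); the `Row` case class is hypothetical, introduced only for this sketch:

```scala
// Plain-Scala sketch of the rank-based filter: partition by (uid, code),
// order by time, and keep a "c" row only if it is first in its partition.
case class Row(rowid: Int, uid: Int, time: Int, code: String)

val data = Seq(
  Row(1, 1, 5, "a"), Row(2, 1, 6, "b"), Row(3, 1, 7, "c"),
  Row(4, 2, 8, "a"), Row(5, 2, 9, "c"), Row(6, 2, 9, "c"),
  Row(7, 2, 10, "c"), Row(8, 2, 11, "a"), Row(9, 2, 12, "c")
)

val kept: Seq[Int] = data
  .groupBy(r => (r.uid, r.code))   // Window.partitionBy("uid", "code")
  .values
  .flatMap { part =>
    part.sortBy(_.time)            // orderBy("time")
      .zipWithIndex                // index 0 == row_number() 1
      .collect { case (r, i) if r.code != "c" || i == 0 => r }
  }
  .toSeq
  .sortBy(_.rowid)
  .map(_.rowid)
```

Note that this keeps only one "c" row per uid in total, so on the sample data rowid 9 is dropped as well (kept rowids: 1, 2, 3, 4, 5, 8), which does not match the asker's expected output; hence the edit that follows.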

Edit based on new information:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, lit}

val window = Window.partitionBy("uid").orderBy("time")
val result = df
  .withColumn("lagValue", coalesce(lag(col("code"), 1).over(window), lit("")))
  .where(
    (col("code") =!= "c") ||      // keep every non-"c" row
    (col("lagValue") =!= "c")     // and "c" rows not preceded by another "c"
  )
  .drop("lagValue")
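The lag-based filter can likewise be sketched with plain Scala collections (no Spark needed, hypothetical `Row` class): within each uid ordered by time, a "c" row is dropped only when the previous row's code is also "c".

```scala
// Plain-Scala sketch of the lag-based filter from the edit above.
case class Row(rowid: Int, uid: Int, time: Int, code: String)

val data = Seq(
  Row(1, 1, 5, "a"), Row(2, 1, 6, "b"), Row(3, 1, 7, "c"),
  Row(4, 2, 8, "a"), Row(5, 2, 9, "c"), Row(6, 2, 9, "c"),
  Row(7, 2, 10, "c"), Row(8, 2, 11, "a"), Row(9, 2, 12, "c")
)

val kept: Seq[Int] = data
  .groupBy(_.uid)                               // Window.partitionBy("uid")
  .values
  .flatMap { part =>
    val sorted = part.sortBy(_.time)            // orderBy("time")
    val lagged = "" +: sorted.map(_.code).init  // lag("code", 1), "" default
    sorted.zip(lagged).collect {
      case (r, prev) if r.code != "c" || prev != "c" => r
    }
  }
  .toSeq
  .sortBy(_.rowid)
  .map(_.rowid)
```

On the sample data this keeps rowids 1, 2, 3, 4, 5, 8, 9 — exactly the expected output, with rows 6 and 7 removed.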


Source: https://stackoverflow.com/questions/38872592/how-to-filter-data-using-window-functions-in-spark
