How to do window function inside a when?

此生再无相见时 提交于 2020-01-06 06:03:33

问题


I have a dataframe where I am trying to do window function on a array column.

The logic is as follows: Group by (or window partition) the id and filtered columns. Calculate the max score of the rows where the types column is null, otherwise take the score of that row. When score is not equal to the max score of the group add "NA" to the column type.

val data = spark.createDataFrame(Seq(
  (1, "shirt for women", Seq("shirt", "women"), 19.1, "ST"),
  (1, "shirt for women", Seq("shirt", "women"), 10.1, null),
  (1, "shirt for women", Seq("shirt", "women"), 12.1, null),
  (0, "shirt group women", Seq("group", "women"), 15.1, null),
  (0, "shirt group women", Seq("group", "women"), 12.1, null),
  (3, "shirt nmn women", Seq("shirt", "women"), 16.1, "ST"),
  (3, "shirt were women", Seq("shirt", "women"), 13.1, "ST")
)).toDF("id", "raw", "filtered", "score", "types")

  +---+-----------------+--------------+-----+-----+
|id |raw              |filtered      |score|types|
+---+-----------------+--------------+-----+-----+
|1  |shirt for women  |[shirt, women]|19.1 |ST   |
|1  |shirt for women  |[shirt, women]|10.1 |null |
|1  |shirt for women  |[shirt, women]|12.1 |null |
|0  |shirt group women|[group, women]|15.1 |null |
|0  |shirt group women|[group, women]|12.1 |null |
|3  |shirt nmn women  |[shirt, women]|16.1 |ST   |
|3  |shirt were women |[shirt, women]|13.1 |ST   |
+---+-----------------+--------------+-----+-----+

Expected output:

   +---+------------------+--------------+-----+----+
    |id |raw              |filtered      |score|types|
    +---+-----------------+--------------+-----+----+
    |1  |shirt for women  |[shirt, women]|19.1 |ST  |
    |1  |shirt for women  |[shirt, women]|10.1 |NA  |
    |1  |shirt for women  |[shirt, women]|12.1 |null|
    |0  |shirt group women[women, group] |15.1 |null|
    |0  |shirt group women|[women, group]|12.1 |NA  |
    |3  |shirt nmn women  |[shirt, women]|16.1 |ST  |
    |3  |shirt were women |[shirt, women]|13.1 |ST  |
    +---+-----------------+--------------+-----+----+

I tried:

data.withColumn("max_score",
      when(col("types").isNull,
        max("score")
          .over(Window.partitionBy("id", "filtered")))
        .otherwise($"score"))
      .withColumn("type_temp",
        when(col("score") =!= col("max_score"),
          addReasonsUDF(col("type"),
            lit("NA")))
          .otherwise(col("type")))
      .drop("types", "max_score")
      .withColumnRenamed("type_temp", "types")

But it is not working. This gives me:

+---+-----------------+--------------+-----+---------+-----+
|id |raw              |filtered      |score|max_score|types|
+---+-----------------+--------------+-----+---------+-----+
|1  |shirt for women  |[shirt, women]|19.1 |19.1     |ST   |
|1  |shirt women      |[shirt, women]|10.1 |19.1     |NA   |
|1  |shirt of women   |[shirt, women]|12.1 |19.1     |NA   |
|0  |shirt group women|[group, women]|15.1 |15.1     |null |
|0  |shirt will women |[group, women]|12.1 |15.1     |NA   |
|3  |shirt nmn women  |[shirt, women]|16.1 |16.1     |ST   |
|3  |shirt were women |[shirt, women]|13.1 |13.1     |ST   |
+---+-----------------+--------------+-----+---------+-----+

Can some one tell me what I am doing wrong here ?

I think something wrong with my window function, when I tried partition against id and raw its not working as well. So both string and array partitions are not working.

dataSet.withColumn("max_score",
      when(col("types").isNull,
        max("score").over(Window.partitionBy("id", "raw")))
        .otherwise($"score")).show(false)

+---+-----------------+--------------+-----+-----+---------+
|id |raw              |filtered      |score|types|max_score|
+---+-----------------+--------------+-----+-----+---------+
|3  |shirt nmn women  |[shirt, women]|16.1 |ST   |16.1     |
|0  |shirt group women|[group, women]|15.1 |null |15.1     |
|0  |shirt group women|[group, women]|12.1 |null |15.1     |
|3  |shirt were women |[shirt, women]|13.1 |ST   |13.1     |
|1  |shirt for women  |[shirt, women]|19.1 |ST   |19.1     |
|1  |shirt for women  |[shirt, women]|10.1 |null |19.1     |
|1  |shirt for women  |[shirt, women]|12.1 |null |19.1     |
+---+-----------------+--------------+-----+-----+---------+

回答1:


You do not need to have a window function inside a when expression, instead this can be done in two stages. First add the max score as a new column to each group based on the id, filtered and the types columns. This will give a max score specifically for the groups where types are null. A window expression is preferred for this since the other columns should be kept.

After this, a check with when/otherwise can be done to change the value of the types column as long as types have a null value and the max score is not equal to score.

In code:

val w = Window.partitionBy("id", "filtered", "types")
val df = data.withColumn("max_score", max($"score").over(w))
  .withColumn("types", when($"types".isNull && $"score" =!= $"max_score", "NA").otherwise($"types"))
  .drop("max_score")


来源:https://stackoverflow.com/questions/56179275/how-to-do-window-function-inside-a-when

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!