Window Functions partitionBy over a list


Question


I have a DataFrame tableDS. In Scala I am able to remove duplicates over the primary keys using the following:

import org.apache.spark.sql.expressions.Window.partitionBy
import org.apache.spark.sql.functions.row_number

// Number rows per primary-key combination, newest (by mergeCol) first, and keep only the first
val window = partitionBy(primaryKeySeq.map(k => tableDS(k)): _*).orderBy(tableDS(mergeCol).desc)
tableDS.withColumn("rn", row_number().over(window)).where($"rn" === 1).drop("rn")

I need to write the same thing in Python, where primaryKeySeq is a Python list. I tried the first statement like this:

from pyspark.sql.window import Window
import pyspark.sql.functions as func

window = Window.partitionBy(primaryKeySeq).orderBy(tableDS[bdtVersionColumnName].desc())
tableDS1 = tableDS.withColumn("rn", func.rank().over(window))

This does not give me the correct result.
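The difference shows up on a tiny made-up DataFrame: rank() assigns tied rows the same value, while row_number() is always unique within a partition (the data and column names below are illustrative only, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, row_number

spark = SparkSession.builder.getOrCreate()

# Two rows tie on the ordering column "version" for the same key.
df = spark.createDataFrame(
    [("k1", 5, "a"), ("k1", 5, "b"), ("k1", 3, "c")],
    ["key", "version", "payload"],
)

w = Window.partitionBy("key").orderBy(df["version"].desc())
df.select("*", rank().over(w).alias("rk"), row_number().over(w).alias("rn")).show()
# Both tied rows get rk = 1, so a rank-based filter on 1 keeps duplicates;
# row_number() yields rn = 1, 2, 3, so filtering rn == 1 keeps exactly one row.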


Answer 1:


It got solved. The fix is to use row_number instead of rank (row_number assigns a unique sequential number within each partition, while rank gives tied rows the same value) and then keep only the rows numbered 1. Here is the final conversion:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

window = Window.partitionBy(primaryKeySeq).orderBy(tableDS[bdtVersionColumnName].desc())
tableDS1 = tableDS.withColumn("rn", row_number().over(window)).where(col("rn") == 1).drop("rn")
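
For reference, here is a self-contained sketch of the full pattern (the sample data and the values of primaryKeySeq and bdtVersionColumnName are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Sample table with a two-column primary key and a version column.
tableDS = spark.createDataFrame(
    [
        ("a", 1, 1, "old"),
        ("a", 1, 2, "new"),   # same key as the row above; higher version should win
        ("b", 2, 1, "only"),
    ],
    ["pk1", "pk2", "version", "value"],
)

primaryKeySeq = ["pk1", "pk2"]      # Window.partitionBy accepts a Python list directly
bdtVersionColumnName = "version"

window = Window.partitionBy(primaryKeySeq).orderBy(col(bdtVersionColumnName).desc())
tableDS1 = (
    tableDS
    .withColumn("rn", row_number().over(window))
    .where(col("rn") == 1)
    .drop("rn")
)
tableDS1.show()  # one row per (pk1, pk2), keeping the highest version

Note that, unlike the Scala version, there is no need to map the key names to Column objects: partitionBy resolves plain column names (or a list of them) against the DataFrame the window is applied to.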


Source: https://stackoverflow.com/questions/52677157/window-functions-partitionby-over-a-list
