Question
I have created two data frames. The df_stg_raw data frame holds duplicate records. The df_qualify data frame holds meta-information, such as which columns the partition and order are based on. I want to remove the duplicate records using the window function available in PySpark.
df_stg_raw
==================================================
ACCNT_ID  NAME  SomeRandomID  TABLE_NM
==================================================
1         A     123           TblA
1         A     123           TblA
2         B     124           TblA
2         B     124           TblA
3         C     125           TblA
3         C     125           TblA
df_qualify
==================================================
TABLE_NM  QUALIFY_TXT  ODER_BY
==================================================
TblA      'ACCNT_ID'   'SomeRandomID'
To do this, I am cross joining the two data frames. However, I am not able to dynamically pass the partitionBy and orderBy columns, which are stored in the df_qualify data frame.
from pyspark.sql import Window
from pyspark.sql.functions import row_number

df_final = df_stg_raw.join(df_qualify, df_stg_raw.TABLE_NM == df_qualify.TABLE_NM, how = "cross") \
    .withColumn("row_num", row_number().over(Window.partitionBy('**dynamic_part_col**').orderBy('**dynamic_order_col**')))
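One possible direction, offered as a sketch rather than a confirmed answer: Window.partitionBy and orderBy take column names (or Column expressions) that are fixed when the query is built, so they cannot read a different column name from each joined row. If df_qualify is small, its rows can be collected to the driver and one window built per TABLE_NM. The helper names parse_columns and dedupe_with_metadata are hypothetical, and the code assumes the metadata values may carry literal quotes (as in 'ACCNT_ID') and may list several comma-separated columns.

```python
def parse_columns(raw):
    """Turn a metadata string such as "'ACCNT_ID'" or "'A', 'B'" into a list of column names."""
    # Assumption: names are comma-separated and may be wrapped in single or double quotes.
    return [part.strip().strip("'\"") for part in raw.split(",") if part.strip()]


def dedupe_with_metadata(df_stg_raw, df_qualify):
    """Keep one row per partition, using the partition/order columns listed in df_qualify."""
    # Imported here so parse_columns above stays usable without a Spark installation.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    result = None
    # Assumption: df_qualify is small enough to collect to the driver.
    for meta in df_qualify.collect():
        window = (
            Window.partitionBy(*parse_columns(meta["QUALIFY_TXT"]))
            .orderBy(*parse_columns(meta["ODER_BY"]))
        )
        deduped = (
            df_stg_raw.filter(F.col("TABLE_NM") == meta["TABLE_NM"])
            .withColumn("row_num", F.row_number().over(window))
            .filter(F.col("row_num") == 1)  # keep the first row of each partition
            .drop("row_num")
        )
        result = deduped if result is None else result.unionByName(deduped)
    return result  # None if df_qualify has no rows
```

With the sample data above, df_final = dedupe_with_metadata(df_stg_raw, df_qualify) would partition TblA by ACCNT_ID and order by SomeRandomID. Rows whose TABLE_NM has no entry in df_qualify are dropped by this sketch, and no join is needed.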
Source: https://stackoverflow.com/questions/58746514/dynamic-window-partitionby-column-in-pyspark