Question
After joining two DataFrames (each of which has its own ID column), I have some duplicates (repeated IDs from both sources). I want to drop all rows that are duplicated on either ID, i.e. not retain even a single occurrence of a duplicate.
I can group by the first ID, do a count, and filter for count == 1; repeat that for the second ID; then inner-join these outputs back to the original joined DataFrame (sketched below). But this feels a bit long-winded.
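For reference, here is a minimal self-contained sketch of that longer approach; the sample data and the ID column names id_a and id_b are made up for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical joined DataFrame with an ID column from each source
df = spark.createDataFrame(
    [(1, 10), (1, 11), (2, 12), (3, 13), (4, 13)],
    ['id_a', 'id_b'])

# IDs that occur exactly once on each side
uniq_a = df.groupBy('id_a').count().filter(F.col('count') == 1).select('id_a')
uniq_b = df.groupBy('id_b').count().filter(F.col('count') == 1).select('id_b')

# keep only rows that are unique on both IDs; here only (2, 12) survives
df.join(uniq_a, 'id_a').join(uniq_b, 'id_b').show()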
Is there a simpler method like dropDuplicates() but where none of the duplicates are left behind?
I see pandas has an option to keep none of the duplicates: df.drop_duplicates(subset=['A', 'C'], keep=False).
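For comparison, a tiny pandas demonstration of that keep=False behaviour (made-up data):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'C': ['x', 'x', 'y']})

# keep=False drops every row that is duplicated on the subset,
# so both (1, 'x') rows disappear and only (2, 'y') remains
print(df.drop_duplicates(subset=['A', 'C'], keep=False))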
Answer 1:
dropDuplicates()
According to the official documentation:
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
To drop duplicates considering all columns:
df.dropDuplicates()
To drop duplicates considering a single column (note that subset expects a list of column names):
df.dropDuplicates(subset=[col_name])
For multiple columns:
df.dropDuplicates(subset=[col_name1, col_name2])
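Note that all of these retain one occurrence of each duplicate, which is exactly what the question wants to avoid; a quick sketch (sample data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['id', 'val'])

# one of the two id=1 rows (not guaranteed which) is still kept
df.dropDuplicates(subset=['id']).show()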
Edit (in response to the comment):
from pyspark.sql.functions import col, lit, sum

# count how often each value of the criteria column occurs
df = df.groupBy(criteria_col).agg(sum(lit(1)).alias('freq'))
df = df.filter(col('freq') == 1)
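To recover the full rows rather than just the frequencies, the filtered IDs can be joined back; a minimal end-to-end sketch, with sample data and names of my own:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, sum

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['id', 'val'])

# frequency of each ID, as in the edit above
freq = df.groupBy('id').agg(sum(lit(1)).alias('freq'))

# keep only rows whose ID occurs exactly once; both id=1 rows are dropped
df.join(freq.filter(col('freq') == 1).select('id'), 'id').show()

Applying the same filter once per ID column reproduces pandas' keep=False for the two-ID case in the question.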
Source: https://stackoverflow.com/questions/51322699/pyspark-retain-only-distinct-drop-all-duplicates