Question
After joining two DataFrames (each of which has its own ID column), I have some duplicates (repeated IDs from both sources). I want to drop all rows that are duplicated on either ID, i.e. not retain even a single occurrence of a duplicate.
I can group by the first ID, do a count, and filter for count == 1; repeat that for the second ID; then inner-join these outputs back to the original joined DataFrame (sketched below). But this feels a bit long-winded.
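For reference, here is a minimal self-contained sketch of that longer approach; the sample data and the ID column names id_a and id_b are made up for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical joined DataFrame with an ID column from each source
df = spark.createDataFrame(
    [(1, 10), (1, 11), (2, 12), (3, 13), (4, 13)],
    ['id_a', 'id_b'])

# IDs that occur exactly once on each side
uniq_a = df.groupBy('id_a').count().filter(F.col('count') == 1).select('id_a')
uniq_b = df.groupBy('id_b').count().filter(F.col('count') == 1).select('id_b')

# keep only rows that are unique on both IDs; here only (2, 12) survives
df.join(uniq_a, 'id_a').join(uniq_b, 'id_b').show()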
Is there a simpler method like dropDuplicates() but where none of the duplicates are left behind?
I see pandas has an option to keep none of the duplicates: df.drop_duplicates(subset=['A', 'C'], keep=False).
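For comparison, a tiny pandas demonstration of that keep=False behaviour (made-up data):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'C': ['x', 'x', 'y']})

# keep=False drops every row that is duplicated on the subset,
# so both (1, 'x') rows disappear and only (2, 'y') remains
print(df.drop_duplicates(subset=['A', 'C'], keep=False))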
Answer 1:
dropDuplicates()
According to the official documentation:
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
To drop duplicates considering all columns:
df.dropDuplicates()
To drop duplicates considering a single column (note that subset expects a list of column names):
df.dropDuplicates(subset=[col_name])
For multiple columns:
df.dropDuplicates(subset=[col_name1, col_name2])
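Note that all of these retain one occurrence of each duplicate, which is exactly what the question wants to avoid; a quick sketch (sample data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['id', 'val'])

# one of the two id=1 rows (not guaranteed which) is still kept
df.dropDuplicates(subset=['id']).show()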
Edit (in response to the comment):
from pyspark.sql.functions import col, lit, sum

# count how often each value of the criteria column occurs
df = df.groupBy(criteria_col).agg(sum(lit(1)).alias('freq'))
df = df.filter(col('freq') == 1)
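To recover the full rows rather than just the frequencies, the filtered IDs can be joined back; a minimal end-to-end sketch, with sample data and names of my own:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, sum

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['id', 'val'])

# frequency of each ID, as in the edit above
freq = df.groupBy('id').agg(sum(lit(1)).alias('freq'))

# keep only rows whose ID occurs exactly once; both id=1 rows are dropped
df.join(freq.filter(col('freq') == 1).select('id'), 'id').show()

Applying the same filter once per ID column reproduces pandas' keep=False for the two-ID case in the question.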
Source: https://stackoverflow.com/questions/51322699/pyspark-retain-only-distinct-drop-all-duplicates