Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes?
Pandas:
df.sort_value
I just did something perhaps similar to what you guys need, using drop_duplicates pyspark.
Situation is this. I have 2 dataframes (coming from 2 files) which are exactly same except 2 columns file_date(file date extracted from the file name) and data_date(row date stamp). Annoyingly I have rows which are with same data_date (and all other column cells too) but different file_date as they get replicated on every newcomming file with an addition of one new row.
I needed to capture all rows from the new file, plus that one row left over from the previous file. That row is not in the new file. Remaining columns on the right from data_date are same between the two files for the same data_date.
file_1_20190122 - df1
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-22|2019-01-16| <- One row we want to keep where file_date 22nd
| AGGH|2019-01-22|2019-01-17|
| AGGH|2019-01-22|2019-01-18|
| AGGH|2019-01-22|2019-01-19|
| AGGH|2019-01-22|2019-01-20|
| AGGH|2019-01-22|2019-01-21|
| AGGH|2019-01-22|2019-01-22|
file_2_20190123 - df2
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-23|2019-01-17| \/ ALL rows we want to keep where file_date 23rd
| AGGH|2019-01-23|2019-01-18|
| AGGH|2019-01-23|2019-01-19|
| AGGH|2019-01-23|2019-01-20|
| AGGH|2019-01-23|2019-01-21|
| AGGH|2019-01-23|2019-01-22|
| AGGH|2019-01-23|2019-01-23|
This will require us to sort and concat df's, then deduplicate them on all columns but one. Let me walk you through.
union_df = df1.union(df2) \
.sort(['station_code', 'data_date'], ascending=[True, True])
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-22|2019-01-16| <- keep
| AGGH|2019-01-23|2019-01-17| <- keep
| AGGH|2019-01-22|2019-01-17| x- drop
| AGGH|2019-01-22|2019-01-18| x- drop
| AGGH|2019-01-23|2019-01-18| <- keep
| AGGH|2019-01-23|2019-01-19| <- keep
| AGGH|2019-01-22|2019-01-19| x- drop
| AGGH|2019-01-23|2019-01-20| <- keep
| AGGH|2019-01-22|2019-01-20| x- drop
| AGGH|2019-01-22|2019-01-21| x- drop
| AGGH|2019-01-23|2019-01-21| <- keep
| AGGH|2019-01-23|2019-01-22| <- keep
| AGGH|2019-01-22|2019-01-22| x- drop
| AGGH|2019-01-23|2019-01-23| <- keep
Here we drop already sorted duped rows excluding keys ['file_date', 'data_date'].
nonduped_union_df = union_df \
.drop_duplicates(['station_code', 'data_date', 'time_zone',
'latitude', 'longitude', 'elevation',
'highest_temperature', 'lowest_temperature',
'highest_temperature_10_year_normal',
'another_50_columns'])
And the result holds ONE row with earliest date from DF1 which is not in DF2 and ALL rows from DF2
nonduped_union_df.select(['station_code', 'file_date', 'data_date',
'highest_temperature', 'lowest_temperature']) \
.sort(['station_code', 'data_date'], ascending=[True, True]) \
.show(30)
+------------+----------+----------+-------------------+------------------+
|station_code| file_date| data_date|highest_temperature|lowest_temperature|
+------------+----------+----------+-------------------+------------------+
| AGGH|2019-01-22|2019-01-16| 90| 77| <- df1 22nd
| AGGH|2019-01-23|2019-01-17| 90| 77| \/- df2 23rd
| AGGH|2019-01-23|2019-01-18| 91| 75|
| AGGH|2019-01-23|2019-01-19| 88| 77|
| AGGH|2019-01-23|2019-01-20| 88| 77|
| AGGH|2019-01-23|2019-01-21| 88| 77|
| AGGH|2019-01-23|2019-01-22| 90| 75|
| AGGH|2019-01-23|2019-01-23| 90| 75|
| CWCA|2019-01-22|2019-01-15| 23| -2|
| CWCA|2019-01-23|2019-01-16| 7| -8|
| CWCA|2019-01-23|2019-01-17| 28| -6|
| CWCA|2019-01-23|2019-01-18| 0| -13|
| CWCA|2019-01-23|2019-01-19| 25| -15|
| CWCA|2019-01-23|2019-01-20| -4| -18|
| CWCA|2019-01-23|2019-01-21| 27| -6|
| CWCA|2019-01-22|2019-01-22| 30| 17|
| CWCA|2019-01-23|2019-01-22| 30| 13|
| CWCO|2019-01-22|2019-01-15| 34| 29|
| CWCO|2019-01-23|2019-01-16| 33| 13|
| CWCO|2019-01-22|2019-01-16| 33| 13|
| CWCO|2019-01-22|2019-01-17| 23| 7|
| CWCO|2019-01-23|2019-01-17| 23| 7|
+------------+----------+----------+-------------------+------------------+
only showing top 30 rows
It may not be best suitable answer for this case, but it's the one worked for me.
Let me know, if stuck somewhere.
BTW - if anyone can tell me how to select all columns in a df, except one without listing them in a list - I will be very thankful.
Regards G