Question: In pandas, when dropping duplicates you can specify which columns to consider and which occurrence to keep. Is there an equivalent in Spark DataFrames?
Pandas: sort first, then drop duplicates, e.g. df.sort_values(...).drop_duplicates(keep='first').
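A minimal sketch of that pandas pattern, with hypothetical "key" and "value" columns:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [3, 1, 2]})

# Sort so the row you want to keep comes first within each key,
# then drop duplicates on "key", keeping the first occurrence.
deduped = df.sort_values("value").drop_duplicates(subset=["key"], keep="first")
print(deduped)  # keeps ("a", 1) and ("b", 2)
```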
Solution 1: add a new incremental row-number column, group on all the columns you are interested in (every column except the row-number column), and keep only the row with the minimum row number in each group. A sketch follows below.
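A PySpark sketch of solution 1, assuming a hypothetical "key" column as the dedup key and using monotonically_increasing_id as the incremental column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: duplicates on "key" that differ in "value".
df = spark.createDataFrame([("a", 3), ("a", 1), ("b", 2)], ["key", "value"])

# Add an incremental row-number column. The ids are increasing by
# partition and row order, so the minimum roughly means "earliest row".
with_id = df.withColumn("row_num", F.monotonically_increasing_id())

# Group on the columns of interest (everything except row_num), take the
# minimum row number per group, then join back to recover the full rows.
min_ids = with_id.groupBy("key").agg(F.min("row_num").alias("row_num"))
deduped = with_id.join(min_ids, on=["key", "row_num"]).drop("row_num")
```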
Solution 2: turn the DataFrame into an RDD (df.rdd), group the RDD on one or more key columns, then run a lambda over each group that drops rows however you want and returns only the row you are interested in; see the sketch below.
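A sketch of solution 2, again with hypothetical "key"/"value" columns; here the lambda keeps the row with the smallest "value" per key, but any selection logic would work:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 3), ("a", 1), ("b", 2)], ["key", "value"])

# Key each row by the dedup column, reduce each group with a lambda
# that keeps whichever row you want, then rebuild a DataFrame from
# the surviving rows.
deduped_rdd = (
    df.rdd
      .map(lambda row: (row["key"], row))
      .reduceByKey(lambda a, b: a if a["value"] <= b["value"] else b)
      .map(lambda kv: kv[1])
)
deduped = spark.createDataFrame(deduped_rdd, df.schema)
```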
One of my friends (Sameer) mentioned that my old solution below didn't work for him: use the dropDuplicates method, which by default keeps the first occurrence.
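For reference, the old approach looks like this; note that in a distributed DataFrame Spark does not guarantee which duplicate row survives, which may explain why it did not reliably keep the first occurrence:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 3), ("a", 1), ("b", 2)], ["key", "value"])

# With no arguments dropDuplicates compares all columns; passing a
# subset dedupes on just those columns. Which of the duplicate rows
# survives is not guaranteed, since row order is nondeterministic
# across partitions.
deduped = df.dropDuplicates(["key"])
```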