spark dataframe drop duplicates and keep first

孤街浪徒 2020-12-05 02:47

Question: In pandas, when dropping duplicates you can specify which columns to compare and which occurrence to keep. Is there an equivalent for Spark DataFrames?

Pandas:

df.sort_value         


        
5 answers
  •  臣服心动
    2020-12-05 03:01

    Solution 1: add a row-number (incremental) column, group on all the columns you want to deduplicate by, and keep only the row with the minimum row number in each group. You can group on every column except the row-number column itself.
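
A minimal PySpark sketch of solution 1, assuming "first" is defined by an explicit ordering column. It is a variant rather than a literal transcription: instead of a separate group-and-min step, it numbers the rows inside each group with row_number() over a window and keeps row 1. The function name keep_first_per_key and its parameters (key_cols, order_col, descending) are hypothetical names chosen for illustration.

```python
def keep_first_per_key(df, key_cols, order_col, descending=False):
    """Keep one row per combination of key_cols: the first under order_col.

    Hypothetical helper for illustration; not part of the Spark API.
    """
    # Imported here so the sketch can be loaded without Spark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    ordering = F.col(order_col).desc() if descending else F.col(order_col).asc()
    window = Window.partitionBy(*key_cols).orderBy(ordering)

    return (
        df.withColumn("_row_num", F.row_number().over(window))  # number rows per group
        .filter(F.col("_row_num") == 1)  # keep only the first row of each group
        .drop("_row_num")  # remove the helper column again
    )
```

For example, keep_first_per_key(df, ["id"], "updated_at", descending=True) would keep the most recent row per id, where id and updated_at are placeholder column names. If two rows tie on order_col, which of them gets row number 1 is not deterministic.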

    Solution 2: convert the DataFrame to an RDD (df.rdd), group the RDD by one or more (or all) keys, then run a lambda function over each group that drops the unwanted rows and returns only the row you are interested in.
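
A rough sketch of solution 2, assuming the row to keep is the one with the smallest value in some ordering column. The names pick_first and dedupe_via_rdd, and the parameters key_cols and order_col, are hypothetical and used only for illustration.

```python
def pick_first(rows, order_col):
    """Return the row with the smallest value in order_col.

    Works on any iterable of rows that support row[order_col],
    such as pyspark.sql.Row objects or plain dicts.
    """
    return min(rows, key=lambda row: row[order_col])


def dedupe_via_rdd(df, key_cols, order_col):
    """Group df.rdd by key_cols and keep one row per group (hypothetical helper)."""
    # Pair every row with a tuple of its key values.
    keyed = df.rdd.map(lambda row: (tuple(row[c] for c in key_cols), row))
    # Collect the rows of each key, then keep only the chosen one.
    kept = keyed.groupByKey().map(lambda kv: pick_first(kv[1], order_col))
    # Rebuild a DataFrame with the original schema.
    return kept.toDF(df.schema)
```

groupByKey mirrors the "group, then run a function on the group" wording above, but it gathers every row of a key in one place. On large groups, reduceByKey with a function that returns the preferred of two rows should give the same result with less shuffling.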

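    Old solution (a friend of mine, Sameer, reported that it did not work for him): use the dropDuplicates method, which was expected to keep the first occurrence by default. Spark does not guarantee which of the duplicate rows survives, which may explain why it failed.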
