Keep only duplicates from a DataFrame regarding some field

青春惊慌失措 2020-12-09 13:55

I have this Spark DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Ho         


        
3 Answers
  •  夕颜 (original poster)
    2020-12-09 14:29

    Here is a way to do it without a Window function.

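    A DataFrame with only the surplus copies, i.e. every row beyond the first for each (ID, ID2, Number) combination. Which copy counts as the first is up to Spark, so the exact rows can vary: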

    df.exceptAll(df.drop_duplicates(['ID', 'ID2', 'Number'])).show()
    # +---+---+------+------------+------------+
    # | ID|ID2|Number|Opening_Hour|Closing_Hour|
    # +---+---+------+------------+------------+
    # |ALT|QWA|     2|    08:53:00|    23:24:00|
    # |ALT|QWA|     6|    08:55:00|    23:26:00|
    # +---+---+------+------------+------------+
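
To see why this returns one row per duplicated key here, a plain-Python sketch of the same logic may help. This is not Spark code: `extra_copies` is a made-up helper, the rows are the four shown in the answer's second output, and keeping the first row in list order is an assumption (Spark does not promise which copy `drop_duplicates` keeps).

```python
# Plain-Python model of df.exceptAll(df.drop_duplicates(keys)); not Spark code.
# Rows copied from the answer's output: (ID, ID2, Number, Opening_Hour, Closing_Hour).
rows = [
    ("ALT", "QWA", 2, "08:54:00", "23:25:00"),
    ("ALT", "QWA", 2, "08:53:00", "23:24:00"),
    ("ALT", "QWA", 6, "08:59:00", "23:30:00"),
    ("ALT", "QWA", 6, "08:55:00", "23:26:00"),
]


def extra_copies(rows, key_len=3):
    """Return every row after the first one seen for each key (hypothetical helper)."""
    seen = set()
    extras = []
    for row in rows:
        key = row[:key_len]  # the first key_len fields act as the duplicate key
        if key in seen:
            extras.append(row)  # surplus copy of a key that was already seen
        else:
            seen.add(key)  # first occurrence: the copy this model treats as "kept"
    return extras


result = extra_copies(rows)
for row in result:
    print(row)
```

With this ordering the sketch yields the 08:53:00 and 08:55:00 rows, matching the output above; a different ordering would yield the other copy of each pair.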
    

    A DataFrame with all duplicated rows, using a left_anti join against the keys that occur exactly once:

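    # Keys (ID, ID2, Number) that appear exactly once, i.e. the non-duplicates
    unique_keys = (df.groupBy('ID', 'ID2', 'Number')
                     .count().where('count = 1').drop('count'))
    # left_anti keeps only the rows of df whose key is NOT among the unique keys
    df.join(unique_keys,
            on=['ID', 'ID2', 'Number'],
            how='left_anti').show()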
    # +---+---+------+------------+------------+
    # | ID|ID2|Number|Opening_Hour|Closing_Hour|
    # +---+---+------+------------+------------+
    # |ALT|QWA|     2|    08:54:00|    23:25:00|
    # |ALT|QWA|     2|    08:53:00|    23:24:00|
    # |ALT|QWA|     6|    08:59:00|    23:30:00|
    # |ALT|QWA|     6|    08:55:00|    23:26:00|
    # +---+---+------+------------+------------+
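
The same idea can be sketched in plain Python to make the filtering step explicit: count how often each key occurs, then keep every row whose key occurs more than once. This is only a model of the semantics, not Spark code. `all_duplicates` is a made-up helper, and the last row is a hypothetical non-duplicate added for illustration (it is not in the question's data).

```python
# Plain-Python model of the groupBy/count + left_anti join; not Spark code.
from collections import Counter

rows = [
    ("ALT", "QWA", 2, "08:54:00", "23:25:00"),
    ("ALT", "QWA", 2, "08:53:00", "23:24:00"),
    ("ALT", "QWA", 6, "08:59:00", "23:30:00"),
    ("ALT", "QWA", 6, "08:55:00", "23:26:00"),
    ("ALT", "QWA", 9, "09:00:00", "23:00:00"),  # hypothetical unique row
]


def all_duplicates(rows, key_len=3):
    """Keep every row whose key occurs more than once (hypothetical helper)."""
    counts = Counter(row[:key_len] for row in rows)  # occurrences per key
    return [row for row in rows if counts[row[:key_len]] > 1]


dupes = all_duplicates(rows)
for row in dupes:
    print(row)
```

Unlike the exceptAll approach, this keeps both members of each duplicated pair and drops only the row whose key is unique.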
    
