Keep only duplicates from a DataFrame regarding some field

青春惊慌失措 2020-12-09 13:55

I have this spark DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Ho         


        
3 answers
  •  甜味超标
    2020-12-09 14:19

    One way to do this is to use a pyspark.sql.Window to add a column that counts the number of rows sharing each row's ("ID", "ID2", "Number") combination, then select only the rows where that count is greater than 1.

    import pyspark.sql.functions as f
    from pyspark.sql import Window
    
    w = Window.partitionBy('ID', 'ID2', 'Number')
    df.select('*', f.count('ID').over(w).alias('dupeCount'))\
        .where('dupeCount > 1')\
        .drop('dupeCount')\
        .show()
    #+---+---+------+----+------------+------------+
    #| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
    #+---+---+------+----+------------+------------+
    #|ALT|QWA|     2|null|    08:54:00|    23:25:00|
    #|ALT|QWA|     2|null|    08:53:00|    23:24:00|
    #|ALT|QWA|     6|null|    08:59:00|    23:30:00|
    #|ALT|QWA|     6|null|    08:55:00|    23:26:00|
    #+---+---+------+----+------------+------------+
    

    I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).
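
    To make the logic concrete without a Spark session, here is a minimal plain-Python sketch of the same "count per group, keep groups larger than 1" idea. It is only an illustration, not the Spark implementation: the helper names key and keep_duplicates are made up, the first four rows are copied from the output above, and the fifth row (Number 9) is a hypothetical non-duplicate added to show something being filtered out. Unlike f.count('ID'), which skips null IDs, this sketch counts every row.

```python
from collections import Counter

# First four rows come from the output shown above; the last one is a
# hypothetical non-duplicate row added purely for illustration.
rows = [
    ("ALT", "QWA", 2, None, "08:54:00", "23:25:00"),
    ("ALT", "QWA", 2, None, "08:53:00", "23:24:00"),
    ("ALT", "QWA", 6, None, "08:59:00", "23:30:00"),
    ("ALT", "QWA", 6, None, "08:55:00", "23:26:00"),
    ("ALT", "QWA", 9, None, "09:00:00", "23:31:00"),  # hypothetical
]


def key(row):
    """Grouping key: the (ID, ID2, Number) fields of a row."""
    return row[:3]


def keep_duplicates(rows):
    """Keep only rows whose (ID, ID2, Number) key occurs more than once."""
    # Roughly what count(...).over(Window.partitionBy(...)) provides:
    # the size of the group each row belongs to.
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] > 1]


duplicates = keep_duplicates(rows)
for r in duplicates:
    print(r)
```

    The hypothetical Number 9 row is dropped because its group has only one member; the four rows for Number 2 and Number 6 are kept.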

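    If you wanted to get only one row per ("ID", "ID2", "Number") combination, you could do so by using another Window to order the rows.

    For example, below I add another column for the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees one row per grouping. Note that w2 orders by the same columns it partitions on, so all rows within a group tie and which one gets row number 1 is arbitrary; order by another column (for example Opening_Hour) if you need a specific row.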

    w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')
    df.select(
            '*',
            f.count('ID').over(w).alias('dupeCount'),
            f.row_number().over(w2).alias('rowNum')
        )\
        .where('(dupeCount > 1) AND (rowNum = 1)')\
        .drop('dupeCount', 'rowNum')\
        .show()
    #+---+---+------+----+------------+------------+
    #| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
    #+---+---+------+----+------------+------------+
    #|ALT|QWA|     2|null|    08:54:00|    23:25:00|
    #|ALT|QWA|     6|null|    08:59:00|    23:30:00|
    #+---+---+------+----+------------+------------+
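
    The "one row per duplicated group" step can be sketched in plain Python as well. Again this is an illustration under assumptions, not Spark code: first_of_each_duplicate_group is a made-up helper, and the fifth row is hypothetical. Here "first" simply means first in list order, which is fixed; in Spark, rows that tie in the window ordering have no guaranteed order.

```python
from collections import Counter

# First four rows come from the output shown earlier; the last one is a
# hypothetical non-duplicate row added purely for illustration.
rows = [
    ("ALT", "QWA", 2, None, "08:54:00", "23:25:00"),
    ("ALT", "QWA", 2, None, "08:53:00", "23:24:00"),
    ("ALT", "QWA", 6, None, "08:59:00", "23:30:00"),
    ("ALT", "QWA", 6, None, "08:55:00", "23:26:00"),
    ("ALT", "QWA", 9, None, "09:00:00", "23:31:00"),  # hypothetical
]


def first_of_each_duplicate_group(rows):
    """Return the first row (in list order) of every group with > 1 rows."""
    counts = Counter(r[:3] for r in rows)  # group size per (ID, ID2, Number)
    seen = set()
    result = []
    for r in rows:
        k = r[:3]
        # Analogue of (dupeCount > 1) AND (rowNum = 1)
        if counts[k] > 1 and k not in seen:
            seen.add(k)
            result.append(r)
    return result


one_per_group = first_of_each_duplicate_group(rows)
for r in one_per_group:
    print(r)
```

    With this input order, the sketch keeps one row for Number 2 and one for Number 6, and drops the hypothetical Number 9 row.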
    
