Detecting almost duplicate rows

花落未央 2021-01-14 14:08

Let's say I have a table that has dates and a value for each date (plus other columns). I can find the rows that have the same value on the same day by using



        
2 Answers
  •  青春惊慌失措
    2021-01-14 15:01

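    Use NumPy's upper-triangle indices (np.triu_indices) to enumerate every pair of rows exactly once, then keep the rows that belong to a pair whose DAY values differ by less than 2 and whose VALUE values differ by less than 10: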

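    import numpy as np

    # df is the DataFrame from the question
    day = df.DAY.values
    val = df.VALUE.values

    # every unordered pair of row positions (i < j); k=1 skips the diagonal
    i, j = np.triu_indices(len(df), k=1)
    c1 = np.abs(day[i] - day[j]) < 2    # days less than 2 apart
    c2 = np.abs(val[i] - val[j]) < 10   # values less than 10 apart

    c = c1 & c2
    # rows that take part in at least one matching pair
    df.iloc[np.unique(np.append(i[c], j[c]))]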
    
        DAY  MTH   YYY  VALUE    NAME
    1    22    9  2016   43.0    John
    6    24    8  2016   10.0    Mike
    7    24    9  2016   10.0    Mike
    8    24   10  2016   10.0    Mike
    9    24   11  2016   10.0    Mike
    10   13    9  2016  170.0  Kathie
    11   13   10  2016  170.0  Kathie
    13    8    9  2016   16.0    Gina
    14    9   10  2016   16.0    Gina
    15    8   11  2016   16.0    Gina
    17   21   11  2016   45.0    Ross
    18   23    9  2016   50.0   Shari
    19   23   10  2016   50.0   Shari
    20   23   11  2016   50.0   Shari
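
For anyone who wants to try this without the original table, here is a minimal self-contained sketch of the same pairwise idea. The function name near_duplicates, its day_tol and val_tol parameters, and the four-row sample frame are all made up for illustration; only the DAY and VALUE column names come from the answer above.

```python
import numpy as np
import pandas as pd


def near_duplicates(df, day_tol=2, val_tol=10):
    """Return rows that have at least one partner row with a close DAY and VALUE.

    Hypothetical helper: the name and the tolerance parameters are assumptions,
    not part of the original answer.
    """
    day = df["DAY"].to_numpy()
    val = df["VALUE"].to_numpy()

    # All unordered pairs of row positions (i < j); k=1 skips the diagonal,
    # so a row is never compared with itself.
    i, j = np.triu_indices(len(df), k=1)

    close = (np.abs(day[i] - day[j]) < day_tol) & (np.abs(val[i] - val[j]) < val_tol)

    # Collect both members of every matching pair, without repeats.
    keep = np.unique(np.concatenate([i[close], j[close]]))
    return df.iloc[keep]


# Hypothetical sample data, not taken from the question.
sample = pd.DataFrame({
    "DAY": [1, 2, 15, 28],
    "VALUE": [100.0, 105.0, 100.0, 500.0],
    "NAME": ["a", "b", "c", "d"],
})

result = near_duplicates(sample)
print(result)
```

Note that this builds n(n-1)/2 pairs, so memory grows quadratically: about 50 million pairs for 10,000 rows. Like the answer above, it compares only the DAY column, so rows from different months can still be paired.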
    
