How to find duplicates based on multiple columns in a rolling window in pandas?

南旧 2020-12-22 07:23

Sample Data

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"me


        
2 Answers
  • 2020-12-22 08:10

    First, you could split the data into rolling 120-second blocks. You could then evaluate each block in one of these ways:

    Using duplicated: df = df[df.duplicated(subset=['val1','val2','val3'], keep=False)]

    Or groupby: df.groupby(['val1','val2','val3']).count()

    Or even a SQL DISTINCT: https://www.w3schools.com/sql/sql_distinct.asp

    Please post what you have tried. The above methods work for string, float, datetime and integer data types.
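
As a rough illustration of the duplicated and groupby suggestions, here is a minimal sketch. It assumes fixed 120-second blocks (built with dt.floor) rather than a true rolling window, and the column name "block" and the four sample rows are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "merchant": ["merchantE", "merchantE", "merchantF", "merchantF"],
    "amount": [10, 10, 10, 10],
    "time": pd.to_datetime([
        "2019-02-13T11:00:30", "2019-02-13T11:01:30",
        "2019-02-13T11:03:00", "2019-02-13T11:05:20",
    ]),
})

# Assumption: fixed 120-second blocks, not a sliding window.
# "block" is a hypothetical helper column holding the start of each block.
df["block"] = df["time"].dt.floor("120s")

# keep=False marks every row that shares merchant, amount and block with another row.
df["is_dup"] = df.duplicated(subset=["merchant", "amount", "block"], keep=False)

# groupby variant: number of rows per (merchant, amount, block) combination.
counts = df.groupby(["merchant", "amount", "block"]).size()

print(df)
print(counts)
```

Fixed blocks are simpler than a sliding window but can miss pairs that straddle a block boundary: transactions at 11:01:30 and 11:02:00 are 30 seconds apart yet fall into different blocks.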

  • 2020-12-22 08:14

    I made it work, but not with rolling windows, because rolling does not support string columns. Support for this has been reported and requested on the pandas repository as well.

    My solution snippet to the problem:

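        import pandas as pd
        df = pd.DataFrame()
        for data in records:  # assumed: the parsed JSON lines, each 'time' already a pd.Timestamp
            tx = data['transaction']
            if len(df.index) > 0:
                # previously kept rows with the same merchant and amount
                res = df.loc[(df.merchant == tx['merchant']) & (df.amount == tx['amount'])]
                # skip this transaction if any of them lies within 120 seconds of it
                if ((tx['time'] - res['time']).dt.total_seconds().abs() <= 120).any():
                    continue
            df1 = pd.DataFrame([tx], index=[tx['time']])
            # DataFrame.append was removed in pandas 2.0, so concat is used instead
            df = df1 if df.empty else pd.concat([df, df1])
        print(df)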
    

    Sample data:

    {"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
    {"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
    {"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
    {"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
    {"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
    {"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
    {"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}
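
The snippet above subtracts data['transaction']['time'] from a datetime column, so each time string has to be converted to a timestamp first. One way this might be done is sketched below; the names parse_transactions, sample and records are hypothetical and not part of the original answer.

```python
import json
import pandas as pd

def parse_transactions(lines):
    """Parse JSON lines into dicts whose 'time' field is a pandas Timestamp."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        data = json.loads(line)
        # The trailing "Z" makes the resulting Timestamp timezone-aware (UTC).
        data["transaction"]["time"] = pd.Timestamp(data["transaction"]["time"])
        records.append(data)
    return records

# Two lines taken from the sample data above.
sample = [
    '{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}',
    '{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}',
]
records = parse_transactions(sample)
```

Each element of records can then play the role of data in the loop.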
    

    Output:

                          merchant  amount                time
    2019-02-13 10:00:00  merchantA      20 2019-02-13 10:00:00
    2019-02-13 11:00:01  merchantB      90 2019-02-13 11:00:01
    2019-02-13 11:00:10  merchantC      10 2019-02-13 11:00:10
    2019-02-13 11:00:20  merchantD      10 2019-02-13 11:00:20
    2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30
    2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00
    2019-02-13 11:05:20  merchantF      10 2019-02-13 11:05:20
    