pandas - Merge nearly duplicate rows based on column value

小鲜肉 2020-11-27 11:34

I have a pandas DataFrame with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a

3 Answers
  •  野性不改
    2020-11-27 12:18

    I was using some code that I didn't think was optimal and eventually found jezrael's answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:

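    # Collect every Use_Case per Name; dicts keep first-seen key order.
    cmnts = {}
    for _, row in df.iterrows():
        # Falsy values (e.g. an empty string) are recorded as 'n/a'.
        cmnts.setdefault(row['Name'], []).append(row['Use_Case'] or 'n/a')

    # Keep the first row per Name, then overwrite Use_Case with the joined text.
    # The dict order matches the order of the rows that drop_duplicates keeps.
    df.drop_duplicates('Name', inplace=True)
    df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]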
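
    For reference, here is a minimal sketch of what a groupby-based version of the same merge might look like. It is not jezrael's exact code, which isn't quoted here; the toy data and the replace('', 'n/a') step are assumptions chosen to mirror the loop above.

```python
import pandas as pd

# Hypothetical toy data; the column names follow the loop above.
df = pd.DataFrame({
    'Name': ['alpha', 'alpha', 'beta', 'alpha', 'beta'],
    'Use_Case': ['read', '', 'write', 'audit', 'sync'],
})

# Turn empty strings into 'n/a' (as the loop does), then join the
# Use_Case values of each Name. sort=False keeps the names in
# first-seen order, the same order drop_duplicates would leave.
merged = (
    df.assign(Use_Case=df['Use_Case'].replace('', 'n/a'))
      .groupby('Name', sort=False)['Use_Case']
      .agg('; '.join)
      .reset_index()
)

print(merged)
```

    Unlike the loop, this leaves df untouched and returns a new frame containing only the Name and Use_Case columns.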
    

    According to my 100-run timeit test, the iterate-and-replace method is roughly seven times faster than the groupby method.

    import pandas as pd
    from my_stuff import time_something  # private timing helper, described after the results
    
    df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                       'b': list(range(1, 10001))})
    
    runs = 100
    
    interim_dict = 'txt = {}\n' \
                   'for i, row in df.iterrows():\n' \
                   '    try:\n' \
                   "        txt[row['a']].append(row['b'])\n\n" \
                   '    except KeyError:\n' \
                   "        txt[row['a']] = []\n" \
                   "df.drop_duplicates('a', inplace=True)\n" \
                   "df['b'] = ['; '.join(v) for v in txt.values()]"
    
    grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"
    
    print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
    print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))
    

    yields:

    Interim Dict
      Total: 59.1164s
      Avg: 591163748.5887ns
    
    Group By
      Total: 430.6203s
      Avg: 4306203366.1827ns
    

    where time_something is a function which times a snippet with timeit and returns the result in the above format.
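
    Since time_something lives in a private module, here is a rough sketch of what such a helper could look like. The signature is inferred from the calls above; the body is an assumption built on the standard timeit.timeit, not the original implementation.

```python
import timeit


def time_something(stmt, runs, beg_string='', glbls=None):
    """Hypothetical stand-in for the private helper used above.

    Runs `stmt` `runs` times with timeit and returns a short report
    holding the total time in seconds and the average per run in ns.
    """
    # timeit.timeit returns the total wall time, in seconds, for all runs.
    total = timeit.timeit(stmt, number=runs, globals=glbls)
    avg_ns = total / runs * 1e9
    return f'{beg_string}\n  Total: {total:.4f}s\n  Avg: {avg_ns:.4f}ns\n'


# Usage: time a trivial statement 100 times.
print(time_something('sum(range(1000))', 100, beg_string='Sum'))
```

    Passing glbls=globals() is what lets the timed string see names such as df and pd that are defined in the calling script.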
