Combine duplicated columns within a DataFrame

礼貌的吻别 2020-12-08 10:19

If I have a DataFrame in which several columns share the same name, is there a way to combine those columns using some function (e.g. sum)?

3 answers
  •  感情败类
    2020-12-08 11:02

    pandas >= 0.20: df.groupby(level=0, axis=1)

    You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.

    # Setup
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
    df
    
        A   A   B   B   B
    0  44  47   0   3   3
    1  39   9  19  21  36
    2  23   6  24  24  12
    3   1  38  39  23  46
    4  24  17  37  25  13
    

    df.groupby(level=0, axis=1).sum()
    
        A    B
    0  91    6
    1  48   76
    2  29   60
    3  39  108
    4  41   75
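
As far as I know, recent pandas releases (2.1 and later) deprecate the axis=1 argument to groupby, so the call above may warn or fail there. A minimal sketch of a workaround, assuming a transpose is acceptable for your data: group the transposed frame by its index labels and transpose back. The values are hard-coded copies of the frame above so the result does not depend on the random number generator; the name `summed` is just for illustration.

```python
import pandas as pd

# Same values as the example frame above, hard-coded for reproducibility.
df = pd.DataFrame(
    [[44, 47, 0, 3, 3],
     [39, 9, 19, 21, 36],
     [23, 6, 24, 24, 12],
     [1, 38, 39, 23, 46],
     [24, 17, 37, 25, 13]],
    columns=list("AABBB"),
)

# Transpose so the duplicated column labels become index labels,
# group the rows by that label, aggregate, then transpose back.
summed = df.T.groupby(level=0).sum().T
print(summed)
```

Swapping .sum() for another aggregation such as .mean() or .max() should work the same way. Note that transposing a frame with mixed column dtypes can upcast everything to object, so this is best suited to homogeneous data.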
    

    Handling MultiIndex columns

    Another case to consider is a DataFrame whose columns are a MultiIndex. Consider:

    df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
    df
      one         two    
        A   A   B   B   B
    0  44  47   0   3   3
    1  39   9  19  21  36
    2  23   6  24  24  12
    3   1  38  39  23  46
    4  24  17  37  25  13
    

    To aggregate by the lower level only, combining columns across the upper levels, use

    df.groupby(level=1, axis=1).sum()
    
        A    B
    0  91    6
    1  48   76
    2  29   60
    3  39  108
    4  41   75
    

    or, to aggregate within each upper-level group (columns are combined only when both levels match), use

    df.groupby(level=[0, 1], axis=1).sum()
    
      one     two
        A   B   B
    0  91   0   6
    1  48  19  57
    2  29  24  36
    3  39  39  69
    4  41  37  38
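
The two MultiIndex groupings can also be sketched without axis=1, again by transposing first. This assumes the same data as above, hard-coded here; `by_lower` and `by_both` are illustrative names, not anything from pandas.

```python
import pandas as pd

# Rebuild the example frame with two-level column labels.
columns = pd.MultiIndex.from_arrays(
    [["one"] * 3 + ["two"] * 2, list("AABBB")]
)
df = pd.DataFrame(
    [[44, 47, 0, 3, 3],
     [39, 9, 19, 21, 36],
     [23, 6, 24, 24, 12],
     [1, 38, 39, 23, 46],
     [24, 17, 37, 25, 13]],
    columns=columns,
)

# Group by the lower level only: every A is combined, every B is combined.
by_lower = df.T.groupby(level=1).sum().T

# Group by both levels: only columns with identical (upper, lower) labels merge.
by_both = df.T.groupby(level=[0, 1]).sum().T

print(by_lower)
print(by_both)
```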
    

    Alternate Interpretation: Dropping Duplicate Columns

    If you came here looking for a way to simply drop duplicate columns (without performing any aggregation), use Index.duplicated. The examples below use the original df with flat column labels (before the MultiIndex was assigned):

    df.loc[:,~df.columns.duplicated()]
    
        A   B
    0  44   0
    1  39  19
    2  23  24
    3   1  39
    4  24  37
    

    Or, to keep the last occurrence of each column instead, specify keep='last' (the default is 'first'):

    df.loc[:,~df.columns.duplicated(keep='last')]
    
        A   B
    0  47   3
    1   9  36
    2   6  12
    3  38  46
    4  17  13
    

    The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.
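
To check that the two approaches agree, here is a small sketch comparing the duplicated() mask with first() and last() on the grouped frame. It uses the transposed form of the groupby so it does not rely on axis=1, and the data is a hard-coded copy of the flat-column example; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[44, 47, 0, 3, 3],
     [39, 9, 19, 21, 36],
     [23, 6, 24, 24, 12],
     [1, 38, 39, 23, 46],
     [24, 17, 37, 25, 13]],
    columns=list("AABBB"),
)

# Boolean-mask approach: keep the first or last column for each label.
keep_first = df.loc[:, ~df.columns.duplicated()]
keep_last = df.loc[:, ~df.columns.duplicated(keep="last")]

# groupby approach: first/last value within each group of same-named columns.
first_via_groupby = df.T.groupby(level=0).first().T
last_via_groupby = df.T.groupby(level=0).last().T

print(keep_first.values.tolist() == first_via_groupby.values.tolist())
print(keep_last.values.tolist() == last_via_groupby.values.tolist())
```

One caveat: first() and last() return the first or last non-null value in each group, so if the columns contain missing values the groupby result can differ from the plain mask.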
