Aggregation in pandas

攒了一身酷 2020-11-22 08:16
  1. How to perform aggregation with pandas?
  2. No DataFrame after aggregation! What happened?
  3. How to aggregate mainly string columns (to lists, tuples, strings with separator)?
2 answers
  •  生来不讨喜
    2020-11-22 08:41

    Question 1

    How to perform aggregation with pandas?

    See the expanded aggregation documentation.

    Aggregating functions reduce the dimension of the returned object: the output Series/DataFrame has fewer rows than the original, or at most the same number. Some common aggregating functions are tabulated below:

    Function    Description
    mean()      Compute mean of groups
    sum()       Compute sum of group values
    size()      Compute group sizes
    count()     Compute count of group
    std()       Standard deviation of groups
    var()       Compute variance of groups
    sem()       Standard error of the mean of groups
    describe()  Generates descriptive statistics
    first()     Compute first of group values
    last()      Compute last of group values
    nth()       Take nth value, or a subset if n is a list
    min()       Compute min of group values
    max()       Compute max of group values
    
    import numpy as np
    import pandas as pd

    # fix the seed so the random sample data is reproducible
    np.random.seed(123)
    
    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one'],
                       'C' : np.random.randint(5, size=6),
                       'D' : np.random.randint(5, size=6),
                       'E' : np.random.randint(5, size=6)})
    print (df)
         A      B  C  D  E
    0  foo    one  2  3  0
    1  foo    two  4  1  0
    2  bar  three  2  1  1
    3  foo    two  1  0  3
    4  bar    two  3  1  4
    5  foo    one  2  1  0
    

    Aggregate a selected column with a Cython-implemented function:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
    

    Without a column selection, the aggregate function is applied to every column that is not named in groupby (here, every column except A and B):

    df2 = df.groupby(['A', 'B'], as_index=False).sum()
    print (df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3
    

    You can also select only some columns for aggregation by passing a list after groupby:

    # note the double brackets: a list of columns, not a tuple
    df3 = df.groupby(['A', 'B'], as_index=False)[['C','D']].sum()
    print (df3)
         A      B  C  D
    0  bar  three  2  1
    1  bar    two  3  1
    2  foo    one  4  4
    3  foo    two  5  1
    

    The same results can be obtained with DataFrameGroupBy.agg:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
    
    df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
    print (df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3
    

    To apply multiple functions to one column, use a list of tuples, each holding the new column name and the aggregate function:

    df4 = (df.groupby(['A', 'B'])['C']
             .agg([('average','mean'),('total','sum')])
             .reset_index())
    print (df4)
         A      B  average  total
    0  bar  three      2.0      2
    1  bar    two      3.0      3
    2  foo    one      2.0      4
    3  foo    two      2.5      5
    

    To apply multiple functions to every column, pass the same list of tuples without selecting a column:

    df5 = (df.groupby(['A', 'B'])
             .agg([('average','mean'),('total','sum')]))
    
    print (df5)
                    C             D             E      
              average total average total average total
    A   B                                              
    bar three     2.0     2     1.0     1     1.0     1
        two       3.0     3     1.0     1     4.0     4
    foo one       2.0     4     2.0     4     0.0     0
        two       2.5     5     0.5     1     1.5     3
    
        
    

    The result then has a MultiIndex in the columns:

    print (df5.columns)
    MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
               
    

    To flatten the MultiIndex into ordinary column names, use map with join:

    df5.columns = df5.columns.map('_'.join)
    df5 = df5.reset_index()
    print (df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3
    

    Another solution is to pass a list of aggregate functions, flatten the MultiIndex, and then use str.replace to change the column names:

    df5 = df.groupby(['A', 'B']).agg(['mean','sum'])
        
    df5.columns = (df5.columns.map('_'.join)
                      .str.replace('sum','total')
                      .str.replace('mean','average'))
    df5 = df5.reset_index()
    print (df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3
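
    One caveat with str.replace is that it also rewrites any original column name that happens to contain 'sum' or 'mean'. A minimal sketch of an alternative, assuming a small hypothetical frame and an arbitrary `labels` mapping (neither is from the answer above), builds the flat names from the (column, function) tuples directly:

```python
import pandas as pd

# Small hypothetical frame, used only for illustration.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

agg = df.groupby(['A', 'B']).agg(['mean', 'sum'])

# Hypothetical mapping from function name to the suffix we want.
labels = {'mean': 'average', 'sum': 'total'}

# Each column label is a (column, function) tuple; only the function
# part is translated, so the original column name is left untouched.
agg.columns = [f'{col}_{labels[func]}' for col, func in agg.columns]
flat = agg.reset_index()
print(flat)
```

    Here the column names come out as C_average, C_total, D_average and D_total, the same naming scheme as in the examples above.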
    

    To specify a separate aggregate function for each column, pass a dictionary:

    df6 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D':'mean'})
             .rename(columns={'C':'C_total', 'D':'D_average'}))
    print (df6)
         A      B  C_total  D_average
    0  bar  three        2        1.0
    1  bar    two        3        1.0
    2  foo    one        4        2.0
    3  foo    two        5        0.5
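
    In pandas 0.25 and later, named aggregation can do the aggregation and the renaming in one step. A minimal sketch, assuming a small hypothetical frame rather than the one above:

```python
import pandas as pd

# Small hypothetical frame, used only for illustration.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

# Each keyword is the output column name; the value is a
# (source column, aggregate function) pair.
named = (df.groupby(['A', 'B'], as_index=False)
           .agg(C_total=('C', 'sum'), D_average=('D', 'mean')))
print(named)
```

    The keyword names must be valid Python identifiers, so the dictionary plus rename approach is still useful for output names that contain spaces.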
    

    You can pass a custom function too:

    def func(x):
        return x.iat[0] + x.iat[-1]
    
    df7 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D': func})
             .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
    print (df7)
         A      B  C_total  D_sum_first_and_last
    0  bar  three        2                     2
    1  bar    two        3                     2
    2  foo    one        4                     4
    3  foo    two        5                     1
    

    Question 2

    No DataFrame after aggregation! What happened?

    Aggregation by 2 or more columns:

    df1 = df.groupby(['A', 'B'])['C'].sum()
    print (df1)
    A    B    
    bar  three    2
         two      3
    foo  one      4
         two      5
    Name: C, dtype: int32
    

    First check the index and the type of the pandas object:

    print (df1.index)
    MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
               labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
               names=['A', 'B'])
    
    print (type(df1))
    <class 'pandas.core.series.Series'>
    

    There are two ways to turn a MultiIndex Series into a DataFrame with columns:

    • add the parameter as_index=False:
    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
    
    • use Series.reset_index:
    df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
    print (df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
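
    The Series appears because a single column was selected with single brackets. As a small sketch on a hypothetical frame (not the one above), selecting with a list keeps the result a DataFrame, although the group keys still sit in the index until reset_index is called:

```python
import pandas as pd

# Small hypothetical frame, used only for illustration.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two'],
                   'C': [1, 2, 3]})

s = df.groupby(['A', 'B'])['C'].sum()     # single brackets -> Series
f = df.groupby(['A', 'B'])[['C']].sum()   # list of columns -> DataFrame

print(type(s).__name__)   # Series
print(type(f).__name__)   # DataFrame

# The keys are still a MultiIndex; reset_index moves them to columns.
flat = f.reset_index()
print(flat)
```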
    

    If you group by one column:

    df2 = df.groupby('A')['C'].sum()
    print (df2)
    A
    bar    5
    foo    9
    Name: C, dtype: int32
    

    ... you get a Series with an Index:

    print (df2.index)
    Index(['bar', 'foo'], dtype='object', name='A')
    
    print (type(df2))
    <class 'pandas.core.series.Series'>
    

    The solution is the same as for the MultiIndex Series:

    df2 = df.groupby('A', as_index=False)['C'].sum()
    print (df2)
         A  C
    0  bar  5
    1  foo  9
    
    df2 = df.groupby('A')['C'].sum().reset_index()
    print (df2)
         A  C
    0  bar  5
    1  foo  9
    

    Question 3

    How to aggregate mainly string columns (to lists, tuples, strings with separator)?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                       'D' : [1,2,3,2,3,1,2]})
    print (df)
       A      B      C  D
    0  a    one  three  1
    1  c    two    one  2
    2  b  three    two  3
    3  b    two    two  2
    4  a    two  three  3
    5  c    one    two  1
    6  b  three    one  2
    

    Instead of an aggregation function, you can pass list, tuple, or set to convert the column:

    df1 = df.groupby('A')['B'].agg(list).reset_index()
    print (df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]
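
    The same pattern should work for tuple and set. A short sketch on a hypothetical frame (not the one above); note that set drops duplicates and has no guaranteed order:

```python
import pandas as pd

# Small hypothetical frame, used only for illustration.
df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'b'],
                   'B': ['one', 'three', 'two', 'two', 'three']})

# tuple keeps every value in its original within-group order
tuples = df.groupby('A')['B'].agg(tuple)

# set keeps only the distinct values of each group
sets = df.groupby('A')['B'].agg(set)

print(tuples)
print(sets)
```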
    

    An alternative is to use GroupBy.apply:

    df1 = df.groupby('A')['B'].apply(list).reset_index()
    print (df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]
    

    To convert to a string with a separator, use .join; this works only if the column holds strings:

    df2 = df.groupby('A')['B'].agg(','.join).reset_index()
    print (df2)
       A                B
    0  a          one,two
    1  b  three,two,three
    2  c          two,one
    

    For a numeric column, use a lambda function with astype to convert the values to strings first:

    df3 = (df.groupby('A')['D']
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print (df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1
    

    Another solution is to convert to strings before groupby:

    df3 = (df.assign(D = df['D'].astype(str))
             .groupby('A')['D']
             .agg(','.join).reset_index())
    print (df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1
    

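    To convert all columns, omit the column selection after groupby. Column D is missing from the result because of the automatic exclusion of 'nuisance' columns, which here means that all numeric columns are dropped. This silent dropping belongs to older pandas releases; recent releases (pandas 2.0 and later) are expected to raise an error instead.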

    df4 = df.groupby('A').agg(','.join).reset_index()
    print (df4)
       A                B            C
    0  a          one,two  three,three
    1  b  three,two,three  two,two,one
    2  c          two,one      one,two
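
    Because the silent dropping of non-string columns depends on the pandas version, a more explicit sketch is to select the string columns by name. The frame below is hypothetical, not the one above:

```python
import pandas as pd

# Small hypothetical frame with two string columns and one numeric column.
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['one', 'two', 'three'],
                   'C': ['x', 'y', 'z'],
                   'D': [1, 2, 3]})

# Select only the string columns explicitly, so ','.join never
# sees the numeric column D.
joined = df.groupby('A')[['B', 'C']].agg(','.join).reset_index()
print(joined)
```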
    

    To keep every column, convert all of them to strings inside the aggregation:

    df5 = (df.groupby('A')
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print (df5)
       A                B            C      D
    0  a          one,two  three,three    1,3
    1  b  three,two,three  two,two,one  3,2,2
    2  c          two,one      one,two    2,1
    

    Question 4

    How to aggregate counts?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                       'D' : [np.nan,2,3,2,3,np.nan,2]})
    print (df)
       A      B      C    D
    0  a    one  three  NaN
    1  c    two    NaN  2.0
    2  b  three    NaN  3.0
    3  b    two    two  2.0
    4  a    two  three  3.0
    5  c    one    two  NaN
    6  b  three    one  2.0
    

    Use GroupBy.size to get the size of each group:

    df1 = df.groupby('A').size().reset_index(name='COUNT')
    print (df1)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2
    

    GroupBy.count excludes missing values:

    df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
    print (df2)
       A  COUNT
    0  a      2
    1  b      2
    2  c      1
    

    The same function can be applied to multiple columns to count the non-missing values in each:

    df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
    print (df3)
       A  B_COUNT  C_COUNT  D_COUNT
    0  a        2        2        1
    1  b        3        2        3
    2  c        2        1        1
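
    To see the difference between size and count side by side, both can be requested in one agg call. A minimal sketch on a hypothetical frame (the `missing` column name is made up for illustration):

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; column C contains missing values.
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'b'],
                   'C': ['x', np.nan, 'y', np.nan, np.nan]})

# size counts all rows of the group, count only the non-missing ones
counts = df.groupby('A')['C'].agg(['size', 'count'])

# the difference is the number of missing values per group
counts['missing'] = counts['size'] - counts['count']
print(counts)
```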
    

    The related function Series.value_counts returns an object containing counts of unique values in descending order, so that the first element is the most frequently occurring one. NaN values are excluded by default.

    df4 = (df['A'].value_counts()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print (df4)
       A  COUNT
    0  b      3
    1  a      2
    2  c      2
    

    If you want the same output as groupby + size, add Series.sort_index:

    df5 = (df['A'].value_counts()
                  .sort_index()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print (df5)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2
    

    Question 5

    How to create new column filled by aggregated values?

    The method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.

    See the pandas documentation for more information.

    np.random.seed(123)
    
    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                        'B' : ['one', 'two', 'three','two', 'two', 'one'],
                        'C' : np.random.randint(5, size=6),
                        'D' : np.random.randint(5, size=6)})
    print (df)
         A      B  C  D
    0  foo    one  2  3
    1  foo    two  4  1
    2  bar  three  2  1
    3  foo    two  1  0
    4  bar    two  3  1
    5  foo    one  2  1
    
    
    df['C1'] = df.groupby('A')['C'].transform('sum')
    df['C2'] = df.groupby(['A','B'])['C'].transform('sum')
    
    
    # select several columns with a list (double brackets), not a tuple
    df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
    df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')
    
    print (df)
    
         A      B  C  D  C1  C2  C3  D3  C4  D4
    0  foo    one  2  3   9   4   9   5   4   4
    1  foo    two  4  1   9   5   9   5   5   1
    2  bar  three  2  1   5   2   5   2   2   1
    3  foo    two  1  0   9   5   9   5   5   1
    4  bar    two  3  1   5   3   5   2   3   1
    5  foo    one  2  1   9   4   9   5   4   4
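
    Since the transformed values line up with the original rows, they can be combined with them directly. A small sketch on a hypothetical frame (the `C_share` column name is made up) computes each row's share of its group total:

```python
import pandas as pd

# Small hypothetical frame, used only for illustration.
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'C': [1, 3, 2, 2]})

# transform('sum') repeats the group total on every row of the group,
# so the division is aligned row by row.
df['C_share'] = df['C'] / df.groupby('A')['C'].transform('sum')
print(df)
```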
    
