Aggregation in pandas

攒了一身酷 2020-11-22 08:16
  1. How to perform aggregation with pandas?
  2. No DataFrame after aggregation! What happened?
  3. How to aggregate mainly string columns (to lists
2 answers
  •  北荒 (original poster)
    2020-11-22 08:39

    If you are coming from an R or SQL background, here are three examples that show how to do aggregation the way you are already familiar with.

    First, create a pandas DataFrame:

    import pandas as pd
    
    df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                       'key2' : ['c','c','d','d','e'],
                       'value1' : [1,2,2,3,3],
                       'value2' : [9,8,7,6,5]})
    
    df.head(5)
    

    Here is what the table we created looks like:

    |----------------|-------------|------------|------------|
    |      key1      |     key2    |    value1  |    value2  |
    |----------------|-------------|------------|------------|
    |       a        |       c     |      1     |       9    |
    |       a        |       c     |      2     |       8    |
    |       a        |       d     |      2     |       7    |
    |       b        |       d     |      3     |       6    |
    |       a        |       e     |      3     |       5    |
    |----------------|-------------|------------|------------|
    

    1. Aggregating With Row Reduction (Similar to SQL GROUP BY)

    df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'), 
                                             sum_of_value_2=('value2', 'sum'),
                                             count_of_value1=('value1','size')
                                             ).reset_index()
    
    
    df_agg.head(5)
    

    The resulting data table will look like this:

    |----------------|-------------|--------------------|-------------------|---------------------|
    |      key1      |     key2    |   mean_of_value_1  |   sum_of_value_2  |    count_of_value1  |
    |----------------|-------------|--------------------|-------------------|---------------------|
    |       a        |      c      |         1.5        |        17         |           2         |
    |       a        |      d      |         2.0        |         7         |           1         |
    |       a        |      e      |         3.0        |         5         |           1         |
    |       b        |      d      |         3.0        |         6         |           1         |
    |----------------|-------------|--------------------|-------------------|---------------------|
    

    The SQL equivalent is:

    SELECT
          key1
         ,key2
         ,AVG(value1) AS mean_of_value_1
         ,SUM(value2) AS sum_of_value_2
         ,COUNT(*) AS count_of_value1
    FROM
        df
    GROUP BY
         key1
        ,key2
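
    To check that the pandas call and the SQL query really agree, both can be run against the same data. The following is a minimal sketch, assuming an in-memory SQLite database from Python's standard library; the table name `df` simply mirrors the query above, and the `ORDER BY` clause is added here only to make the row order deterministic.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# pandas version: named aggregation, then reset_index() to get flat columns
df_agg = df.groupby(['key1', 'key2']).agg(
    mean_of_value_1=('value1', 'mean'),
    sum_of_value_2=('value2', 'sum'),
    count_of_value1=('value1', 'size'),
).reset_index()

# SQL version: load the same rows into an in-memory SQLite table named "df"
conn = sqlite3.connect(':memory:')
df.to_sql('df', conn, index=False)

sql_result = pd.read_sql_query(
    """
    SELECT key1,
           key2,
           AVG(value1) AS mean_of_value_1,
           SUM(value2) AS sum_of_value_2,
           COUNT(*)    AS count_of_value1
    FROM df
    GROUP BY key1, key2
    ORDER BY key1, key2
    """,
    conn,
)
conn.close()

print(df_agg)
print(sql_result)
```

    Both printed tables should contain the same four groups with the same values.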
    

    2. Create Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF)

    If you want an Excel-style SUMIF or COUNTIF, where the result is attached to every row and the number of rows does not change, use transform instead of agg:

    df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')
    
    df.head(5)
    

    The resulting data frame will look like this with the same number of rows as the original:

    |----------------|-------------|------------|------------|-------------------------|
    |      key1      |     key2    |    value1  |    value2  | Total_of_value1_by_key1 |
    |----------------|-------------|------------|------------|-------------------------|
    |       a        |       c     |      1     |       9    |            8            |
    |       a        |       c     |      2     |       8    |            8            |
    |       a        |       d     |      2     |       7    |            8            |
    |       b        |       d     |      3     |       6    |            3            |
    |       a        |       e     |      3     |       5    |            8            |
    |----------------|-------------|------------|------------|-------------------------|
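
    The example above covers SUMIF only. A COUNTIF-style column, or a sum with an extra condition, can be built the same way. This is a minimal sketch; the column names `count_by_key1` and `sum_value1_where_value2_gt_6` and the `value2 > 6` condition are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# COUNTIF-style: number of rows sharing the same key1, repeated on every row
df['count_by_key1'] = df.groupby('key1')['value1'].transform('size')

# SUMIFS-style: sum value1 per key1, counting only rows where value2 > 6.
# Rows that fail the condition contribute 0 to their group's total.
df['sum_value1_where_value2_gt_6'] = (
    df['value1'].where(df['value2'] > 6, 0)
      .groupby(df['key1'])
      .transform('sum')
)

print(df)
```

    Because transform returns a result aligned to the original index, the row count stays at five in both cases.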
    

    3. Creating a Rank Column: ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

    Finally, you may want a rank column that is the equivalent of SQL's ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).

    Here is how to do that:

    df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
                 .groupby(['key1']) \
                 .cumcount() + 1
    
    df.head(5)
    

    Note: the statement spans several lines because each line ends with a \ continuation character.

    Here is what the resulting data frame looks like:

    |----------------|-------------|------------|------------|------------|
    |      key1      |     key2    |    value1  |    value2  |     RN     |
    |----------------|-------------|------------|------------|------------|
    |       a        |       c     |      1     |       9    |      4     |
    |       a        |       c     |      2     |       8    |      3     |
    |       a        |       d     |      2     |       7    |      2     |
    |       b        |       d     |      3     |       6    |      1     |
    |       a        |       e     |      3     |       5    |      1     |
    |----------------|-------------|------------|------------|------------|
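
    The row numbers are easier to check when the frame is viewed in partition order. The sketch below repeats the same logic, wrapped in parentheses as an alternative to backslashes, and then shows a common follow-up: keeping only the first row of each partition. The names `check` and `top_per_key1` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# Same logic as above; parentheses allow line breaks without backslashes.
# The result is assigned back by index, so the original row order is kept.
df['RN'] = (
    df.sort_values(['value1', 'value2'], ascending=[False, True])
      .groupby('key1')
      .cumcount() + 1
)

# Viewing the rows by partition key and RN makes the numbering easy to verify
check = df.sort_values(['key1', 'RN']).reset_index(drop=True)
print(check)

# Typical follow-up: keep only the top-ranked row of each key1 partition
top_per_key1 = df[df['RN'] == 1]
print(top_per_key1)
```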
    

    In all the examples above, the result is a flat table with ordinary column names, not the hierarchical (pivot-style) layout that some other aggregation syntaxes produce.
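
    For comparison, here is a minimal sketch of the two shapes side by side. Passing a dict of lists to agg yields two-level column labels, while named aggregation followed by reset_index() yields a flat table. The last lines show a related surprise: aggregating a single selected column returns a Series rather than a DataFrame. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# Dict-of-lists syntax: columns become a two-level MultiIndex,
# and key1/key2 move into the row index
nested = df.groupby(['key1', 'key2']).agg({'value1': ['mean', 'count'],
                                           'value2': ['sum']})
print(nested.columns.nlevels)

# Named aggregation + reset_index(): one level of column names, keys as columns
flat = df.groupby(['key1', 'key2']).agg(
    mean_of_value_1=('value1', 'mean'),
    sum_of_value_2=('value2', 'sum'),
).reset_index()
print(list(flat.columns))

# Aggregating one selected column gives a Series; reset_index() turns it
# back into a DataFrame with key1 as a regular column
as_series = df.groupby('key1')['value1'].sum()
as_frame = as_series.reset_index()
print(type(as_series).__name__, type(as_frame).__name__)
```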

    Other aggregating operators:

    mean() Compute mean of groups

    sum() Compute sum of group values

    size() Compute group sizes (rows per group, including missing values)

    count() Compute count of non-missing values per group

    std() Standard deviation of groups

    var() Compute variance of groups

    sem() Standard error of the mean of groups

    describe() Generates descriptive statistics

    first() Compute first of group values

    last() Compute last of group values

    nth() Take nth value, or a subset if n is a list

    min() Compute min of group values

    max() Compute max of group values
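
    The difference between size() and count() only shows up when a group contains missing values. The sketch below uses a small made-up frame, `df_na`, with one NaN to illustrate it alongside mean().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group 'a'
df_na = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                      'value1': [1.0, np.nan, 3.0, 4.0]})

grouped = df_na.groupby('key1')['value1']

summary = pd.DataFrame({
    'size': grouped.size(),    # all rows in the group, NaN included
    'count': grouped.count(),  # only non-missing values
    'mean': grouped.mean(),    # NaN is skipped when averaging
})

print(summary)
```

    Group 'a' has two rows but only one non-missing value, so its size and count differ.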

    Hope this helps.
