Aggregation in pandas

攒了一身酷 2020-11-22 08:16

How to perform aggregation with pandas?
No DataFrame after aggregation! What happened?
How to aggregate mainly strings columns (to lists

2 Answers

北荒 (original poster)

2020-11-22 08:39

If you are coming from an R or SQL background, here are three examples that cover what you need to do aggregation the way you are already familiar with:

Let us first create a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'key2' : ['c','c','d','d','e'],
                   'value1' : [1,2,2,3,3],
                   'value2' : [9,8,7,6,5]})

df.head(5)

Here is what the table we created looks like:

|----------------|-------------|------------|------------|
|      key1      |     key2    |    value1  |    value2  |
|----------------|-------------|------------|------------|
|       a        |       c     |      1     |       9    |
|       a        |       c     |      2     |       8    |
|       a        |       d     |      2     |       7    |
|       b        |       d     |      3     |       6    |
|       a        |       e     |      3     |       5    |
|----------------|-------------|------------|------------|

1. Aggregating With Row Reduction Similar to SQL `Group By`

df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'), 
                                         sum_of_value_2=('value2', 'sum'),
                                         count_of_value1=('value1','size')
                                         ).reset_index()


df_agg.head(5)

The resulting data table will look like this:

|----------------|-------------|--------------------|-------------------|---------------------|
|      key1      |     key2    |   mean_of_value_1  |   sum_of_value_2  |    count_of_value1  |
|----------------|-------------|--------------------|-------------------|---------------------|
|       a        |      c      |         1.5        |        17         |           2         |
|       a        |      d      |         2.0        |         7         |           1         |
|       a        |      e      |         3.0        |         5         |           1         |
|       b        |      d      |         3.0        |         6         |           1         |
|----------------|-------------|--------------------|-------------------|---------------------|

The SQL Equivalent of this is:

SELECT
      key1
     ,key2
     ,AVG(value1) AS mean_of_value_1
     ,SUM(value2) AS sum_of_value_2
     ,COUNT(*) AS count_of_value1
FROM
    df
GROUP BY
     key1
    ,key2

2. Create Column Without Reduction in Rows (`EXCEL - SUMIF, COUNTIF`)

If you want a SUMIF, COUNTIF, etc. as you would do in Excel, where the number of rows is not reduced, use `transform` instead:

df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')

df.head(5)

The resulting data frame will look like this with the same number of rows as the original:

|----------------|-------------|------------|------------|-------------------------|
|      key1      |     key2    |    value1  |    value2  | Total_of_value1_by_key1 |
|----------------|-------------|------------|------------|-------------------------|
|       a        |       c     |      1     |       9    |            8            |
|       a        |       c     |      2     |       8    |            8            |
|       a        |       d     |      2     |       7    |            8            |
|       b        |       d     |      3     |       6    |            3            |
|       a        |       e     |      3     |       5    |            8            |
|----------------|-------------|------------|------------|-------------------------|
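The same `transform` pattern extends to other per-group figures. As a sketch (the column names `count_by_key1` and `share_of_key1_total` are made up for illustration), a COUNTIF-style column and each row's share of its group total could look like this:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

by_key1 = df.groupby('key1')['value1']

# COUNTIF-style: number of non-null value1 entries sharing the row's key1.
df['count_by_key1'] = by_key1.transform('count')

# Each row's value1 divided by the SUMIF-style total of its key1 group.
df['share_of_key1_total'] = df['value1'] / by_key1.transform('sum')

print(df)
```

Because `transform` returns a result aligned with the original rows, it can be assigned directly as a new column or used in arithmetic with existing ones.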

3. Creating a RANK Column `ROW_NUMBER() OVER (PARTITION BY ORDER BY)`

Finally, there might be cases where you want to create a rank column, which is the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).

Here is how you do that.

df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
             .groupby(['key1']) \
             .cumcount() + 1

df.head(5)

Note: we make the code multi-line by adding \ at the end of each line.

Here is what the resulting data frame looks like:

|----------------|-------------|------------|------------|------------|
|      key1      |     key2    |    value1  |    value2  |     RN     |
|----------------|-------------|------------|------------|------------|
|       a        |       c     |      1     |       9    |      4     |
|       a        |       c     |      2     |       8    |      3     |
|       a        |       d     |      2     |       7    |      2     |
|       b        |       d     |      3     |       6    |      1     |
|       a        |       e     |      3     |       5    |      1     |
|----------------|-------------|------------|------------|------------|
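ROW_NUMBER() always hands out distinct numbers, even when rows tie. If tied rows should share a rank instead (closer to SQL's DENSE_RANK()), one option is the groupby `rank` method. This is a different technique from the `cumcount` approach above, it ranks on a single column only, and the column name `dense_rank` is just an illustration:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'key2': ['c', 'c', 'd', 'd', 'e'],
                   'value1': [1, 2, 2, 3, 3],
                   'value2': [9, 8, 7, 6, 5]})

# Dense rank of value1 within each key1 group, highest value first.
# Tied values get the same rank; rank() returns floats, so cast to int.
df['dense_rank'] = (df.groupby('key1')['value1']
                      .rank(method='dense', ascending=False)
                      .astype(int))

print(df)
```

Here the two rows of group `a` with `value1 == 2` both receive rank 2, whereas the `cumcount` version numbered them 2 and 3.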

In all the examples above, the final result is a flat table with ordinary columns, not the hierarchical (pivot-style) index structure that some other groupby syntaxes return.

Other aggregating operators:

mean() Compute mean of groups

sum() Compute sum of group values

size() Compute group sizes

count() Compute count of group, excluding missing values

std() Standard deviation of groups

var() Compute variance of groups

sem() Standard error of the mean of groups

describe() Generates descriptive statistics

first() Compute first of group values

last() Compute last of group values

nth() Take nth value, or a subset if n is a list

min() Compute min of group values

max() Compute max of group values
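Several of these operators can be called directly on a grouped column or passed by name to `agg`. The sketch below uses a small made-up frame (`demo`, not part of the original example) with one missing value, mainly to show how `size()` and `count()` differ:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' contains one missing value.
demo = pd.DataFrame({'key1': ['a', 'a', 'b'],
                     'value1': [1.0, np.nan, 3.0]})

grouped = demo.groupby('key1')['value1']

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-null values per group
print(sizes)
print(counts)

# Several operators at once, one output column per function name.
summary = grouped.agg(['mean', 'sum', 'min', 'max'])
print(summary)
```

For group `a`, `size()` reports 2 rows while `count()` reports 1, because the missing value is skipped.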

Hope this helps.
