How can I add summary rows to a pandas DataFrame calculated on multiple columns by agg functions like mean, median, etc

问题

I have some data with multiple observations for a given Collector, Date, Sample, and Type where the observation values vary by ID.

import StringIO
import pandas as pd

data = """Collector,Date,Sample,Type,ID,Value
Emily,2014-06-20,201,HV,A,34
Emily,2014-06-20,201,HV,B,22
Emily,2014-06-20,201,HV,C,10
Emily,2014-06-20,201,HV,D,5
John,2014-06-22,221,HV,A,40
John,2014-06-22,221,HV,B,39
John,2014-06-22,221,HV,C,11
John,2014-06-22,221,HV,D,2
Emily,2014-06-23,203,HV,A,33
Emily,2014-06-23,203,HV,B,35
Emily,2014-06-23,203,HV,C,13
Emily,2014-06-23,203,HV,D,1
John,2014-07-01,218,HV,A,35
John,2014-07-01,218,HV,B,29
John,2014-07-01,218,HV,C,13
John,2014-07-01,218,HV,D,1
"""

>>> df = pd.read_csv(StringIO.StringIO(data), parse_dates="Date")

After doing some graphing with the data in this long format, I pivot it to a wide summary table format with columns for each ID.

>>> table = df.pivot_table(index=["Collector", "Date", "Sample", "Type"], columns="ID", values="Value")

ID                                 A   B   C  D
Collector Date       Sample Type               
Emily     2014-06-20 201    HV    34  22  10  5
          2014-06-23 203    HV    33  35  13  1
John      2014-06-22 221    HV    40  39  11  2
          2014-07-01 218    HV    35  29  13  1

However, I can't find a concise way to calculate and add some summary rows to the wide format data with mean, median, and maybe some custom aggregation function applied to each of the ID-based columns. This is what I want to end up with:

ID Collector        Date  Sample Type   A    B    C   D
0      Emily  2014-06-20     201   HV  34   22   10   5
2       John  2014-06-22     221   HV  40   39   11   2
1      Emily  2014-06-23     203   HV  33   35   13   1
3       John  2014-07-01     218   HV  35   29   13   1
4       mean                          35.5 31.3 11.8 2.3
5     median                          34.5 32.0 12.0 1.5

I tried things like calling mean or median on the summary table, but I end up with a Series rather than a row I can concatenate to the summary table. The summary rows I want are sort of like pivot_table margins, but the aggregation function is not sum.

>>> table.mean()
ID
A     35.50
B     31.25
C     11.75
D      2.25
dtype: float64

>>> table.median()
ID
A     34.5
B     32.0
C     12.0
D      1.5
dtype: float64

回答1:

You could use aggfunc=[np.mean, np.median] to compute both the means and the medians. Then you could use margins=True to also obtain the means and medians for each column and for each row.

result = df.pivot_table(index=["Collector", "Date", "Sample", "Type"], 
    columns="ID", values="Value", margins=True, 
    aggfunc=[np.mean, np.median]).stack(level=0)

yields

ID                                          A      B      C     D      All
Collector Date       Sample Type                                          
Emily     2014-06-20 201    HV   mean    34.0  22.00  10.00  5.00  17.7500
                                 median  34.0  22.00  10.00  5.00  16.0000
          2014-06-23 203    HV   mean    33.0  35.00  13.00  1.00  20.5000
                                 median  33.0  35.00  13.00  1.00  23.0000
John      2014-06-22 221    HV   mean    40.0  39.00  11.00  2.00  23.0000
                                 median  40.0  39.00  11.00  2.00  25.0000
          2014-07-01 218    HV   mean    35.0  29.00  13.00  1.00  19.5000
                                 median  35.0  29.00  13.00  1.00  21.0000
All                              mean    35.5  31.25  11.75  2.25  20.1875
                                 median  34.5  32.00  12.00  1.50  17.5000

Yes, result contains more data than you asked for, but

result.loc['All']

has the additional values:

ID                          A      B      C     D      All
Date Sample Type                                          
                 mean    35.5  31.25  11.75  2.25  20.1875
                 median  34.5  32.00  12.00  1.50  17.5000

Or, you could further subselect result to get just the rows you are looking for:

result.index.names = [u'Collector', u'Date', u'Sample', u'Type', u'aggfunc']
mask = result.index.get_level_values('aggfunc') == 'mean'
mask[-1] = True
result = result.loc[mask]
print(result)

yields

ID                                           A      B      C     D      All
Collector Date       Sample Type aggfunc                                   
Emily     2014-06-20 201    HV   mean     34.0  22.00  10.00  5.00  17.7500
          2014-06-23 203    HV   mean     33.0  35.00  13.00  1.00  20.5000
John      2014-06-22 221    HV   mean     40.0  39.00  11.00  2.00  23.0000
          2014-07-01 218    HV   mean     35.0  29.00  13.00  1.00  19.5000
All                              mean     35.5  31.25  11.75  2.25  20.1875
                                 median   34.5  32.00  12.00  1.50  17.5000

回答2:

This might not be super clean, but you could assign to the new entries with .loc.

In [131]: table_mean = table.mean()

In [132]: table_median = table.median()

In [134]: table.loc['Mean', :] = table_mean.values

In [135]: table.loc['Median', :] = table_median.values

In [136]: table
Out[136]: 
ID                                   A      B      C     D
Collector Date       Sample Type                          
Emily     2014-06-20 201    HV    34.0  22.00  10.00  5.00
          2014-06-23 203    HV    33.0  35.00  13.00  1.00
John      2014-06-22 221    HV    40.0  39.00  11.00  2.00
          2014-07-01 218    HV    35.0  29.00  13.00  1.00
Mean                              35.5  31.25  11.75  2.25
Median                            34.5  32.00  12.00  1.50

来源：https://stackoverflow.com/questions/26025607/how-can-i-add-summary-rows-to-a-pandas-dataframe-calculated-on-multiple-columns

标签

python

numpy

pandas