How to groupby two fields in pandas?


Question


Given following input, the goal is to group values by hour for each Date with Avg and Sum functions. Solution to grouping it by hour is here, but it does not consider new days.

Date        Time    F1  F2  F3
21-01-16    8:11    5   2   4
21-01-16    9:25    9   8   2
21-01-16    9:39    7   3   2
21-01-16    9:53    6   5   1
21-01-16    10:07   4   6   7
21-01-16    10:21   7   3   1
21-01-16    10:35   5   6   7
21-01-16    11:49   1   2   1
21-01-16    12:03   3   3   1
22-01-16    9:45    6   5   1
22-01-16    9:20    4   6   7
22-01-16    12:10   7   3   1

Expected output:

Date,Time,SUM F1,SUM F2,SUM F3,AVG F1,AVG F2,AVG F3
21-01-16,8:00,5,2,4,5,2,4
21-01-16,9:00,22,16,5,7.3,5.3,1.6
21-01-16,10:00,16,15,15,5.3,5,5
21-01-16,11:00,1,2,1,1,2,1
21-01-16,12:00,3,3,1,3,3,1
22-01-16,9:00,10,11,8,5,5.5,4
22-01-16,12:00,7,3,1,7,3,1

Answer 1:


You can do the parsing of dates during reading of the csv file:

from __future__ import print_function  # make it work with Python 2 and 3

import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
print(df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum']))

Output:

                       F1            F2            F3    
                     mean sum      mean sum      mean sum
Date       Time                                          
2016-01-21 8     5.000000   5  2.000000   2  4.000000   4
           9     7.333333  22  5.333333  16  1.666667   5
           10    5.333333  16  5.000000  15  5.000000  15
           11    1.000000   1  2.000000   2  1.000000   1
           12    3.000000   3  3.000000   3  1.000000   1
2016-01-22 9     5.000000  10  5.500000  11  4.000000   8
           12    7.000000   7  3.000000   3  1.000000   1

All the way into csv:

from __future__ import print_function

import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
df2 = df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum'])
df3 = df2.reset_index()
df3.columns = [' '.join(col).strip() for col in df3.columns.values]
print(df3.to_csv(columns=df3.columns, index=False))

Output:

Date,Time,F1 mean,F1 sum,F2 mean,F2 sum,F3 mean,F3 sum
2016-01-21,8,5.0,5,2.0,2,4.0,4
2016-01-21,9,7.333333333333333,22,5.333333333333333,16,1.6666666666666667,5
2016-01-21,10,5.333333333333333,16,5.0,15,5.0,15
2016-01-21,11,1.0,1,2.0,2,1.0,1
2016-01-21,12,3.0,3,3.0,3,1.0,1
2016-01-22,9,5.0,10,5.5,11,4.0,8
2016-01-22,12,7.0,7,3.0,3,1.0,1 
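On pandas 0.25 and later the column-flattening step can be skipped entirely with named aggregation, which also lets you emit the exact headers from the expected output (`SUM F1`, ..., `AVG F3`). A sketch building the frame inline rather than from the file:

```python
import io

import pandas as pd

# Same sample rows as above, whitespace-separated.
data = """\
Date Time F1 F2 F3
21-01-16 8:11 5 2 4
21-01-16 9:25 9 8 2
21-01-16 9:39 7 3 2
21-01-16 9:53 6 5 1
21-01-16 10:07 4 6 7
21-01-16 10:21 7 3 1
21-01-16 10:35 5 6 7
21-01-16 11:49 1 2 1
21-01-16 12:03 3 3 1
22-01-16 9:45 6 5 1
22-01-16 9:20 4 6 7
22-01-16 12:10 7 3 1
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+", parse_dates=["Date"],
                 dayfirst=True)
hour = pd.to_datetime(df["Time"], format="%H:%M").dt.hour

# Named aggregation: each output column maps to (source column, function).
aggs = {f"{label} {col}": (col, func)
        for label, func in (("SUM", "sum"), ("AVG", "mean"))
        for col in ("F1", "F2", "F3")}
out = df.groupby([df["Date"], hour]).agg(**aggs).reset_index()
print(out.to_csv(index=False))
```

Because dicts keep insertion order, the columns come out as `Date, Time, SUM F1, SUM F2, SUM F3, AVG F1, AVG F2, AVG F3`, matching the question's expected header order.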



Answer 2:


You can convert Time to datetime with to_datetime and then use groupby with agg:

print(df)
         Date   Time  F1  F2  F3
0  2016-01-21   8:11   5   2   4
1  2016-01-21   9:25   9   8   2
2  2016-01-21   9:39   7   3   2
3  2016-01-21   9:53   6   5   1
4  2016-01-21  10:07   4   6   7
5  2016-01-21  10:21   7   3   1
6  2016-01-21  10:35   5   6   7
7  2016-01-21  11:49   1   2   1
8  2016-01-21  12:03   3   3   1
9  2016-01-22   9:45   6   5   1
10 2016-01-22   9:20   4   6   7
11 2016-01-22  12:10   7   3   1

df['Time'] = pd.to_datetime(df['Time'], format="%H:%M")
print(df)
         Date                Time  F1  F2  F3
0  2016-01-21 1900-01-01 08:11:00   5   2   4
1  2016-01-21 1900-01-01 09:25:00   9   8   2
2  2016-01-21 1900-01-01 09:39:00   7   3   2
3  2016-01-21 1900-01-01 09:53:00   6   5   1
4  2016-01-21 1900-01-01 10:07:00   4   6   7
5  2016-01-21 1900-01-01 10:21:00   7   3   1
6  2016-01-21 1900-01-01 10:35:00   5   6   7
7  2016-01-21 1900-01-01 11:49:00   1   2   1
8  2016-01-21 1900-01-01 12:03:00   3   3   1
9  2016-01-22 1900-01-01 09:45:00   6   5   1
10 2016-01-22 1900-01-01 09:20:00   4   6   7
11 2016-01-22 1900-01-01 12:10:00   7   3   1
df = df.groupby([df['Date'], df['Time'].dt.hour]).agg(['mean','sum']).reset_index()
print(df)
        Date Time        F1            F2            F3    
                       mean sum      mean sum      mean sum
0 2016-01-21    8  5.000000   5  2.000000   2  4.000000   4
1 2016-01-21    9  7.333333  22  5.333333  16  1.666667   5
2 2016-01-21   10  5.333333  16  5.000000  15  5.000000  15
3 2016-01-21   11  1.000000   1  2.000000   2  1.000000   1
4 2016-01-21   12  3.000000   3  3.000000   3  1.000000   1
5 2016-01-22    9  5.000000  10  5.500000  11  4.000000   8
6 2016-01-22   12  7.000000   7  3.000000   3  1.000000   1

And then you can flatten the column names with a list comprehension:

df.columns = [' '.join(col).strip() for col in df.columns.values]
print(df)

        Date   Time   F1 mean  F1 sum   F2 mean  F2 sum   F3 mean  F3 sum
0 2016-01-21      8  5.000000       5  2.000000       2  4.000000       4
1 2016-01-21      9  7.333333      22  5.333333      16  1.666667       5
2 2016-01-21     10  5.333333      16  5.000000      15  5.000000      15
3 2016-01-21     11  1.000000       1  2.000000       2  1.000000       1
4 2016-01-21     12  3.000000       3  3.000000       3  1.000000       1
5 2016-01-22      9  5.000000      10  5.500000      11  4.000000       8
6 2016-01-22     12  7.000000       7  3.000000       3  1.000000       1
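A variant of the same idea that sidesteps the `1900-01-01` placeholder date: combine Date and Time into one real timestamp and floor it to the hour, so a single grouping key already accounts for day changes. A sketch built directly on the sample rows:

```python
import pandas as pd

# The question's sample data, as plain columns.
df = pd.DataFrame({
    "Date": ["21-01-16"] * 9 + ["22-01-16"] * 3,
    "Time": ["8:11", "9:25", "9:39", "9:53", "10:07", "10:21",
             "10:35", "11:49", "12:03", "9:45", "9:20", "12:10"],
    "F1": [5, 9, 7, 6, 4, 7, 5, 1, 3, 6, 4, 7],
    "F2": [2, 8, 3, 5, 6, 3, 6, 2, 3, 5, 6, 3],
    "F3": [4, 2, 2, 1, 7, 1, 7, 1, 1, 1, 7, 1],
})

# One real timestamp per row, then truncate to the containing hour.
ts = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d-%m-%y %H:%M")
out = df.groupby(ts.dt.floor("h"))[["F1", "F2", "F3"]].agg(["mean", "sum"])
print(out)
```

Unlike `resample("h")`, flooring and grouping produces only the hours that actually occur, so no empty buckets need to be dropped afterwards.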


来源:https://stackoverflow.com/questions/34944183/how-to-groupby-two-fields-in-pandas
