Pandas: resample timeseries with groupby


Question:

Given the below pandas DataFrame:

In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
          locations = ['HK', 'LDN', 'LDN', 'LDN']
          event = ['foo', 'bar', 'baz', 'qux']
          df = pd.DataFrame({'Location': locations,
                             'Event': event}, index=times)
          df
Out[115]:
                    Event Location
2014-08-25 21:00:00   foo       HK
2014-08-25 21:04:00   bar      LDN
2014-08-25 22:07:00   baz      LDN
2014-08-25 22:09:00   qux      LDN

I would like to resample the data, aggregating it hourly by count while grouping by location, to produce a DataFrame that looks like this:

Out[115]:
                     HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2

I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?

Answer 1:

You could use a pd.TimeGrouper to group the DatetimeIndex'ed DataFrame by hour:

grouper = df.groupby([pd.TimeGrouper('1H'), 'Location']) 

use count to count the number of events in each group:

grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64

use unstack to move the Location index level to a column level:

grouper['Event'].count().unstack()
# Out[49]:
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2

and then use fillna to change the NaNs into zeros.
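That last step is just a chained call on the unstacked counts; note that the NaN forces the HK column to float, so an optional astype(int) afterwards restores integer display:

grouper['Event'].count().unstack().fillna(0)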


Putting it all together,

grouper = df.groupby([pd.TimeGrouper('1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)

yields

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
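Note that pd.TimeGrouper was later deprecated and removed from pandas; on newer versions the same grouping can be written with pd.Grouper instead (a minimal sketch, otherwise identical to the code above):

# pd.Grouper(freq='1H') plays the role of pd.TimeGrouper('1H') on newer pandas
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)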


Answer 2:

Pandas 0.21 answer: TimeGrouper is being deprecated

There are two options for doing this, and they can actually give different results depending on your data. The first option groups by Location and, within each Location, groups by hour. The second option groups by Location and hour at the same time (see the sketch at the end of this answer for a case where the two differ).

Option 1: Use groupby + resample

grouped = df.groupby('Location').resample('H')['Event'].count() 

Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)

grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count() 

Both will result in the following:

Location
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64

And then reshape:

grouped.unstack('Location', fill_value=0) 

Will output

Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
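The two options coincide here, but with gappier data they need not: groupby(...).resample('H') builds a regular hourly range within each group and emits zero counts for the empty hours, whereas pd.Grouper only yields the (location, hour) pairs that actually occur in the data. A minimal sketch with made-up data (df2, not part of the original question) to illustrate:

import pandas as pd

# hypothetical data: HK has a two-hour gap between its first and last event
times = pd.to_datetime(['2014-08-25 20:30:00', '2014-08-25 21:04:00', '2014-08-25 23:10:00'])
df2 = pd.DataFrame({'Location': ['HK', 'LDN', 'HK'], 'Event': ['a', 'b', 'c']}, index=times)

# Option 1: per-group resample fills the empty hours inside each group's range with 0
df2.groupby('Location').resample('H')['Event'].count()
# Location
# HK        2014-08-25 20:00:00    1
#           2014-08-25 21:00:00    0
#           2014-08-25 22:00:00    0
#           2014-08-25 23:00:00    1
# LDN       2014-08-25 21:00:00    1
# Name: Event, dtype: int64

# Option 2: pd.Grouper only yields the (location, hour) pairs present in the data
df2.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
# Location
# HK        2014-08-25 20:00:00    1
#           2014-08-25 23:00:00    1
# LDN       2014-08-25 21:00:00    1
# Name: Event, dtype: int64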


Answer 3:

Multiple Column Group By

unutbu is spot on with his answer, but I wanted to add what you could do if you had a third column, say Cost, and wanted to aggregate it like above. It was through combining unutbu's answer and this one that I found out how to do this, and I thought I would share it for future users.

Create a DataFrame with a Cost column.

In [1]: import pandas as pd
        times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                          '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
        locations = ['HK', 'LDN', 'LDN', 'LDN']
        event = ['foo', 'bar', 'baz', 'qux']
        cost = [20, 24, 34, 52]  # add in cost column
        df = pd.DataFrame({'Location': locations, 'Event': event, 'Cost': cost}, index=times)
        df

Out[1]:
                    Event Location  Cost
2014-08-25 21:00:00   foo       HK    20
2014-08-25 21:04:00   bar      LDN    24
2014-08-25 22:07:00   baz      LDN    34
2014-08-25 22:09:00   qux      LDN    52

Now we group by, using the agg function to specify an aggregation method for each column: here the events are counted and the cost is averaged.

In [2]: df = df.groupby([pd.TimeGrouper('1H'), 'Location']).agg({'Event': 'count',
                                                                 'Cost': 'mean'})
        df

Out[2]:
                              Event  Cost
                    Location
2014-08-25 21:00:00 HK            1    20
                    LDN           1    24
2014-08-25 22:00:00 LDN           2    43

Then do the final unstack, fill the NaNs with zeros, and display as int because it's nicer.

In [3]: df.unstack().fillna(0).astype(int)

Out[3]:
                    Event      Cost
Location               HK  LDN    HK  LDN
2014-08-25 21:00:00     1    1    20   24
2014-08-25 22:00:00     0    2     0   43

