Question:
Given the below pandas DataFrame:
In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
          locations = ['HK', 'LDN', 'LDN', 'LDN']
          event = ['foo', 'bar', 'baz', 'qux']
          df = pd.DataFrame({'Location': locations, 'Event': event}, index=times)
          df
Out[115]:
                    Event Location
2014-08-25 21:00:00   foo       HK
2014-08-25 21:04:00   bar      LDN
2014-08-25 22:07:00   baz      LDN
2014-08-25 22:09:00   qux      LDN
I would like to resample the data hourly, counting events per hour while grouping by location, to produce a data frame that looks like this:
Out[115]:
                     HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?
Answer 1:
You could use a pd.Grouper (originally pd.TimeGrouper, which has since been deprecated and removed from pandas) to group the DatetimeIndex'ed DataFrame by hour:
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
Use count to count the number of events in each group:
grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64
Use unstack to move the Location index level to a column level:
grouper['Event'].count().unstack()
# Out[49]:
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2
and then use fillna to change the NaNs into zeros.
Putting it all together,
# pd.Grouper(freq=...) replaces pd.TimeGrouper, which has been removed from pandas
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)
yields
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
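For readers who want to run the whole thing end to end, the steps above can be sketched as a single script. This is a minimal sketch, not the answerer's exact code: it assumes a pandas version where pd.Grouper is available, passes fill_value=0 to unstack in place of a separate fillna call, and spells the hourly frequency as "60min" only to sidestep the uppercase "H" alias that newer pandas releases deprecate.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
times = pd.to_datetime([
    "2014-08-25 21:00:00", "2014-08-25 21:04:00",
    "2014-08-25 22:07:00", "2014-08-25 22:09:00",
])
df = pd.DataFrame(
    {
        "Location": ["HK", "LDN", "LDN", "LDN"],
        "Event": ["foo", "bar", "baz", "qux"],
    },
    index=times,
)

# Assumption: "60min" stands in for the "1H" used in the answer; both
# describe hourly bins on the DatetimeIndex.
hourly = pd.Grouper(freq="60min")

# Count events per (hour, location) pair, then pivot Location into columns.
counts = df.groupby([hourly, "Location"])["Event"].count()
result = counts.unstack("Location", fill_value=0)
print(result)
```

Using fill_value=0 keeps the counts as integers, whereas unstack followed by fillna(0) passes through a float column because of the intermediate NaN.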
Answer 2:
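There are two options for doing this, and they can give different results depending on your data. The first option groups by Location and then resamples by hour within each Location, so every hour between a location's first and last event gets a row, even when its count is zero. The second option groups by Location and hour at the same time, so only the (location, hour) combinations that actually occur get a row.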
Option 1: Use groupby + resample
grouped = df.groupby('Location').resample('H')['Event'].count()
Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)
grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
They both will result in the following:
Location
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64
And then reshape:
grouped.unstack('Location', fill_value=0)
Will output
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
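The two options agree on the question's data, so the difference is easier to see on a small made-up example. The sketch below uses hypothetical data in which HK has events at 21:00 and 23:30 but nothing in the 22:00 hour. It also makes two assumptions that are not in the answer: the Event column is selected before resample (which should be equivalent to selecting it afterwards), and the frequency is written as "60min" rather than "H".

```python
import pandas as pd

# Hypothetical data: HK has a gap covering the whole 22:00 hour.
times = pd.to_datetime([
    "2014-08-25 21:00:00",  # HK
    "2014-08-25 21:04:00",  # LDN
    "2014-08-25 23:30:00",  # HK
])
df = pd.DataFrame(
    {"Location": ["HK", "LDN", "HK"], "Event": ["foo", "bar", "baz"]},
    index=times,
)

# Option 1: resample inside each Location group. Each group is resampled
# over its own time span, so empty hours inside that span appear with count 0.
option1 = df.groupby("Location")["Event"].resample("60min").count()

# Option 2: group on Location and the hourly bin together. Only the
# (location, hour) pairs that actually contain events appear.
option2 = df.groupby(["Location", pd.Grouper(freq="60min")])["Event"].count()

empty_hour = ("HK", pd.Timestamp("2014-08-25 22:00:00"))
print(empty_hour in option1.index)  # the empty HK hour is present in option 1
print(empty_hour in option2.index)  # and absent from option 2
```

If a complete hourly index matters downstream (for plotting, say), option 1 provides it per location; with option 2 the missing hours would have to be added back by reindexing.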
Answer 3:
Multiple Column Group By
unutbu is spot on with his answer, but I wanted to add what you could do if you had a third column, say Cost, and wanted to aggregate it like above. It was through combining unutbu's answer and this one that I found out how to do this, and I thought I would share it for future users.
Create a DataFrame with a Cost column.
In[1]: import pandas as pd
       times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                         '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
       locations = ['HK', 'LDN', 'LDN', 'LDN']
       event = ['foo', 'bar', 'baz', 'qux']
       cost = [20, 24, 34, 52]  # add in cost column
       df = pd.DataFrame({'Location': locations, 'Event': event, 'Cost': cost}, index=times)
       df
Out[1]:
                    Event Location  Cost
2014-08-25 21:00:00   foo       HK    20
2014-08-25 21:04:00   bar      LDN    24
2014-08-25 22:07:00   baz      LDN    34
2014-08-25 22:09:00   qux      LDN    52
Now we group by, using the agg function to specify the aggregation method for each column, e.g. count, mean, sum, etc.
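In[2]: # pd.Grouper(freq=...) replaces pd.TimeGrouper, which has been removed from pandas.
       # 'count' gives the number of events; np.sum would concatenate the Event strings.
       df = df.groupby([pd.Grouper(freq='1H'), 'Location']).agg({'Event': 'count', 'Cost': 'mean'})
       df
Out[2]:
                              Event  Cost
                    Location
2014-08-25 21:00:00 HK            1  20.0
                    LDN           1  24.0
2014-08-25 22:00:00 LDN           2  43.0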
Then the final unstack, filling NaN with zeros and displaying as int because it's nice.
In[3]: df.unstack().fillna(0).astype(int)
Out[3]:
                    Event      Cost
Location               HK LDN    HK LDN
2014-08-25 21:00:00     1   1    20  24
2014-08-25 22:00:00     0   2     0  43
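As a cross-check, the multi-column aggregation can be sketched as one self-contained script. A few things here are assumptions rather than part of the answer: pd.Grouper is used for the hourly bins, the aggregations are given as the string names "count" and "mean", the frequency is written as "60min", and the variable names summary and wide are purely illustrative.

```python
import pandas as pd

times = pd.to_datetime([
    "2014-08-25 21:00:00", "2014-08-25 21:04:00",
    "2014-08-25 22:07:00", "2014-08-25 22:09:00",
])
df = pd.DataFrame(
    {
        "Location": ["HK", "LDN", "LDN", "LDN"],
        "Event": ["foo", "bar", "baz", "qux"],
        "Cost": [20, 24, 34, 52],
    },
    index=times,
)

# One aggregation per column: number of events and mean cost
# for each (hour, location) pair.
summary = df.groupby([pd.Grouper(freq="60min"), "Location"]).agg(
    {"Event": "count", "Cost": "mean"}
)

# Pivot Location into a second column level, fill the gaps with 0,
# and cast to int for display (this truncates any fractional mean).
wide = summary.unstack("Location").fillna(0).astype(int)
print(wide)
```

Note that astype(int) is only harmless here because the mean costs happen to be whole numbers; with fractional means, rounding explicitly or keeping floats would be the safer choice.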