问题
my pandas dataframe looks like the below
country date gd
US 01-01-2014 2
US 01-01-2015 3
US 01-01-2013 0.4
UK 01-01-2000 0.7
UK 02-01-2001 0.5
UK 01-01-2016 1
what I want to do is :
1) Fill all dates (daily) starting from each countries minimum date so say for US it is 01-01-2013 upto today and for UK it is 01-01-2000 daily upto today.
2) Fill gd column with previous available data
many thanks for your help
回答1:
In [67]: today = pd.to_datetime(pd.datetime.now()).normalize()
In [68]: l = df.country.nunique()
In [72]: df.append(pd.DataFrame({'country':df.country.unique(), 'date':[today]*l, 'gd':[np.nan]*l})) \
...: .sort_values('date') \
...: .groupby('country') \
...: .resample('1D', on='date') \
...: .mean() \
...: .reset_index() \
...: .ffill()
...:
Out[72]:
country date gd
0 UK 2000-01-01 0.7
1 UK 2000-01-02 0.7
2 UK 2000-01-03 0.7
3 UK 2000-01-04 0.7
4 UK 2000-01-05 0.7
5 UK 2000-01-06 0.7
6 UK 2000-01-07 0.7
7 UK 2000-01-08 0.7
8 UK 2000-01-09 0.7
9 UK 2000-01-10 0.7
... ... ... ...
8059 US 2017-07-09 3.0
8060 US 2017-07-10 3.0
8061 US 2017-07-11 3.0
8062 US 2017-07-12 3.0
8063 US 2017-07-13 3.0
8064 US 2017-07-14 3.0
8065 US 2017-07-15 3.0
8066 US 2017-07-16 3.0
8067 US 2017-07-17 3.0
8068 US 2017-07-18 3.0
[8069 rows x 3 columns]
回答2:
s = df.set_index(['country', 'date']).gd
today = pd.datetime.today()
def then2now(x):
x = x.xs(x.name)
mn = x.index.min()
return x.reindex(pd.date_range(mn, today, name='date')).ffill()
s.groupby(level='country').apply(then2now).reset_index()
country date gd
0 UK 2000-01-01 0.7
400 UK 2001-02-04 0.5
800 UK 2002-03-11 0.5
1200 UK 2003-04-15 0.5
1600 UK 2004-05-19 0.5
2000 UK 2005-06-23 0.5
2400 UK 2006-07-28 0.5
2800 UK 2007-09-01 0.5
3200 UK 2008-10-05 0.5
3600 UK 2009-11-09 0.5
4000 UK 2010-12-14 0.5
4400 UK 2012-01-18 0.5
4800 UK 2013-02-21 0.5
5200 UK 2014-03-28 0.5
5600 UK 2015-05-02 0.5
6000 UK 2016-06-05 1.0
6400 UK 2017-07-10 1.0
6800 US 2014-01-27 2.0
7200 US 2015-03-03 3.0
7600 US 2016-04-06 3.0
8000 US 2017-05-11 3.0
回答3:
You could make date the index and then use reindex to expand the dates and ffill to forward-fill the NaNs:
def expand_dates(grp):
start = grp.index.min()
end = today
index = pd.date_range(start, end, freq='D')
return grp.reindex(index).ffill()
Use groupby/apply to call expand_dates once for each group and concatenate the results:
df = df.groupby('country')['gd'].apply(expand_dates)
Correction: My first answer forward-filled the entire DataFrame as the last step: df = df.ffill(). That is correct only if each country's first gd value is not NaN. If the starting row(s) for a certain country have NaN gd value(s), then forward-filling may contaminate those gd values with values from another country. Yikes. The more robust and correct method would be to forward-fill once for each group as shown by piRSquared. Any performance gain achieved by forward-filling once instead of many times on smaller DataFrames would be minor since the number of ffill calls is limited by the number of countries (a pretty low number) and safe-guarding against a potential bug is far more important than the limited performance gain that is possible.
import numpy as np
import pandas as pd
df = pd.DataFrame({'country': ['US', 'US', 'US', 'UK', 'UK', 'UK'], 'date': ['01-01-2014', '01-01-2015', '01-01-2013', '01-01-2000', '02-01-2001', '01-01-2016'], 'gd': [2.0, 3.0, 0.4, 0.7, 0.5, 1.0]})
df['date'] = pd.to_datetime(df['date'])
today = pd.Timestamp('today')
def expand_dates(grp):
start = grp.index.min()
end = today
index = pd.date_range(start, end, freq='D')
return grp.reindex(index).ffill()
df = df.set_index('date')
df = df.groupby('country')['gd'].apply(expand_dates)
print(pd.concat([df.head(), df.tail()]))
yields
country
UK 2000-01-01 0.7
2000-01-02 0.7
2000-01-03 0.7
2000-01-04 0.7
2000-01-05 0.7
US 2017-07-14 3.0
2017-07-15 3.0
2017-07-16 3.0
2017-07-17 3.0
2017-07-18 3.0
Name: gd, dtype: float64
来源:https://stackoverflow.com/questions/45176903/fill-in-dates-and-use-previous-values