Fill in dates and use previous values

Question

My pandas DataFrame looks like this:

 country     date           gd  
 US          01-01-2014      2
 US          01-01-2015      3
 US          01-01-2013      0.4
 UK          01-01-2000      0.7
 UK          02-01-2001      0.5
 UK          01-01-2016      1

What I want to do is:

1) Fill in all dates (daily) starting from each country's minimum date, so for US that is 01-01-2013 up to today, and for UK it is 01-01-2000 up to today.

2) Fill the gd column with the previous available value.

Many thanks for your help.

Answer 1:

In [67]: today = pd.Timestamp.now().normalize()

In [68]: l = df.country.nunique()

In [72]: pd.concat([df, pd.DataFrame({'country':df.country.unique(), 'date':[today]*l, 'gd':[np.nan]*l})]) \
    ...:   .sort_values('date') \
    ...:   .groupby('country') \
    ...:   .resample('1D', on='date')['gd'] \
    ...:   .mean() \
    ...:   .reset_index() \
    ...:   .ffill()
    ...:
Out[72]:
     country       date   gd
0         UK 2000-01-01  0.7
1         UK 2000-01-02  0.7
2         UK 2000-01-03  0.7
3         UK 2000-01-04  0.7
4         UK 2000-01-05  0.7
5         UK 2000-01-06  0.7
6         UK 2000-01-07  0.7
7         UK 2000-01-08  0.7
8         UK 2000-01-09  0.7
9         UK 2000-01-10  0.7
...      ...        ...  ...
8059      US 2017-07-09  3.0
8060      US 2017-07-10  3.0
8061      US 2017-07-11  3.0
8062      US 2017-07-12  3.0
8063      US 2017-07-13  3.0
8064      US 2017-07-14  3.0
8065      US 2017-07-15  3.0
8066      US 2017-07-16  3.0
8067      US 2017-07-17  3.0
8068      US 2017-07-18  3.0

[8069 rows x 3 columns]

Answer 2:

s = df.set_index(['country', 'date']).gd

today = pd.Timestamp.today()

def then2now(x):
    x = x.xs(x.name)
    mn = x.index.min()
    return x.reindex(pd.date_range(mn, today, name='date')).ffill()

s.groupby(level='country').apply(then2now).reset_index()

Every 400th row of the result:

     country       date   gd
0         UK 2000-01-01  0.7
400       UK 2001-02-04  0.5
800       UK 2002-03-11  0.5
1200      UK 2003-04-15  0.5
1600      UK 2004-05-19  0.5
2000      UK 2005-06-23  0.5
2400      UK 2006-07-28  0.5
2800      UK 2007-09-01  0.5
3200      UK 2008-10-05  0.5
3600      UK 2009-11-09  0.5
4000      UK 2010-12-14  0.5
4400      UK 2012-01-18  0.5
4800      UK 2013-02-21  0.5
5200      UK 2014-03-28  0.5
5600      UK 2015-05-02  0.5
6000      UK 2016-06-05  1.0
6400      UK 2017-07-10  1.0
6800      US 2014-01-27  2.0
7200      US 2015-03-03  3.0
7600      US 2016-04-06  3.0
8000      US 2017-05-11  3.0

Answer 3:

You could make date the index and then use reindex to expand the dates and ffill to forward-fill the NaNs:

def expand_dates(grp):
    start = grp.index.min()
    end = today
    index = pd.date_range(start, end, freq='D')
    return grp.reindex(index).ffill()

Use groupby/apply to call expand_dates once for each group and concatenate the results:

df = df.groupby('country')['gd'].apply(expand_dates)

Correction: My first answer forward-filled the entire DataFrame as the last step: df = df.ffill(). That is correct only if each country's first gd value is not NaN. If the starting row(s) for a certain country have NaN gd value(s), then forward-filling may contaminate those gd values with values from another country. Yikes. The more robust and correct method would be to forward-fill once for each group as shown by piRSquared. Any performance gain achieved by forward-filling once instead of many times on smaller DataFrames would be minor since the number of ffill calls is limited by the number of countries (a pretty low number) and safe-guarding against a potential bug is far more important than the limited performance gain that is possible.
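To make the contamination risk concrete, here is a minimal sketch on made-up data (the values and dates below are hypothetical and are not taken from the question). It assumes the second country, US, has no gd value on its first date, and compares a single forward-fill over the whole column with a forward-fill done per country:

```python
import numpy as np
import pandas as pd

# Hypothetical data: US has no gd value on its first date.
df = pd.DataFrame({
    'country': ['UK', 'UK', 'US', 'US'],
    'date': pd.to_datetime(['2013-01-01', '2013-01-02',
                            '2013-01-01', '2013-01-02']),
    'gd': [0.7, 1.0, np.nan, 2.0],
})

# One forward-fill over the whole column: the last UK value (1.0)
# leaks into the first US row.
global_fill = df['gd'].ffill()

# Forward-fill within each country: the first US row stays NaN
# because US has no earlier value of its own.
group_fill = df.groupby('country')['gd'].ffill()

print(global_fill.tolist())
print(group_fill.tolist())
```

In the global version the first US row picks up 1.0 from UK, while the grouped version leaves it as NaN, which is the behaviour the correction argues for.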

import numpy as np
import pandas as pd
df = pd.DataFrame({'country': ['US', 'US', 'US', 'UK', 'UK', 'UK'], 'date': ['01-01-2014', '01-01-2015', '01-01-2013', '01-01-2000', '02-01-2001', '01-01-2016'], 'gd': [2.0, 3.0, 0.4, 0.7, 0.5, 1.0]})
df['date'] = pd.to_datetime(df['date'])
today = pd.Timestamp('today')
def expand_dates(grp):
    start = grp.index.min()
    end = today
    index = pd.date_range(start, end, freq='D')
    return grp.reindex(index).ffill()
df = df.set_index('date')
df = df.groupby('country')['gd'].apply(expand_dates)
print(pd.concat([df.head(), df.tail()]))

yields

country            
UK       2000-01-01    0.7
         2000-01-02    0.7
         2000-01-03    0.7
         2000-01-04    0.7
         2000-01-05    0.7
US       2017-07-14    3.0
         2017-07-15    3.0
         2017-07-16    3.0
         2017-07-17    3.0
         2017-07-18    3.0
Name: gd, dtype: float64
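The result above is a Series with a two-level index. If a flat frame with country, date and gd columns is wanted (as in the other answers), one possible sketch is to name the index levels and reset the index. The data here is a small made-up sample, and end is a fixed date standing in for today so that the result is reproducible; both are assumptions for illustration only:

```python
import pandas as pd

# Small hypothetical sample, not the data from the question.
df = pd.DataFrame({
    'country': ['US', 'US', 'UK'],
    'date': pd.to_datetime(['2013-01-01', '2013-01-04', '2013-01-02']),
    'gd': [0.4, 2.0, 0.7],
})

# Fixed end date standing in for `today` (assumption, for reproducibility).
end = pd.Timestamp('2013-01-05')

def expand_dates(grp):
    # Daily index from this country's first date up to `end`.
    index = pd.date_range(grp.index.min(), end, freq='D')
    return grp.reindex(index).ffill()

expanded = df.set_index('date').groupby('country')['gd'].apply(expand_dates)

# The second index level is unnamed after apply; name it, then flatten.
flat = expanded.rename_axis(['country', 'date']).reset_index()
print(flat)
```

Each country starts at its own first date, so UK contributes four rows and US five in this sample.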

Source: https://stackoverflow.com/questions/45176903/fill-in-dates-and-use-previous-values

Tags

python

pandas