Used groupby to select most recent data, want to append a column that returns the date of the data

倾然丶 夕夏残阳落幕 submitted on 2019-12-24 07:49:31

Question


I originally had a dataframe that looked like this:

                                  industry    population %of rural land
        country       date        
        Australia     2017-01-01  NaN         NaN        NaN
                      2016-01-01  24.327571   18.898304  12
                      2015-01-01  25.396251   18.835267  12
                      2014-01-01  27.277007   18.834835  13
        United States 2017-01-01  NaN         NaN        NaN
                      2016-01-01  NaN         19.028231  NaN
                      2015-01-01  20.027274   19.212860  NaN
                      2014-01-01  20.867359   19.379071  NaN

I applied the following code which pulled the most recent data for each of the columns for each of the countries and resulted in the following dataset:

df = df.groupby(level=0).first()

               industry  population  %of rural land
country                             
Australia      24.327571   18.898304 12
United States  20.027274   19.028231 NaN
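
For reference, a minimal sketch that rebuilds the sample above and shows why values from different years end up in one row: `first()` returns the first non-NaN entry of each column separately, not the first row. The construction of `df` here is an assumption based on the printed table, and the name `latest` is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (assumed layout: a
# country/date MultiIndex with dates sorted newest first).
countries = ["Australia", "United States"]
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product([countries, dates], names=["country", "date"])

df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12.0, 12.0, 13.0,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# first() skips NaN column by column, so for United States the
# industry value comes from 2015 while population comes from 2016.
latest = df.groupby(level=0).first()
print(latest)
```

Because each column is resolved independently, the result row for a country has no single date attached to it, which is what the question below asks about.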

Is there any way to add a column that shows the year of the data as well? Where the selected values come from different years for the same country, the new column should return the oldest of those years. For Australia that would be 2016 and for the United States 2015. Ideally, the dataframe would look like this:

               year      industry  population  %of rural land
country                             
Australia      2016      24.327571   18.898304 12
United States  2015      20.027274   19.028231 NaN

Answer 1:


I think you need a helper Series that holds, for each country, the year of the first row without NaNs. Create it with dropna, then insert it as a new column:

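# year of the first (most recent) row per country with no NaN in any column
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
# first non-NaN value of each column per country
df1 = df.groupby(level=0).first()
# map each country label to its year and insert it as the first column
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)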
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231

Another solution sets the date to NaN on every row that contains a NaN, takes the first value per group, and finally extracts the year with dt.year:

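df1 = (df.reset_index(level=1)
        # keep the date only on rows where every column is non-NaN
        .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
        # first non-NaN value per column, including the masked date
        .groupby(level=0).first()
        .assign(date=lambda x: x['date'].dt.year)
        .rename(columns={'date':'year'}))
print(df1)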
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
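
Both variants above look for a row with no NaN in any column, and their printed output covers only two columns. A quick check, assuming the three-column sample from the question (rebuilt here by hand, with illustrative variable names), suggests what happens when one column is entirely NaN for a country:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data.
countries = ["Australia", "United States"]
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product([countries, dates], names=["country", "date"])

df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12.0, 12.0, 13.0,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# Every United States row has NaN in '%of rural land', so dropna()
# removes the whole country and no year can be found for it.
complete = df.dropna()
remaining = complete.index.get_level_values("country").unique().tolist()
print(remaining)

s = complete.reset_index(level=1)["date"].dt.year.groupby(level=0).first()
print(s)
```

With this data only Australia survives `dropna()`, so columns that are all NaN within a group have to be set aside before searching for the first complete row.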

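EDIT: To handle columns that are all NaN within a country, drop those columns per group before looking for the first complete row: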

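def f(x):
    # boolean mask of NaNs
    m = x.isnull()
    # ignore columns that are all NaN within this group
    m = m.loc[:, ~m.all()]
    # year of the first row with no NaN in the remaining columns
    # (axis is passed by keyword; newer pandas rejects any(1))
    return m[~m.any(axis=1)].index[0][1].year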

s = df.groupby(level=0).apply(f)
print (s)
country
Australia        2016
United States    2015
dtype: int64

df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population  %of rural land
country                                                   
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
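
A hedged alternative, not part of the original answer: instead of searching for a complete row, this sketch records the date at which each column's selected value was found and then takes the oldest of those dates per country, which is how the question phrases the requirement. The frame is rebuilt by hand and the names `long`, `picked`, `year` and `latest` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's sample data.
countries = ["Australia", "United States"]
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product([countries, dates], names=["country", "date"])

df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12.0, 12.0, 13.0,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# Long format: one row per (country, date, column) with a non-NaN value.
long = (
    df.reset_index()
      .melt(id_vars=["country", "date"], var_name="column")
      .dropna(subset=["value"])
)

# Date of the first non-NaN entry per country and column, i.e. the
# date of the value that groupby(...).first() selects.
picked = long.groupby(["country", "column"])["date"].first()

# Oldest of those dates per country, reduced to the year.
year = picked.groupby(level="country").min().dt.year

latest = df.groupby(level=0).first()
latest.insert(0, "year", year)  # aligned on the country index
print(latest)
```

On the sample data this agrees with the answer above (2016 for Australia, 2015 for United States), but the two approaches can differ when no single row is complete, so which one fits depends on what "year of the data" should mean.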


Source: https://stackoverflow.com/questions/47657932/used-groupby-to-select-most-recent-data-want-to-append-a-column-that-returns-th
