Question
I originally had a dataframe that looked like this:
                           industry  population  %of rural land
country       date
Australia     2017-01-01        NaN         NaN             NaN
              2016-01-01  24.327571   18.898304              12
              2015-01-01  25.396251   18.835267              12
              2014-01-01  27.277007   18.834835              13
United States 2017-01-01        NaN         NaN             NaN
              2016-01-01        NaN   19.028231             NaN
              2015-01-01  20.027274   19.212860             NaN
              2014-01-01  20.867359   19.379071             NaN
I applied the following code which pulled the most recent data for each of the columns for each of the countries and resulted in the following dataset:
df = df.groupby(level=0).first()
                industry  population  %of rural land
country
Australia      24.327571   18.898304              12
United States  20.027274   19.028231             NaN
Is there any way to add a column that shows the year of the data as well? Where the values for a country come from different years, it should return the oldest of those years. So for Australia that would be 2016, and for the US it would be 2015. Ideally, the dataframe would look like this:
               year   industry  population  %of rural land
country
Australia      2016  24.327571   18.898304              12
United States  2015  20.027274   19.028231             NaN
Answer 1:
I think you need the year of the first row without NaNs for each country. Create a helper Series with dropna and groupby, then insert it as a new column:
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
Another solution sets date to missing in every row that contains a NaN, takes the first non-missing date per country, and finally extracts the year with dt.year:
df1 = (df.reset_index(level=1)
         # keep the date only in rows where every column has a value
         .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
         .groupby(level=0).first()
         .assign(date=lambda x: x['date'].dt.year)
         .rename(columns={'date':'year'}))
print(df1)
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
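EDIT:
If a column is entirely NaN for a country (such as %of rural land for United States), dropna would remove every row of that country, so exclude the all-NaN columns within each group before looking for the first complete row: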
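def f(x):
    # boolean mask of missing values
    m = x.isnull()
    # remove columns that are all NaN within this group
    m = m.loc[:, ~m.all()]
    # year of the first row with no NaN in the remaining columns
    # (index entries are (country, date) tuples)
    return m[~m.any(axis=1)].index[0][1].year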
s = df.groupby(level=0).apply(f)
print(s)
country
Australia        2016
United States    2015
dtype: int64
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)
               year   industry  population  %of rural land
country
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
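For anyone who wants to run this end to end, here is a minimal self-contained sketch of the EDIT approach. It rebuilds the sample data from the question and assumes the index is a (country, date) MultiIndex with datetime dates. The helper name first_complete_year and the variable names are made up for illustration, and a plain loop over the groups stands in for groupby.apply; it is not the only way to write this.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (values copied from the post).
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)


def first_complete_year(group):
    """Hypothetical helper: year of the first row with no NaN,
    ignoring columns that are entirely NaN within the group."""
    usable = group.dropna(axis=1, how="all")      # drop columns with no data at all
    complete = usable.dropna(axis=0, how="any")   # keep rows where every value is present
    # Raises IndexError if a group has no complete row at all.
    return complete.index.get_level_values("date")[0].year


# One year per country, computed group by group.
years = {
    country: first_complete_year(group)
    for country, group in df.groupby(level="country")
}

# Most recent non-NaN value per column, as in the question.
result = df.groupby(level="country").first()
result.insert(0, "year", [years[country] for country in result.index])

# year is 2016 for Australia and 2015 for United States
print(result)
```

Because the rows are sorted from newest to oldest, the first complete row is also the oldest year among the values that first() picks, which is what the question asks for. If the dates were not sorted that way, the frame would need sorting first.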
Source: https://stackoverflow.com/questions/47657932/used-groupby-to-select-most-recent-data-want-to-append-a-column-that-returns-th