Question
I'm using Pandas 0.19.
Consider the following DataFrame:
FID admin0 admin1 admin2 windspeed population
0 cntry1 state1 city1 60km/h 700
1 cntry1 state1 city1 90km/h 210
2 cntry1 state1 city2 60km/h 100
3 cntry1 state2 city3 60km/h 70
4 cntry1 state2 city4 60km/h 180
5 cntry1 state2 city4 90km/h 370
6 cntry2 state3 city5 60km/h 890
7 cntry2 state3 city6 60km/h 120
8 cntry2 state3 city6 90km/h 420
9 cntry2 state3 city6 120km/h 360
10 cntry2 state4 city7 60km/h 740
How can I create a table like this one?
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
cntry1 state1 city2 100 0 0
cntry1 state2 city3 70 0 0
cntry1 state2 city4 180 370 0
cntry2 state3 city5 890 0 0
cntry2 state3 city6 120 420 360
cntry2 state4 city7 740 0 0
I have tried with the following pivot table:
table = pd.pivot_table(df,index=["admin0","admin1","admin2"], columns=["windspeed"], values=["population"],fill_value=0)
In general it works great, but unfortunately I am not able to sort the new columns in the right order: the 120km/h column appears before the ones for 60km/h and 90km/h. How can I specify the order of the new columns?
Moreover, as a second step I need to add subtotals both for admin0 and admin1. Ideally, the table I need should be like this:
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
cntry1 state1 city2 100 0 0
SUM state1 800 210 0
cntry1 state2 city3 70 0 0
cntry1 state2 city4 180 370 0
SUM state2 250 370 0
SUM cntry1 1050 580 0
cntry2 state3 city5 890 0 0
cntry2 state3 city6 120 420 360
SUM state3 1010 420 360
cntry2 state4 city7 740 0 0
SUM state4 740 0 0
SUM cntry2 1750 420 360
SUM ALL 2800 1000 360
Answer 1:
A solution that builds the subtotals with groupby and MultiIndex.from_arrays, then joins all the DataFrames with concat, sorts the result with sort_index, and finally adds the grand-total row:
#replace km/h and convert to int
df.windspeed = df.windspeed.str.replace('km/h','').astype(int)
print (df)
FID admin0 admin1 admin2 windspeed population
0 0 cntry1 state1 city1 60 700
1 1 cntry1 state1 city1 90 210
2 2 cntry1 state1 city2 60 100
3 3 cntry1 state2 city3 60 70
4 4 cntry1 state2 city4 60 180
5 5 cntry1 state2 city4 90 370
6 6 cntry2 state3 city5 60 890
7 7 cntry2 state3 city6 60 120
8 8 cntry2 state3 city6 90 420
9 9 cntry2 state3 city6 120 360
10 10 cntry2 state4 city7 60 740
#pivoting
table = pd.pivot_table(df,
                       index=["admin0","admin1","admin2"],
                       columns=["windspeed"],
                       values=["population"],
                       fill_value=0)
print (table)
                     population
windspeed                60    90   120
admin0 admin1 admin2
cntry1 state1 city1     700   210     0
              city2     100     0     0
       state2 city3      70     0     0
              city4     180   370     0
cntry2 state3 city5     890     0     0
              city6     120   420   360
       state4 city7     740     0     0
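One caveat worth keeping in mind: pivot_table aggregates with the mean by default, so the call above reproduces the raw population values only because each (admin0, admin1, admin2, windspeed) combination occurs once in the sample data. The sketch below uses a made-up three-row frame with one duplicated combination (not data from the question) to show how the default differs from aggfunc="sum"; the names demo, mean_table and sum_table are illustrative.

```python
import pandas as pd

# Hypothetical data: city1 appears twice for the same wind speed.
demo = pd.DataFrame({
    "admin2": ["city1", "city1", "city2"],
    "windspeed": [60, 60, 60],
    "population": [700, 300, 100],
})

# Default aggregation is the mean of the duplicated cells.
mean_table = pd.pivot_table(demo, index="admin2", columns="windspeed",
                            values="population")

# Summing is usually what is wanted for population counts.
sum_table = pd.pivot_table(demo, index="admin2", columns="windspeed",
                           values="population", aggfunc="sum")

print(mean_table)
print(sum_table)
```

If duplicates can occur in the real data, passing aggfunc="sum" explicitly is probably the safer choice for this kind of table.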
#groupby and create sum dataframe by levels 0,1
df1 = table.groupby(level=[0,1]).sum()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1)+ '_sum',
                                       len(df1.index) * ['']])
print (df1)
                   population
windspeed              60    90   120
admin0
cntry1 state1_sum     800   210     0
       state2_sum     250   370     0
cntry2 state3_sum    1010   420   360
       state4_sum     740     0     0
df2 = table.groupby(level=0).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.values + '_sum',
                                       len(df2.index) * [''],
                                       len(df2.index) * ['']])
print (df2)
             population
windspeed        60    90   120
cntry1_sum     1050   580     0
cntry2_sum     1750   420   360
#concat all dataframes together, sort index
df = pd.concat([table, df1, df2]).sort_index(level=[0])
#add km/h to second level in columns
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
                                        df.columns.get_level_values(1).astype(str) + 'km/h'])
#add all sum
df.loc[('All_sum','','')] = table.sum().values
print (df)
                             population
                               60km/h  90km/h 120km/h
admin0     admin1     admin2
cntry1     state1     city1       700     210       0
                      city2       100       0       0
           state1_sum             800     210       0
           state2     city3        70       0       0
                      city4       180     370       0
           state2_sum             250     370       0
cntry1_sum                       1050     580       0
cntry2     state3     city5       890       0       0
                      city6       120     420     360
           state3_sum            1010     420     360
           state4     city7       740       0       0
           state4_sum             740       0       0
cntry2_sum                       1750     420     360
All_sum                          2800    1000     360
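For reference, the steps above can be put together into one self-contained script. This is a sketch under a few assumptions rather than the exact code of the answer: it rebuilds the question's sample data by hand, passes values as a plain string so the columns stay single-level, uses aggfunc="sum" instead of the default, and appends the grand-total row with concat rather than by enlarging with .loc. The names rows, state_sums, country_sums, grand_total and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question (FID column omitted).
rows = [
    ("cntry1", "state1", "city1", "60km/h", 700),
    ("cntry1", "state1", "city1", "90km/h", 210),
    ("cntry1", "state1", "city2", "60km/h", 100),
    ("cntry1", "state2", "city3", "60km/h", 70),
    ("cntry1", "state2", "city4", "60km/h", 180),
    ("cntry1", "state2", "city4", "90km/h", 370),
    ("cntry2", "state3", "city5", "60km/h", 890),
    ("cntry2", "state3", "city6", "60km/h", 120),
    ("cntry2", "state3", "city6", "90km/h", 420),
    ("cntry2", "state3", "city6", "120km/h", 360),
    ("cntry2", "state4", "city7", "60km/h", 740),
]
df = pd.DataFrame(rows, columns=["admin0", "admin1", "admin2",
                                 "windspeed", "population"])

# Strip the unit so the speeds sort numerically (60, 90, 120).
df["windspeed"] = df["windspeed"].str.replace("km/h", "", regex=False).astype(int)

table = pd.pivot_table(df, index=["admin0", "admin1", "admin2"],
                       columns="windspeed", values="population",
                       aggfunc="sum", fill_value=0)
names = table.index.names

# Subtotals per (admin0, admin1), labelled "<state>_sum".
state_sums = table.groupby(level=["admin0", "admin1"]).sum()
state_sums.index = pd.MultiIndex.from_arrays(
    [state_sums.index.get_level_values("admin0"),
     state_sums.index.get_level_values("admin1") + "_sum",
     [""] * len(state_sums)],
    names=names)

# Subtotals per admin0, labelled "<country>_sum".
country_sums = table.groupby(level="admin0").sum()
country_sums.index = pd.MultiIndex.from_arrays(
    [country_sums.index + "_sum",
     [""] * len(country_sums),
     [""] * len(country_sums)],
    names=names)

# Grand total as a one-row frame, so it can be appended after sorting.
grand_total = table.sum().to_frame().T
grand_total.index = pd.MultiIndex.from_tuples([("All_sum", "", "")], names=names)

# Sorting places each "_sum" row right after the rows it summarises.
result = pd.concat([table, state_sums, country_sums]).sort_index()
result = pd.concat([result, grand_total])

# Put the unit back on the column labels.
result.columns = [f"{speed}km/h" for speed in result.columns]
print(result)
```

The ordering relies on plain string sorting: "state1_sum" sorts after "state1" and "cntry1_sum" sorts between "cntry1" and "cntry2". With differently shaped labels the subtotal rows may land elsewhere, so this layout trick should be checked against the real names.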
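EDIT (in response to a comment): to add a subtotal only for groups that contain more than one row, return the sum conditionally and drop the rows that come back empty:
def f(x):
    #show each group (debug output)
    print (x)
    #return the sum only for groups with more than one row, otherwise None
    if (len(x) > 1):
        return x.sum()
df1 = table.groupby(level=[0,1]).apply(f).dropna(how='all')
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1)+ '_sum',
                                       len(df1.index) * ['']])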
print (df1)
                   population
windspeed                60      90     120
admin0
cntry1 state1_sum     800.0   210.0     0.0
       state2_sum     250.0   370.0     0.0
cntry2 state3_sum    1010.0   420.0   360.0
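The apply-based version turns the subtotals into floats because the groups that return None become NaN before they are dropped. A possible alternative, not taken from the answer, is to compute the sums and the group sizes separately and keep only the groups with more than one row, which leaves the integer dtype untouched. The sketch below rebuilds the pivoted table by hand with single-level columns; the names grouped, sums, sizes and multi_row_sums are illustrative.

```python
import pandas as pd

# Pivoted table from the answer, rebuilt by hand with single-level columns.
index = pd.MultiIndex.from_tuples(
    [("cntry1", "state1", "city1"), ("cntry1", "state1", "city2"),
     ("cntry1", "state2", "city3"), ("cntry1", "state2", "city4"),
     ("cntry2", "state3", "city5"), ("cntry2", "state3", "city6"),
     ("cntry2", "state4", "city7")],
    names=["admin0", "admin1", "admin2"])
table = pd.DataFrame({60: [700, 100, 70, 180, 890, 120, 740],
                      90: [210, 0, 0, 370, 0, 420, 0],
                      120: [0, 0, 0, 0, 0, 360, 0]},
                     index=index)

grouped = table.groupby(level=["admin0", "admin1"])
sums = grouped.sum()    # one row per (admin0, admin1)
sizes = grouped.size()  # number of admin2 rows in each group

# Keep subtotals only where the group has more than one row.
multi_row_sums = sums[sizes > 1]
print(multi_row_sums)
```

The result can then be relabelled with MultiIndex.from_arrays in the same way as df1 above.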
Answer 2:
You can do this with the reindex() method and a custom sort key:
In [26]: table
Out[26]:
                     population
windspeed             120km/h  60km/h  90km/h
admin0 admin1 admin2
cntry1 state1 city1         0     700     210
              city2         0     100       0
       state2 city3         0      70       0
              city4         0     180     370
cntry2 state3 city5         0     890       0
              city6       360     120     420
       state4 city7         0     740       0
In [27]: cols = sorted(table.columns.tolist(), key=lambda x: int(x[1].replace('km/h','')))
In [28]: cols
Out[28]: [('population', '60km/h'), ('population', '90km/h'), ('population', '120km/h')]
In [29]: table = table.reindex(columns=cols)
In [30]: table
Out[30]:
                     population
windspeed              60km/h  90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1       700     210       0
              city2       100       0       0
       state2 city3        70       0       0
              city4       180     370       0
cntry2 state3 city5       890       0       0
              city6       120     420     360
       state4 city7       740       0       0
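The reason the 120km/h column comes first is that the labels are strings and are compared character by character, so "1" sorts before "6" and "9". The key function used in In [27] can be tried in isolation with plain Python, without pandas; the helper name speed_key is illustrative.

```python
# Column labels as they come out of the pivot table (string-sorted).
cols = [("population", "120km/h"), ("population", "60km/h"),
        ("population", "90km/h")]

def speed_key(col):
    """Return the numeric wind speed encoded in the second label."""
    return int(col[1].replace("km/h", ""))

lexical = sorted(cols)                  # string comparison
numeric = sorted(cols, key=speed_key)   # comparison by numeric speed

print(lexical)
print(numeric)
```

The numerically sorted list is what gets passed to table.reindex(columns=cols) in the session above.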
Source: https://stackoverflow.com/questions/39955336/pandas-pivot-table-columns-order-and-subtotals