Question
I have this dataframe:
ID key
0 1 A
1 1 B
2 2 C
3 3 D
4 3 E
5 3 E
I want to create additional key columns, as necessary, to store the values from the key column when an ID is duplicated.
This is a snippet of the output:
ID key key2
0 1 A B # Note: ID#1 appeared twice in the dataframe, so the key value "B"
# associated with the duplicate ID will be stored in the new column "key2"
The complete output should look like the following:
ID key key2 key3
0 1 A B NaN
1 2 C NaN NaN
2 3 D E E # ID#3 is repeated three times. The key of
# the second occurrence, "E", is stored under the "key2" column
# and the key of the third occurrence, "E", is stored in the new column "key3"
Any suggestions on how I should approach this problem?
Thanks,
Answer 1:
Check out groupby and apply in the pandas docs. You can then unstack the extra level of the MultiIndex that is created.
df.groupby('ID')['key'].apply(
    lambda s: pd.Series(s.values, index=['key_%s' % i for i in range(s.shape[0])])
).unstack(-1)
outputs
key_0 key_1 key_2
ID
1 A B None
2 C None None
3 D E E
If you want ID as a column, you can call reset_index on this DataFrame.
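As a rough end-to-end sketch of this answer, the snippet below rebuilds the question's dataframe by hand (the construction is an assumption based on the table shown in the question) and chains reset_index onto the result so that ID becomes an ordinary column. The variable name wide is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({"ID": [1, 1, 2, 3, 3, 3],
                   "key": ["A", "B", "C", "D", "E", "E"]})

wide = (
    df.groupby("ID")["key"]
      # Relabel each group's values as key_0, key_1, ... by position.
      .apply(lambda s: pd.Series(s.values,
                                 index=["key_%s" % i for i in range(s.shape[0])]))
      # Move the inner index level (key_0, key_1, ...) into columns.
      .unstack(-1)
      # Turn the ID index back into a regular column.
      .reset_index()
)
print(wide)
```

Groups with fewer rows than the largest group are left with missing values in the extra columns.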
Answer 2:
You can use cumcount with pivot_table:
df['cols'] = 'key' + df.groupby('ID').cumcount().astype(str)
print (df.pivot_table(index='ID', columns='cols', values='key', aggfunc=''.join))
cols key0 key1 key2
ID
1 A B None
2 C None None
3 D E E
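If the column names should match the question exactly (key, key2, key3), one possible variation on the cumcount idea is sketched below. It swaps pivot_table for pivot, which is enough here because each (ID, position) pair occurs only once, so no aggregation function is needed. The names pos, labels and wide are hypothetical, and the dataframe is rebuilt from the table in the question.

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({"ID": [1, 1, 2, 3, 3, 3],
                   "key": ["A", "B", "C", "D", "E", "E"]})

# Zero-based position of each row within its ID group.
pos = df.groupby("ID").cumcount()

# "key" for the first occurrence, then "key2", "key3", ...
labels = pos.map(lambda i: "key" if i == 0 else "key%d" % (i + 1))

wide = (
    df.assign(cols=labels)
      .pivot(index="ID", columns="cols", values="key")
      .rename_axis(columns=None)  # drop the leftover "cols" axis label
      .reset_index()              # make ID a regular column again
)
print(wide)
```

The pivoted columns come out in lexicographic order, so with ten or more repeats of an ID a column such as key10 would sort before key2 and the columns would need to be reordered explicitly.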
Source: https://stackoverflow.com/questions/38733732/how-to-create-new-columns-to-store-the-data-of-the-duplicate-id-column