Question
I've got a table showing the accumulated number of hours (dataframe values) that different specialists (ID) have taken to complete a sequence of four tasks ['Task1', 'Task2', 'Task3', 'Task4'], like this:
Input:
ID Task1 Task2 Task3 Task4
0 10 1 3 4 6
1 11 1 3 4 5
2 12 1 3 4 6
Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:
Desired output:
value 1 3 4 5 6
ID
10 Task1 Task2 Task3 Task3 Task4
11 Task1 Task2 Task3 Task4 Task4
12 Task1 Task2 Task3 Task3 Task4
With this particular dataframe, I've managed to produce the desired output using pd.melt(), df.pivot() and df.fillna() like this (complete snippet with sample data further down):
What I have tried:
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
The problem is that this approach is not very robust: it easily breaks on a dataset that would produce (I think) duplicate column names. Here's an example where that happens just by changing Task3 in row 0 (ID 10) from 5 to 4, so that it equals Task2:
Code 1
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Code 1 - Error:
ValueError: Index contains duplicate entries, cannot reshape
And according to the docs, pd.pivot_table() is a:
generalization of pivot that can handle duplicate values for one index/column pair.
So I was hoping that pd.pivot_table() would be better suited for this case. Alas, this triggers:
DataError: No numeric types to aggregate
Does anyone know if it's at all possible to obtain a robust way of handling these errors? Am I perhaps only using pd.pivot_table() the wrong way? I've also tried to include aggfunc=None.
I'm at a loss here, so any suggestions would be great! Although I'm hoping for an approach with df.pivot or pd.pivot_table and / or the shortest approach possible.
Complete working code example:
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 5, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Complete example where both df.pivot and pd.pivot_table fail:
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
# df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.pivot_table(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Answer 1:
You can also do this using pd.crosstab:
# df1 is the wide dataframe from the question (the variant where df.pivot fails)
dfm = df1.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)
print(df_out)
Output:
val 1 3 4 5 6
ID
10 Task1 Task1 Task2 Task2 Task4
11 Task1 Task2 Task3 Task4 Task4
12 Task1 Task2 Task3 Task3 Task4
Or changing the aggfunc to 'last':
dfm = df1.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)
df_out
Output:
val 1 3 4 5 6
ID
10 Task1 Task1 Task3 Task3 Task4
11 Task1 Task2 Task3 Task4 Task4
12 Task1 Task2 Task3 Task3 Task4
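Since the question asks specifically about pd.pivot_table, here is a minimal sketch of the same idea with pivot_table instead of crosstab. It assumes the failing sample data from the question, and it relies on passing a string aggregator such as 'last' (or 'first'), which does not need numeric values. The names dfm and out are only illustrative.

```python
import pandas as pd

# Sample data from the question (the variant where df.pivot fails,
# because ID 10 has Task2 == Task3 == 4).
df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

# Long format: one row per (ID, task) with the accumulated hours in 'val'.
dfm = df1.melt('ID', value_name='val')

# aggfunc='last' keeps the later task when two tasks share the same hour value;
# ffill(axis=1) then carries each task forward across the missing hours.
out = dfm.pivot_table(index='ID', columns='val', values='variable',
                      aggfunc='last').ffill(axis=1)
print(out)
```

The default aggfunc of pivot_table is a mean, which is why the task names (strings) trigger the "No numeric types to aggregate" error in the question.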
Answer 2:
I'm pretty sure that this is not the best way to do this but it is one way.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
# Long format: one row per (ID, task), indexed by (ID, value).
df1 = pd.melt(df, id_vars=['ID'], value_vars=df.columns[1:])
df1['value'] = df1['value'].astype(int)
df1.set_index(['ID','value'], inplace=True)
# Build a full grid of every ID for every hour from 1 to the largest value.
df_max_val = df.set_index('ID').max().max()
ids = df['ID'].tolist()*df_max_val
values = list(np.array([[i]*len(set(ids)) for i in range(1, df_max_val+1)]).flatten())
df2 = pd.DataFrame({'ID':ids,
'value':values})
df2.set_index(['ID','value'], inplace=True)
# Attach the tasks to the grid; where two tasks share an hour, keep the last one.
df3 = df2.merge(df1, left_index=True, right_index=True, how='outer')
df3 = df3.reset_index().drop_duplicates(subset=['ID','value'], keep='last')
# Forward-fill the missing hours within each ID, then pivot to wide format.
df3 = pd.concat([df3[df3['ID']==i].ffill() for i in df3['ID'].unique()])
df3 = df3.pivot(index='ID', columns='value', values='variable')
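The grid-and-fill idea above can probably be written more compactly. The following is only a sketch of an alternative, not the code from this answer: it aggregates duplicates with 'last', reindexes the columns to every hour from 1 to the maximum, and forward-fills along each row. The names tidy, wide, all_hours and result are illustrative.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({'ID': [10, 11, 12],
                   'Task1': [1, 1, 1],
                   'Task2': [4, 3, 3],
                   'Task3': [4, 4, 4],
                   'Task4': [6, 5, 6]})

# Long format: one row per (ID, task) with the accumulated hours in 'hour'.
tidy = df.melt('ID', var_name='task', value_name='hour')

# Resolve duplicate (ID, hour) pairs by keeping the last task listed.
wide = tidy.pivot_table(index='ID', columns='hour', values='task', aggfunc='last')

# Add a column for every hour from 1 up to the largest value, then fill forward.
all_hours = range(1, int(tidy['hour'].max()) + 1)
result = wide.reindex(columns=all_hours).ffill(axis=1)
print(result)
```

Unlike the crosstab output, this table also has a column for hour 2, which never appears as a value in the data.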
Source: https://stackoverflow.com/questions/65974776/how-to-handle-valueerror-index-contains-duplicate-entries-using-df-pivot-or-pd