Question
Given the dataframe:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B'], 'col2': ['type1', 'type2', 'type1', 'type2', 'type1'], 'hour': ['18:03:30', '18:00:48', '18:13:46', '18:11:29', '18:06:31']})
col1 col2 hour
A type1 18:03:30 # Drop this row: (A type1) appears again with a later hour
A type2 18:00:48
A type1 18:13:46 # Keep this row: it has the latest hour for (A type1)
B type2 18:11:29
B type1 18:06:31
I want to drop duplicates based on col1 and col2 (e.g. row 0: A type1 and row 2: A type1), keeping only the row that has the latest hour (e.g. 18:13:46).
I tried using groupby to get a subset for each value of col1, and drop_duplicates to drop the duplicates in col2, but I need a way to pass the condition (latest hour).
Example code:
for key, grp in df.groupby('col1'):
    grp.drop_duplicates(subset='col2', keep="LATEST OF HOUR")  # placeholder, not a valid value for keep
Expected outcome:
col1 col2 hour
A type1 18:13:46
A type2 18:00:48
B type2 18:11:29
B type1 18:06:31
EDIT (adding context):
My original dataframe is larger; the solution also needs to work for:
col1 col2 other hour
A type1 h 18:03:30 # Drop this row: (A type1) appears again with a later hour
A type2 ss 18:00:48
A type1 ll 18:13:46 # Keep this row: it has the latest hour for (A type1)
B type2 mm 18:11:29
B type1 jj 18:06:31
It would still need to drop the row based on the hour.
Answer 1:
df.drop_duplicates(['col1', 'col2'], keep='last')
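Note that keep='last' is positional: it keeps whichever duplicate comes last in the frame and never looks at hour. The sketch below illustrates this on the question's data. The reversed copy `shuffled` and the helper `a_type1_hour` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': ['type1', 'type2', 'type1', 'type2', 'type1'],
    'hour': ['18:03:30', '18:00:48', '18:13:46', '18:11:29', '18:06:31'],
})

# In the question's data the latest (A, type1) row happens to come last,
# so keep='last' returns the wanted row.
as_given = df.drop_duplicates(['col1', 'col2'], keep='last')

# Hypothetical reordering: reverse the rows so the earlier hour comes last.
shuffled = df.iloc[::-1].reset_index(drop=True)
reordered = shuffled.drop_duplicates(['col1', 'col2'], keep='last')

def a_type1_hour(frame):
    """Return the hour kept for the (A, type1) pair."""
    mask = (frame['col1'] == 'A') & (frame['col2'] == 'type1')
    return frame.loc[mask, 'hour'].iloc[0]

print(a_type1_hour(as_given))   # hour kept when rows are in the original order
print(a_type1_hour(reordered))  # hour kept when rows are reversed
```

So this answer gives the wanted result only if the rows are already ordered by hour within each (col1, col2) group.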
Answer 2:
Following anky_91's comment I solved it like this:
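df.sort_values('hour').drop_duplicates(['col1', 'col2'], keep='last')
This sorts by the 'hour' column first, so that keep='last' retains the row with the latest hour in each (col1, col2) group. Comparing the strings works here because the times are zero-padded HH:MM:SS.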
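As a runnable sketch, here is the same approach applied to the extended data from the question's edit. It assumes hour is always a zero-padded HH:MM:SS string, so string order matches time order. The trailing sort_index() is an optional addition, not part of the original answer, that restores the original row order.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': ['type1', 'type2', 'type1', 'type2', 'type1'],
    'other': ['h', 'ss', 'll', 'mm', 'jj'],
    'hour': ['18:03:30', '18:00:48', '18:13:46', '18:11:29', '18:06:31'],
})

result = (
    df.sort_values('hour')                            # earliest first, latest last
      .drop_duplicates(['col1', 'col2'], keep='last') # keep the latest per pair
      .sort_index()                                   # optional: original row order
)
print(result)
```

Columns that are not in the subset, such as other, are carried along unchanged with whichever row is kept.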
Source: https://stackoverflow.com/questions/57807305/pandas-drop-duplicates-in-cola-keeping-row-based-on-condition-on-colb