Question
What I have:
df
Name |Vehicle
Dave |Car
Mark |Bike
Steve|Car
Dave |
Steve|
I want to drop duplicates from the Name column, but only where the corresponding value in the Vehicle column is null. I know I can use
df.drop_duplicates(subset=['Name'])
with keep='first' or keep='last', but what I am looking for is a way to drop duplicates from the Name column where the corresponding value of the Vehicle column is null. So basically, keep the Name if the Vehicle column is NOT null and drop the rest. If a name does not have a duplicate, then keep that row even if the corresponding value in Vehicle is null.
Many Thanks
Answer 1:
I think you need to chain two masks with bitwise AND (&), built with Series.notna and Series.duplicated:
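import pandas as pd
# Sample data from the question; None marks a missing Vehicle
df = pd.DataFrame({"Name": ["Dave", "Mark", "Steve", "Dave", "Steve"], "Vehicle": ["Car", "Bike", "Car", None, None]})
m1 = df['Vehicle'].notna()       # rows that have a Vehicle
m2 = ~df['Name'].duplicated()    # first occurrence of each Name
df1 = df[m1 & m2]
print(df1)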
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
If you want these operations done separately, first remove all rows with NaN in Vehicle and then remove duplicates, which avoids testing for duplicates among the NaN rows (if necessary):
df2 = df.dropna(subset=['Vehicle']).drop_duplicates('Name')
print(df2)
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
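Both snippets above drop every row whose Vehicle is null, including a name that appears only once, which the question asked to keep. A minimal sketch of one way to cover that case follows. The extra "Anna" rows are hypothetical and not part of the original data, and keeping only the first row for a name that has no vehicle at all is an assumption about the desired behaviour:

```python
import pandas as pd

# Hypothetical data: the question's rows plus "Anna", who has no vehicle at all.
df = pd.DataFrame({
    "Name": ["Dave", "Mark", "Steve", "Dave", "Steve", "Anna", "Anna"],
    "Vehicle": ["Car", "Bike", "Car", None, None, None, None],
})

# Rows that carry a real Vehicle value.
has_vehicle = df["Vehicle"].notna()
# True for every row of a name that has at least one non-null Vehicle.
name_has_vehicle = has_vehicle.groupby(df["Name"]).transform("any")
# First row of each name; only used for names with no vehicle at all.
first_of_name = ~df["Name"].duplicated()

# Keep rows with a vehicle, plus one null row per name that never has a vehicle.
result = df[has_vehicle | (~name_has_vehicle & first_of_name)]
print(result)
```

On this data the sketch keeps the three rows that have a vehicle and a single Anna row with a null Vehicle. Unlike drop_duplicates, it does not collapse a name that has several different non-null vehicles.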
Answer 2:
This will filter out both None and empty values (provided the name has at least one value that is neither None nor empty), keeping just the first Vehicle for each name after sorting in descending order:
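import pandas as pd
df = pd.DataFrame({"Name": ["Dave", "Mark", "Steve", "Dave", "Steve"], "Vehicle": ["Car", "Bike", "Car", None, ""]})
# Descending sort puts non-empty strings before "" and None; first() then takes the first non-null Vehicle per Name
res = df.sort_values("Vehicle", ascending=False).groupby("Name")["Vehicle"].first().reset_index()
print(res)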
Output:
    Name Vehicle
0   Dave     Car
1   Mark    Bike
2  Steve     Car
Source: https://stackoverflow.com/questions/59532750/drop-duplicate-if-the-value-in-another-column-is-null-pandas