I have a data frame with three string columns. I know that the only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to
A slightly clumsier but faster approach for larger datasets involves getting the counts for a column of interest, sorting the counts highest to lowest, and then de-duplicating on a subset to only retain the largest cases. The code example is following:
>>> import pandas as pd
>>> source = pd.DataFrame(
{
'Country': ['USA', 'USA', 'Russia', 'USA'],
'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short name': ['NY', 'New', 'Spb', 'NY']
}
)
>>> grouped_df = source\
.groupby(['Country','City','Short name'])[['Short name']]\
.count()\
.rename(columns={'Short name':'count'})\
.reset_index()\
.sort_values('count', ascending=False)\
.drop_duplicates(subset=['Country', 'City'])\
.drop('count', axis=1)
>>> print(grouped_df)
Country City Short name
1 USA New-York NY
0 Russia Sankt-Petersburg Spb