GroupBy pandas DataFrame and select most common value

梦谈多话 2020-11-22 07:59

I have a data frame with three string columns. I know that only one value in the third column is valid for each combination of the first two. To clean the data I have to

10 answers
  •  旧巷少年郎
    2020-11-22 08:45

    A slightly clumsier but faster approach for larger datasets is to count the occurrences of each value in the column of interest, sort the counts from highest to lowest, and then drop duplicates on the grouping columns so that only the most frequent value in each group remains. For example:

    >>> import pandas as pd
    >>> source = pd.DataFrame(
            {
                'Country': ['USA', 'USA', 'Russia', 'USA'], 
                'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                'Short name': ['NY', 'New', 'Spb', 'NY']
            }
        )
    >>> grouped_df = source\
            .groupby(['Country','City','Short name'])[['Short name']]\
            .count()\
            .rename(columns={'Short name':'count'})\
            .reset_index()\
            .sort_values('count', ascending=False)\
            .drop_duplicates(subset=['Country', 'City'])\
            .drop('count', axis=1)
    >>> print(grouped_df)
      Country              City Short name
    1     USA          New-York         NY
    0  Russia  Sankt-Petersburg        Spb
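
The same count, sort, and de-duplicate steps can be wrapped in a small helper. This is only a sketch: the function name `most_common_per_group` and its parameters are hypothetical, not part of pandas, and it uses `size()` in place of `count()` plus `rename`, which should give the same counts for this data.

```python
import pandas as pd


def most_common_per_group(df, keys, target):
    """Return one row per `keys` combination holding its most frequent `target` value.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Count how often each (keys..., target) combination occurs.
    counts = (
        df.groupby(keys + [target])
        .size()
        .reset_index(name="count")
    )
    return (
        # Highest counts first, so the first row seen per group is the most common.
        counts.sort_values("count", ascending=False, kind="mergesort")
        # Keep only the first (largest-count) row for each key combination.
        .drop_duplicates(subset=keys)
        .drop(columns="count")
        .reset_index(drop=True)
    )


source = pd.DataFrame(
    {
        "Country": ["USA", "USA", "Russia", "USA"],
        "City": ["New-York", "New-York", "Sankt-Petersburg", "New-York"],
        "Short name": ["NY", "New", "Spb", "NY"],
    }
)
result = most_common_per_group(source, ["Country", "City"], "Short name")
print(result)
```

If two values tie for the highest count within a group, only one of them is kept, so this suits data where a single value clearly dominates each group.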
    
