GroupBy pandas DataFrame and select most common value

梦谈多话 2020-11-22 07:59

I have a data frame with three string columns. I know that only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to

10 Answers
  •  执念已碎
    2020-11-22 08:47

    Formally, the correct answer is @eumiro's solution. The problem with @HYRY's solution is that when a group contains only distinct values, such as [1, 2, 3, 4], the result is wrong, i.e., you don't get the mode. Example:

    >>> import pandas as pd
    >>> df = pd.DataFrame(
    ...     {
    ...         'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
    ...         'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
    ...         'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40],
    ...     }
    ... )
    

    If you compute it the way @HYRY does, you obtain:

    >>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
            total  bla
    client            
    A           4   30
    B           4   40
    C           1   10
    D           3   30
    E           2   20
    

    This is clearly wrong (see client A, whose value should be 1 and not 4) because it cannot handle groups in which every value is unique. Note that A's total (4) and bla (30) do not even come from the same row.
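
To see why client A is ambiguous, it can help to look at the raw counts for that group. The sketch below rebuilds the same frame and counts A's total values; the variable names a_totals and counts are illustrative and not taken from the answers.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
        'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
        'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40],
    }
)

# Values of 'total' that belong to client A only
a_totals = df.loc[df['client'] == 'A', 'total']

# Count how often each value occurs within that group
counts = a_totals.value_counts()
print(counts)
```

All four counts are equal, so which value ends up at index[0] depends only on how value_counts() happens to order ties.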

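    Thus, the other solution is correct. When several values are tied, scipy.stats.mode returns the smallest one, which is why A comes out as 1: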

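    >>> import numpy as np
    >>> import scipy.stats
    >>> # np.atleast_1d covers both older SciPy (array result) and SciPy >= 1.11 (scalar result for 1-D input)
    >>> print(df.groupby(['client']).agg(lambda x: np.atleast_1d(scipy.stats.mode(x)[0])[0]))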
            total  bla
    client            
    A           1   10
    B           4   40
    C           1   10
    D           3   30
    E           2   20
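
If SciPy is not available, a pandas-only variant may give the same table. This is a sketch and not part of the original answers: it assumes Series.mode() returns all tied modes in ascending order, so taking the first entry picks the smallest value on ties, in line with the scipy.stats.mode result above. The helper name smallest_mode is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
        'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
        'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40],
    }
)


def smallest_mode(values: pd.Series):
    """Return the most common value; on ties, the first entry of Series.mode().

    Assumption: Series.mode() lists tied modes in ascending order,
    so the first entry is the smallest of them.
    """
    return values.mode().iloc[0]


# Apply the helper to every non-key column within each client group
result = df.groupby('client').agg(smallest_mode)
print(result)
```

Because the tie-break is explicit, the result does not depend on how value_counts() orders equally frequent values.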
    
