I have a data frame with three string columns. I know that the only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to
Formally, the correct answer is the @eumiro Solution. The problem of @HYRY solution is that when you have a sequence of numbers like [1,2,3,4] the solution is wrong, i. e., you don't have the mode. Example:
>>> import pandas as pd
>>> df = pd.DataFrame(
{
'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
}
)
If you compute like @HYRY you obtain:
>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
total bla
client
A 4 30
B 4 40
C 1 10
D 3 30
E 2 20
Which is clearly wrong (see the A value that should be 1 and not 4) because it can't handle with unique values.
Thus, the other solution is correct:
>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
total bla
client
A 1 10
B 4 40
C 1 10
D 3 30
E 2 20