Understanding the execution of DataFrame in Python

Posted by 独自空忆成欢 on 2020-03-03 07:07:05

Question


I am new to Python and I want to understand how execution takes place in a DataFrame. Let's try this with an example from a dataset on kaggle.com (Titanic: Machine Learning from Disaster). I wanted to replace the NaN values in the age column with the mean() for the respective sex, i.e. a NaN age for a man should be replaced by the mean of the men's ages, and a NaN age for a woman by the mean of the women's ages. I achieved this with the following line of code:

_data['new_age'] = _data['new_age'].fillna(_data.groupby('Sex')['Age'].transform('mean'))

My question is: while executing this code, how does the line know that a particular row belongs to a male, so that its NaN value should be replaced by the male mean(), and that a female row's NaN should be replaced by the female mean()?


Answer 1:


It's because of groupby + transform. When you group with an aggregation that returns one scalar per group, a normal groupby collapses the result to a single row for each unique grouping key.

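import numpy as np
import pandas as pd

np.random.seed(42)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'Sex': list('MFMMFFMMFM'),
                   'Age': np.random.choice([1, 10, 11, 13, np.nan], 10)},
                  index=list('ABCDEFGHIJ'))
df.groupby('Sex')['Age'].mean()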

#Sex
#F    10.5                # One F row
#M    11.5                # One M row
#Name: Age, dtype: float64

Using transform instead broadcasts this per-group result back onto the original index, so every row receives the mean of the group it belongs to.

df.groupby('Sex')['Age'].transform('mean')

#A    11.5  # Belonged to M
#B    10.5  # Belonged to F
#C    11.5  # Belonged to M
#D    11.5
#E    10.5
#F    10.5
#G    11.5
#H    11.5
#I    10.5
#J    11.5
#Name: Age, dtype: float64
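As a rough illustration of what the broadcast amounts to, the sketch below computes the per-group means and then looks each row's Sex up in them with map. This is not how pandas implements transform internally, only an equivalent way to arrive at the same values for this example. The Age values are typed in by hand from the table in this answer rather than regenerated with the random seed, and the names group_means, mapped and transformed are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Age values copied by hand from the example table in this answer
df = pd.DataFrame({'Sex': list('MFMMFFMMFM'),
                   'Age': [13, np.nan, 11, np.nan, np.nan, 10, 11, 11, 11, np.nan]},
                  index=list('ABCDEFGHIJ'))

# Step 1: one mean per group (a Series indexed by 'F' and 'M'; NaN is skipped)
group_means = df.groupby('Sex')['Age'].mean()

# Step 2: look up each row's group key in that Series, keeping the row labels A..J
mapped = df['Sex'].map(group_means)

# transform('mean') performs both steps in one call
transformed = df.groupby('Sex')['Age'].transform('mean')

# Compare the two results value by value
print(np.allclose(mapped, transformed))
```

Either way, the result has one value per original row and carries the original row labels, which is what lets it line up with the Age column afterwards.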

To make it crystal clear, I'll assign the transformed result back. Because it shares the original index, .fillna lines it up with the Age column row by row, and now you can see how each NaN gets the correct mean.

df['Sex_mean'] = df.groupby('Sex')['Age'].transform('mean')

  Sex   Age  Sex_mean
A   M  13.0      11.5
B   F   NaN      10.5  # NaN will be filled with 10.5
C   M  11.0      11.5
D   M   NaN      11.5  # NaN will be filled with 11.5
E   F   NaN      10.5  # NaN will be filled with 10.5
F   F  10.0      10.5
G   M  11.0      11.5
H   M  11.0      11.5
I   F  11.0      10.5
J   M   NaN      11.5  # NaN will be filled with 11.5
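The last step, the fill itself, can be sketched as follows. When .fillna is given a Series, it matches values to rows by index label, not by position. To make that visible, this sketch reverses the order of the transformed means before filling; the reversal is purely illustrative and was not part of the original question. The Age values are again typed in from the table above, and sex_mean, reversed_means and filled are hypothetical names.

```python
import numpy as np
import pandas as pd

# Age values copied by hand from the example table in this answer
df = pd.DataFrame({'Sex': list('MFMMFFMMFM'),
                   'Age': [13, np.nan, 11, np.nan, np.nan, 10, 11, 11, 11, np.nan]},
                  index=list('ABCDEFGHIJ'))

# Per-row group mean, indexed A..J like the original frame
sex_mean = df.groupby('Sex')['Age'].transform('mean')

# Reverse the row order (J..A) to show that position does not matter
reversed_means = sex_mean.iloc[::-1]

# fillna aligns the replacement Series on the index labels,
# so each NaN still receives the mean of its own row's group
filled = df['Age'].fillna(reversed_means)
print(filled)
```

Rows that already had an age are left untouched. Only the NaN rows (B, D, E and J) take a value from the aligned Series.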


Source: https://stackoverflow.com/questions/60192232/understanding-the-execution-of-dataframe-in-python
