The result of dataframe.mean() is incorrect

僤鯓⒐⒋嵵緔 submitted on 2020-01-24 09:35:32

Question


I am working in Python 2.7. I have a data frame and want the average of the column called 'c', restricted to the rows where another column equals a given value. When I run the code, the mean is unexpected, but when I compute the median instead, the result is correct.

Why is the output of the mean incorrect?

The code is the following:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B',5, 6, np.nan]]), 
    columns=['a', 'b', 'c', 'd']
)
df
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()

median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()

The output:

df
Out[1]: 
   a  b  c    d
0  A  1  2    3
1  A  4  5  nan
2  A  7  8    9
3  B  3  2  nan
4  B  5  6  nan
5  B  5  6  nan
mean1
Out[2]: 86.0

mean2
Out[3]: 88.66666666666667

median1
Out[4]: 5.0

median2
Out[5]: 6.0

It is obvious that the output of the mean is incorrect.

Thanks.


Answer 1:


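Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame. For the 'B' rows, the strings '2', '6' and '6' are concatenated to '266', which is then divided by the row count of 3. The 'A' rows behave the same way: '258' divided by 3 gives 86.0.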


>>> df[df.a == 'B'].c
3    2
4    6
5    6
Name: c, dtype: object
>>> 266 / 3
88.66666666666667
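The arithmetic above can be reproduced without pandas. The following is only a sketch that mimics the effect with plain Python strings; it does not show what pandas does internally, and the variable names are made up for illustration.

```python
# Values of column 'c' for the rows where a == 'B', held as strings
values = ['2', '6', '6']

# "Summing" strings concatenates them instead of adding them
concatenated = ''.join(values)

# Reading the concatenated text as a number and dividing by the count
# reproduces the surprising result
wrong_mean = float(concatenated) / len(values)

# The intended mean uses the numeric values
right_mean = float(sum(int(v) for v in values)) / len(values)

print(concatenated)  # 266
print(wrong_mean)    # 88.66666666666667
print(right_mean)    # 4.666666666666667
```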

If you look at the dtypes of your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to the declaration of your numpy array. Arrays are not meant to contain heterogeneous types, so NumPy coerces every element to one common type, here a string type (which is why np.nan shows up as the text 'nan' in your output), and the DataFrame constructor then stores those strings in object columns. You can avoid this behavior by passing the constructor a list instead, which can hold differing dtypes with no issues.


df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B',5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)

df[df.a == 'B'].c.mean()

4.666666666666667

In [17]: df.dtypes
Out[17]:
a     object
b      int64
c      int64
d    float64
dtype: object
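To see where the strings come from, it can help to inspect the NumPy array on its own before it reaches pandas. This is a small sketch with a one-row array made up for illustration; the exact string width NumPy reports depends on the Python and NumPy versions, so only the dtype kind is checked.

```python
import numpy as np

# A single row mixing a string, an int and a float NaN
arr = np.array([['A', 1, np.nan]])

# NumPy picks one dtype for the whole array; with a string present,
# every element is converted to text
kind = arr.dtype.kind   # 'U' (unicode) on Python 3, 'S' (bytes) on Python 2
as_list = arr[0].tolist()

print(arr.dtype)
print(as_list)
```

The numbers and the NaN are already text at this point, so any DataFrame built from the array can only receive strings.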

I still can't imagine that concatenating strings inside mean() is intended, so I believe it's worth opening an issue report on the pandas development page. In general, though, you shouldn't be using object dtype Series for numeric calculations.
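If the data already sits in an object column and the frame cannot easily be rebuilt from a list, one option is to convert the column before aggregating. This is a minimal sketch using pd.to_numeric, which the original answer does not mention; the small frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose numeric column arrived as strings
df = pd.DataFrame({'a': ['B', 'B', 'B'], 'c': ['2', '6', '6']})

# Convert to a numeric dtype; errors='coerce' turns values that
# cannot be parsed into NaN instead of raising
df['c'] = pd.to_numeric(df['c'], errors='coerce')

mean_b = df[df.a == 'B'].c.mean()

print(df.dtypes)
print(mean_b)  # 4.666666666666667
```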



Source: https://stackoverflow.com/questions/55955242/the-result-of-dataframe-mean-is-incorrect
