I have a pandas dataframe with a column of real values that I want to zscore normalize:
>> a
array([ nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.195
Another solution is to fill the NaNs in the DataFrame with the column means before calculating the z-score. The filled entries then get a z-score of 0, which can be masked out using notna on the original df.
You can create a DataFrame of the same dimensions as the original df, containing the z-scores of the original df's values and NaNs in the same places in one line with:
zscore_df = pd.DataFrame(scipy.stats.zscore(df.fillna(df.mean())), index=df.index, columns=df.columns).where(df.notna())
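One thing worth checking with this approach: filling with the mean leaves the column mean unchanged, but it shrinks the population standard deviation, because the filled entries add zero deviation while still being counted. The sketch below uses made-up data (the columns and values are illustrative, not from the question) and compares the result with z-scores computed from the non-NaN values only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data (hypothetical, not from the question)
df = pd.DataFrame({"a": [np.nan, 1.0, 2.0, 3.0],
                   "b": [4.0, np.nan, 6.0, 8.0]})

# Fill NaNs with column means, z-score each column, then restore the NaNs
filled = df.fillna(df.mean())
zscore_df = pd.DataFrame(
    stats.zscore(filled.to_numpy()), index=df.index, columns=df.columns
).where(df.notna())

# Reference: z-scores based on the non-NaN values only (pandas skips NaN)
reference = (df - df.mean()) / df.std(ddof=0)

# Where both are non-zero, the two differ by sqrt(n / (n - k)),
# with n rows and k NaNs in the column
ratio = zscore_df / reference
print(zscore_df)
print(ratio)
```

With one NaN in four rows the filled version is larger in magnitude by a factor of sqrt(4/3), so the two approaches only agree when a column has no NaNs.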
Well, the pandas versions of mean and std will handle the NaN, so you could just compute it that way (to get the same result as scipy zscore, I think you need to use ddof=0 on std):
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
print(df)
a zscore
0 NaN NaN
1 0.0767 -1.148329
2 0.4383 0.071478
3 0.7866 1.246419
4 0.8091 1.322320
5 0.1954 -0.747912
6 0.6307 0.720512
7 0.6599 0.819014
8 0.1065 -1.047803
9 0.0508 -1.235699
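The same computation works on a whole DataFrame, since DataFrame.mean and DataFrame.std also skip NaN by default. A minimal sketch, assuming all columns are numeric (the helper name nan_zscore is hypothetical):

```python
import numpy as np
import pandas as pd

def nan_zscore(frame: pd.DataFrame, ddof: int = 0) -> pd.DataFrame:
    """Column-wise z-scores; NaNs are skipped in mean/std and stay NaN."""
    return (frame - frame.mean()) / frame.std(ddof=ddof)

# Values from the question
df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})
z = nan_zscore(df)
print(z)
```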
I am not sure when this parameter was introduced, since I have not been working with Python for long. But you can simply pass nan_policy='omit' and the NaNs are ignored in the calculation:
import numpy as np
from scipy import stats
a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
ZScore_a = stats.zscore(a, nan_policy='omit')
print(ZScore_a)
[nan -1.14832945 0.07147776 1.24641928 1.3223199 -0.74791154
0.72051236 0.81901449 -1.0478033 -1.23569949]
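Since the question is about a DataFrame column, here is a sketch of writing the result straight back into the frame. It assumes a SciPy version whose zscore accepts nan_policy, and it cross-checks against the NaN-skipping pandas computation with ddof=0.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Values from the question
df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})

# Pass a plain NumPy array; the returned array is aligned by position
df["zscore"] = stats.zscore(df["a"].to_numpy(), nan_policy="omit")

# Cross-check: pandas mean/std skip NaN by default
expected = (df["a"] - df["a"].mean()) / df["a"].std(ddof=0)
print(df)
```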
You could ignore nans using isnan.
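import numpy as np
import pandas
from scipy.stats import zscore
z = a.copy() # initialise array for zscores (a copy, so that a itself is not overwritten)
z[~np.isnan(a)] = zscore(a[~np.isnan(a)])
pandas.DataFrame({'a':a,'Zscore':z})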
Zscore a
0 NaN NaN
1 -1.148329 0.0767
2 0.071478 0.4383
3 1.246419 0.7866
4 1.322320 0.8091
5 -0.747912 0.1954
6 0.720512 0.6307
7 0.819014 0.6599
8 -1.047803 0.1065
9 -1.235699 0.0508
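A sketch of the same masking idea as a reusable function for 1-D input (the name masked_zscore is hypothetical). It writes into a fresh array, so the input values are left untouched:

```python
import numpy as np
from scipy.stats import zscore

def masked_zscore(values):
    """Return z-scores of the non-NaN entries; NaN positions stay NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)  # fresh array, input is not modified
    mask = ~np.isnan(values)
    out[mask] = zscore(values[mask])     # z-score computed on non-NaN entries only
    return out

# Values from the question
a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
z = masked_zscore(a)
print(z)
```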