How to z-score normalize a pandas column with NaNs?

不思量自难忘° 2020-12-06 00:53

I have a pandas dataframe with a column of real values that I want to zscore normalize:

>> a
array([    nan,  0.0767,  0.4383,  0.7866,  0.8091,  0.195         


        
4 answers
  • 2020-12-06 01:01

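    An alternative is to fill the NaNs with the column means before computing the z-score. The filled positions then get a z-score of 0, and they can be masked out again using notna on the original df. Note that the filled values leave the mean unchanged but shrink the standard deviation, so the remaining z-scores come out slightly larger in magnitude than those computed from the non-NaN values alone.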

    You can create a DataFrame of the same dimensions as the original df, containing the z-scores of the original df's values and NaNs in the same places in one line with:

    zscore_df = pd.DataFrame(scipy.stats.zscore(df.fillna(df.mean())), index=df.index, columns=df.columns).where(df.notna())
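
    As a rough check of how the fill-with-mean approach compares with skipping the NaNs, here is a minimal sketch. It uses plain pandas arithmetic in place of scipy.stats.zscore (same formula with ddof=0); the column name "a" and the variable names are assumptions made for illustration, and the values are the ones from the question.

```python
import numpy as np
import pandas as pd

# Values from the question; the column name "a" is an assumption.
df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})

# Approach from this answer: fill NaNs with the column mean, z-score, re-mask.
filled = df.fillna(df.mean())
z_filled = ((filled - filled.mean()) / filled.std(ddof=0)).where(df.notna())

# Reference: pandas mean/std skip NaNs by default, so nothing is filled in.
z_skip = (df - df.mean()) / df.std(ddof=0)

# Filling keeps the mean but divides the squared deviations by n_total
# instead of n_valid, so the two results differ by a constant factor.
n_total = len(df)
n_valid = int(df["a"].notna().sum())
scale = np.sqrt(n_total / n_valid)
ratio = (z_filled["a"] / z_skip["a"]).dropna()
```

    If an exact match with a NaN-skipping z-score is needed, multiplying the filled result by 1/scale (or skipping the fill entirely) should give it.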
    
  • 2020-12-06 01:04

    The pandas versions of mean and std skip NaN by default, so you can compute the z-score directly. To get the same result as scipy.stats.zscore you need ddof=0 on std, because pandas defaults to ddof=1:

    df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
    print(df)
    
            a    zscore
    0     NaN       NaN
    1  0.0767 -1.148329
    2  0.4383  0.071478
    3  0.7866  1.246419
    4  0.8091  1.322320
    5  0.1954 -0.747912
    6  0.6307  0.720512
    7  0.6599  0.819014
    8  0.1065 -1.047803
    9  0.0508 -1.235699
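
    The same expression also works on a whole DataFrame at once, since the arithmetic is aligned column by column. A small sketch, assuming a hypothetical helper name zscore_skipna and a made-up second column "b" (only column "a" comes from the question):

```python
import numpy as np
import pandas as pd

def zscore_skipna(frame, ddof=0):
    """Column-wise z-scores; NaNs are skipped in mean/std and stay NaN."""
    return (frame - frame.mean()) / frame.std(ddof=ddof)

df = pd.DataFrame({
    # values from the question
    "a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
          0.1954, 0.6307, 0.6599, 0.1065, 0.0508],
    # hypothetical second column, only for illustration
    "b": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0],
})

z = zscore_skipna(df)
print(z)
```

    Each column is normalized with its own mean and standard deviation, and the NaN positions of the input carry over unchanged.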
    
  • 2020-12-06 01:07

    I am not sure when this parameter was introduced, since I have not been working with Python for long, but you can simply pass nan_policy='omit' and the NaNs are ignored in the calculation:

    import numpy as np
    from scipy import stats

    a = np.array([np.nan,  0.0767,  0.4383,  0.7866,  0.8091,  0.1954,  0.6307, 0.6599, 0.1065,  0.0508])
    ZScore_a = stats.zscore(a, nan_policy='omit')
    
    print(ZScore_a)
    [nan -1.14832945  0.07147776  1.24641928  1.3223199  -0.74791154
    0.72051236  0.81901449 -1.0478033  -1.23569949]
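
    If the installed SciPy does not accept nan_policy for zscore, the same numbers can probably be obtained with NumPy alone. This is a sketch of a swapped-in technique (np.nanmean and np.nanstd), not what the answer above uses; the variable name z is an assumption.

```python
import numpy as np

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

# nanmean/nanstd ignore NaNs; np.nanstd uses ddof=0 by default,
# which is also the default of scipy.stats.zscore.
z = (a - np.nanmean(a)) / np.nanstd(a)
print(z)
```

    The NaN entry stays NaN because any arithmetic with NaN yields NaN.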
    
  • 2020-12-06 01:13

    You could exclude the NaNs with np.isnan and compute the z-score on the remaining values only.

    import numpy as np
    import pandas
    from scipy.stats import zscore

    z = a.copy()             # copy, so the original array a is not overwritten
    z[~np.isnan(a)] = zscore(a[~np.isnan(a)])
    pandas.DataFrame({'a':a,'Zscore':z})
    
         Zscore       a
    0       NaN     NaN
    1 -1.148329  0.0767
    2  0.071478  0.4383
    3  1.246419  0.7866
    4  1.322320  0.8091
    5 -0.747912  0.1954
    6  0.720512  0.6307
    7  0.819014  0.6599
    8 -1.047803  0.1065
    9 -1.235699  0.0508
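
    One thing to watch with the mask approach is that plain assignment of a NumPy array does not copy it, so writing z-scores into the result can silently overwrite the input. A minimal sketch of the difference, using a shortened version of the question's data and computing the z-score by hand with ddof=0 (the names alias, result and mask are assumptions):

```python
import numpy as np

a = np.array([np.nan, 0.0767, 0.4383, 0.7866])

alias = a            # no copy: alias and a share the same buffer
result = a.copy()    # independent buffer for the z-scores

mask = ~np.isnan(a)
valid = a[mask]      # boolean indexing returns a new array
z_valid = (valid - valid.mean()) / valid.std()   # ddof=0

result[mask] = z_valid   # a itself is left untouched

aliased = np.shares_memory(alias, a)       # writing to alias would change a
independent = not np.shares_memory(result, a)
```

    Starting from a copy (or from an all-NaN array of the same shape) keeps the original values available for the final DataFrame.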
    