Can anyone explain StandardScaler to me?

一整个雨季 2020-12-04 05:54

I am unable to understand the StandardScaler page in the sklearn documentation.

Can anyone explain this to me in simple terms?

9 answers
  •  醉话见心
    2020-12-04 06:50

    The answers above are great, but I wanted a simple example to settle a concern I have had in the past: I wanted to confirm that StandardScaler really does treat each column separately. I am now reassured, and I can no longer find the example that caused my concern. All columns ARE scaled separately, as described in the answers above.

    CODE

    import pandas as pd
    import scipy.stats as ss
    from sklearn.preprocessing import StandardScaler

    # Four rows, five columns with very different ranges.
    data = [[1, 1, 1, 1, 1], [2, 5, 10, 50, 100], [3, 10, 20, 150, 200], [4, 15, 40, 200, 300]]

    df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')

    # fit_transform returns a NumPy array, not a DataFrame,
    # so keep it under a separate name instead of overwriting df.
    sc_X = StandardScaler()
    scaled = sc_X.fit_transform(df)

    # Describe each scaled column on its own.
    for i in range(scaled.shape[1]):
        col = scaled[:, i]
        col_stats = ss.describe(col)
        print(col_stats)
    

    OUTPUT

    DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
    DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
    DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
    DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
    DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)
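
To make the "each column separately" point concrete, here is a minimal sketch that rebuilds the scaled values by hand and compares them with StandardScaler. It assumes the same data as above; the names `manual` and `sc` are only illustrative. Each column has its own mean subtracted and is divided by its own population standard deviation (`ddof=0`, the NumPy default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 1, 1, 1, 1],
                 [2, 5, 10, 50, 100],
                 [3, 10, 20, 150, 200],
                 [4, 15, 40, 200, 300]], dtype=float)

sc = StandardScaler()
scaled = sc.fit_transform(data)

# Per-column z-score: axis=0 means "one mean and one std per column".
# np.std defaults to ddof=0, i.e. the population standard deviation.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))               # True
print(np.allclose(sc.mean_, data.mean(axis=0)))  # fitted per-column means
print(np.allclose(sc.scale_, data.std(axis=0)))  # fitted per-column std devs
```

The fitted `mean_` and `scale_` attributes hold one value per column, which is another way to check that no statistic is shared across columns.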
    

    NOTE:

    The scipy.stats module is correctly reporting the "sample" variance, which uses (n - 1) in the denominator. The "population" variance uses n in the denominator. StandardScaler scales by the population standard deviation, so the sample variance of a scaled column comes out as n / (n - 1) = 4/3 ≈ 1.3333 here rather than exactly 1. To see this more clearly, the code below uses the scaled data from the first column of the data set above:

    CODE

    import scipy.stats as ss

    # Scaled values of the first column (N0), one value per row.
    sc_Data = [[-1.34164079], [-0.4472136], [0.4472136], [1.34164079]]
    col_stats = ss.describe([-1.34164079, -0.4472136, 0.4472136, 1.34164079])
    print(col_stats)
    print()

    n = len(sc_Data)  # number of observations (4)

    # Mean: sum of all values divided by n.
    mean_by_hand = 0
    for row in sc_Data:
        for element in row:
            mean_by_hand += element
    mean_by_hand /= n

    # Sum of squared deviations from the mean (shared by both variances).
    variance_by_hand = 0
    for row in sc_Data:
        for element in row:
            variance_by_hand += (mean_by_hand - element)**2

    # Sample variance divides by (n - 1); this is what scipy.stats reports.
    sample_variance_by_hand = variance_by_hand / (n - 1)
    sample_std_dev_by_hand = sample_variance_by_hand ** 0.5

    # Population variance divides by n; this is what StandardScaler uses.
    pop_variance_by_hand = variance_by_hand / n
    pop_std_dev_by_hand = pop_variance_by_hand ** 0.5

    print("Sample of Population Calcs:")
    print(mean_by_hand, sample_variance_by_hand, sample_std_dev_by_hand, '\n')
    print("Population Calcs:")
    print(mean_by_hand, pop_variance_by_hand, pop_std_dev_by_hand)
    

    OUTPUT

    DescribeResult(nobs=4, minmax=(-1.34164079, 1.34164079), mean=0.0, variance=1.3333333422778562, skewness=0.0, kurtosis=-1.36000000429325)
    
    Sample of Population Calcs:
    0.0 1.3333333422778562 1.1547005422523435
    
    Population Calcs:
    0.0 1.000000006708392 1.000000003354196
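
The same comparison can be sketched more compactly with NumPy's `ddof` argument, which sets the denominator to (n - ddof). This is only an illustration on the raw first column [1, 2, 3, 4]; the variable names are hypothetical and not part of the original answer.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # first column (N0) before scaling

# Scale the way StandardScaler does: subtract the mean and divide by
# the population standard deviation (ddof=0 is NumPy's default).
z = (x - x.mean()) / x.std()

pop_var = np.var(z)             # denominator n
sample_var = np.var(z, ddof=1)  # denominator n - 1

print(pop_var)     # approximately 1
print(sample_var)  # approximately 4/3
```

With only four observations the gap between the two is large (1 versus about 1.33); it shrinks as n grows, since the ratio is n / (n - 1).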
    
