I am unable to understand the page of the StandardScaler in the documentation of sklearn.
Can anyone explain this to me in simple terms?
The answers above are great, but I needed a simple example to alleviate some concerns that I have had in the past. I wanted to make sure it was indeed treating each column separately. I am now reassured and can't find what example had caused me concern. All columns ARE scaled separately as described by those above.
import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler
data= [[1, 1, 1, 1, 1],[2, 5, 10, 50, 100],[3, 10, 20, 150, 200],[4, 15, 40, 200, 300]]
df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')
sc_X = StandardScaler()
df = sc_X.fit_transform(df)
num_cols = len(df[0,:])
for i in range(num_cols):
col = df[:,i]
col_stats = ss.describe(col)
print(col_stats)
DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)
The scipy.stats module is correctly reporting the "sample" variance, which uses (n - 1) in the denominator. The "population" variance would use n in the denominator for the calculation of variance. To understand better, please see the code below that uses scaled data from the first column of the data set above:
import scipy.stats as ss
sc_Data = [[-1.34164079], [-0.4472136], [0.4472136], [1.34164079]]
col_stats = ss.describe([-1.34164079, -0.4472136, 0.4472136, 1.34164079])
print(col_stats)
print()
mean_by_hand = 0
for row in sc_Data:
for element in row:
mean_by_hand += element
mean_by_hand /= 4
variance_by_hand = 0
for row in sc_Data:
for element in row:
variance_by_hand += (mean_by_hand - element)**2
sample_variance_by_hand = variance_by_hand / 3
sample_std_dev_by_hand = sample_variance_by_hand ** 0.5
pop_variance_by_hand = variance_by_hand / 4
pop_std_dev_by_hand = pop_variance_by_hand ** 0.5
print("Sample of Population Calcs:")
print(mean_by_hand, sample_variance_by_hand, sample_std_dev_by_hand, '\n')
print("Population Calcs:")
print(mean_by_hand, pop_variance_by_hand, pop_std_dev_by_hand)
DescribeResult(nobs=4, minmax=(-1.34164079, 1.34164079), mean=0.0, variance=1.3333333422778562, skewness=0.0, kurtosis=-1.36000000429325)
Sample of Population Calcs:
0.0 1.3333333422778562 1.1547005422523435
Population Calcs:
0.0 1.000000006708392 1.000000003354196