Normalize PCA with scikit-learn when data is split

Question

I have a follow-up question to: How to normalize with PCA and scikit-learn.

I'm creating an emotion detection system and what I do now is:

1. Split the data by emotion (distributing the data over multiple subsets).
2. Add all the data together (combining the multiple subsets into one set).
3. Get the PCA parameters of the combined data (self.pca = RandomizedPCA(n_components=self.n_components, whiten=True).fit(self.data)).
4. Per emotion (per subset), apply PCA to the data of that emotion (subset).

I think I should normalize at two points: in step 2, normalizing all of the combined data, and in step 4, normalizing each subset.

Edit

I was wondering whether normalizing all of the data and normalizing a subset give the same result. When I tried to simplify my example at the suggestion of @BartoszKP, I realized that my understanding of how normalization works was wrong. Normalization works the same way in both cases, so this is a valid way to do it, right? (See the code.)

from sklearn.preprocessing import normalize
from sklearn.decomposition import RandomizedPCA
import numpy as np

data_1 = np.array(([52, 254], [4, 128]), dtype='f')
data_2 = np.array(([39, 213], [123, 7]), dtype='f')
data_combined = np.vstack((data_1, data_2))
#print(data_combined)
"""
Output
[[  52.  254.]
 [   4.  128.]
 [  39.  213.]
 [ 123.    7.]]
"""
#Normalize all data
data_norm = normalize(data_combined)
print(data_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]
 [ 0.18010448  0.98364753]
 [ 0.99838448  0.05681863]]
"""

pca = RandomizedPCA(n_components=20, whiten=True)
pca.fit(data_norm)

#Normalize subset of data
data_1_norm = normalize(data_1)
print(data_1_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]]
"""
pca.transform(data_1_norm)

Answer 1:

Yes. As explained in the documentation, normalize scales each sample individually, independently of the others:

Normalization is the process of scaling individual samples to have unit norm.
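To make "unit norm" concrete, here is a minimal sketch that rescales one row from the question by hand and compares it with normalize. It assumes the default L2 norm, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import normalize

# One sample (row) taken from the question's data.
row = np.array([[52.0, 254.0]])

# Manual L2 normalization: divide the row by its Euclidean length.
manual = row / np.linalg.norm(row, axis=1, keepdims=True)

# normalize() uses norm='l2' by default and works row by row.
by_sklearn = normalize(row)

print(manual)
print(np.allclose(manual, by_sklearn))
```

The printed row should agree with the first row of the normalized output shown in the question, since only the values in that row enter the computation.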

This is additionally explained in the documentation of the Normalizer class:

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.

(emphasis mine)
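One way to check this on the question's data is to normalize the combined array and the subsets separately, then compare the rows. The sketch below does that, fits PCA once on the combined data, and transforms each subset. It makes two assumptions that are not in the original code: it uses PCA(svd_solver="randomized") because RandomizedPCA was removed from later scikit-learn releases, and it sets n_components=2 because the toy data has only two features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

# Toy subsets from the question (one subset per emotion).
data_1 = np.array([[52, 254], [4, 128]], dtype=float)
data_2 = np.array([[39, 213], [123, 7]], dtype=float)
data_combined = np.vstack((data_1, data_2))

# Normalize the combined data and each subset separately.
combined_norm = normalize(data_combined)
data_1_norm = normalize(data_1)
data_2_norm = normalize(data_2)

# Each row is scaled on its own, so the results should agree row for row.
same_1 = np.allclose(combined_norm[:2], data_1_norm)
same_2 = np.allclose(combined_norm[2:], data_2_norm)
print(same_1, same_2)

# Assumption: PCA with the randomized solver stands in for RandomizedPCA,
# and n_components=2 is chosen to fit the two-feature toy data.
pca = PCA(n_components=2, whiten=True, svd_solver="randomized", random_state=0)
pca.fit(combined_norm)

# Project each subset with the PCA fitted on the combined data.
projected_1 = pca.transform(data_1_norm)
projected_2 = pca.transform(data_2_norm)
projected_all = pca.transform(combined_norm)
print(np.allclose(np.vstack((projected_1, projected_2)), projected_all))
```

Because transform applies the same fitted mean and components to every row, projecting the subsets one at a time should give the same result as projecting the combined array.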

Source: https://stackoverflow.com/questions/27646915/normalize-pca-with-scikit-learn-when-data-is-split

Tags

python

scikit-learn

pca