Normalize PCA with scikit-learn when data is split

Submitted by 余生长醉 on 2020-01-04 06:39:08

Question


I have a follow-up question to: How to normalize with PCA and scikit-learn.

I'm creating an emotion detection system and what I do now is:

  1. Split the data by emotion (distributing the data over multiple subsets).
  2. Add all the data together (the multiple subsets into one set).
  3. Get the PCA parameters of the combined data (self.pca = RandomizedPCA(n_components=self.n_components, whiten=True).fit(self.data)).
  4. Per emotion (per subset), apply the PCA to the data of that emotion (subset).

I should do the normalization at step 2 (normalize all the combined data) and at step 4 (normalize the subsets).

Edit

I was wondering whether normalizing all the data and normalizing a subset give the same result. When I tried to simplify my example at the suggestion of @BartoszKP, I realized that my understanding of how the normalization works was wrong. The normalization works the same way in both cases, so this is a valid way to do it, right? (see code)

from sklearn.preprocessing import normalize
# RandomizedPCA was removed in scikit-learn 0.20;
# PCA(svd_solver='randomized') is its replacement.
from sklearn.decomposition import PCA
import numpy as np

data_1 = np.array(([52, 254], [4, 128]), dtype='f')
data_2 = np.array(([39, 213], [123, 7]), dtype='f')
data_combined = np.vstack((data_1, data_2))
#print(data_combined)
"""
Output
[[  52.  254.]
 [   4.  128.]
 [  39.  213.]
 [ 123.    7.]]
"""
# Normalize all data (each row is scaled to unit L2 norm)
data_norm = normalize(data_combined)
print(data_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]
 [ 0.18010448  0.98364753]
 [ 0.99838448  0.05681863]]
"""

# n_components cannot exceed min(n_samples, n_features), which is 2 here
# (the original example used n_components=20)
pca = PCA(n_components=2, svd_solver='randomized', whiten=True)
pca.fit(data_norm)

# Normalize a subset of the data
data_1_norm = normalize(data_1)
print(data_1_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]]
"""
# Project the normalized subset with the PCA fitted on the combined data
pca.transform(data_1_norm)

Answer 1:


Yes. As explained in the documentation, normalize scales each sample independently of the others:

Normalization is the process of scaling individual samples to have unit norm.
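
As a minimal sketch of what that means for the default L2 norm, each row is divided by its own Euclidean length. The array reuses the values from the question; the manual NumPy computation is only an illustration and is not meant to mirror scikit-learn's internal code.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Values taken from the question's example
X = np.array([[52, 254], [4, 128], [39, 213], [123, 7]], dtype=float)

# Manual L2 normalization: divide each row by its own Euclidean norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

# scikit-learn's normalize (norm='l2' is the default)
X_sklearn = normalize(X)

# Both approaches should agree, and every row should have norm 1
print(np.allclose(X_manual, X_sklearn))
print(np.linalg.norm(X_sklearn, axis=1))
```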

This is additionally explained in the documentation of the Normalizer class:

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.

(emphasis mine)
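
To check this against the setup in the question, here is a short sketch: normalizing the combined array and then slicing it should give the same rows as normalizing each subset separately, so a PCA fitted on the combined data receives identical inputs either way. Note the assumptions: PCA stands in for RandomizedPCA, which is no longer available in recent scikit-learn releases, and n_components=2 is chosen only because the toy data has two features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

data_1 = np.array([[52, 254], [4, 128]], dtype=float)
data_2 = np.array([[39, 213], [123, 7]], dtype=float)
data_combined = np.vstack((data_1, data_2))

# Normalize everything at once, then compare with per-subset normalization
combined_norm = normalize(data_combined)
same_1 = np.allclose(combined_norm[:2], normalize(data_1))
same_2 = np.allclose(combined_norm[2:], normalize(data_2))
print(same_1, same_2)

# Fit PCA on the combined data (assumption: PCA replaces RandomizedPCA)
pca = PCA(n_components=2, whiten=True).fit(combined_norm)

# Projecting a separately normalized subset matches projecting
# the corresponding rows of the normalized combined data
proj_from_subset = pca.transform(normalize(data_1))
proj_from_combined = pca.transform(combined_norm)[:2]
print(np.allclose(proj_from_subset, proj_from_combined))
```

Because the scaling is row-wise, it does not matter whether the subsets are normalized before or after they are combined; only the PCA fit itself depends on which rows are included.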



Source: https://stackoverflow.com/questions/27646915/normalize-pca-with-scikit-learn-when-data-is-split
