PCA can be defined in many ways; the three most common are:
First, PCA can be understood as a dimensionality-reduction method: it reduces the number of variables while retaining as much of the dispersion (variance) information as possible and eliminates the linear correlation between variables;
Second, PCA can be understood as an orthogonal projection of the data onto new directions such that the variance of the projected points is maximized (Hotelling, 1933);
Third, it can also be understood as an orthogonal projection that minimizes the reconstruction loss, measured by the mean squared distance between the data points and their estimates (Pearson, 1901).
We consider these three definitions in turn.
1. Dimensionality Reduction
Here a "dimension" is simply a variable. When recording data, e.g. collecting information about a person, we may measure height, weight, chest circumference, and so on; these are the "variables". Each person's record, e.g. (173 cm, 65 kg, 887 mm), is called a sample, or data point.
PCA can be understood as doing two things at once: on the one hand, it reduces the number of variables while keeping as much of the dispersion information as possible; on the other hand, it eliminates the linear correlation between variables (in the example above, height, weight, and chest circumference are clearly positively correlated to some extent). As for why we want to do this, it touches on the origin of PCA; see A Tutorial on Principal Component Analysis (translation) [3].
The dispersion information is measured by the variance of a variable: the larger the variance, the more information it carries;
The linear correlation between variables is measured by the absolute value of their covariance: the larger the absolute value, the stronger the correlation; when the covariance is zero, the variables are (linearly) uncorrelated;
The idea of PCA is to form linear combinations of the original variables to obtain new variables, such that each new variable has as large a variance as possible and the covariance between different new variables is zero.
The detailed derivation follows.
Let $X$ be an $m$-dimensional random vector,
$$X=\begin{pmatrix}x_1\\x_2\\\vdots\\x_m\end{pmatrix},$$
and apply the linear transformation $PX=Y$, where $P$ is an $m\times m$ matrix,
$$P=(p_{ij})_{m\times m}=\begin{bmatrix}p_1^T\\p_2^T\\\vdots\\p_m^T\end{bmatrix}.$$
Then
$$\begin{bmatrix}y_1\\y_2\\\vdots\\y_m\end{bmatrix}=Y=PX=\begin{bmatrix}p_{11}x_1+p_{12}x_2+\cdots+p_{1m}x_m\\p_{21}x_1+p_{22}x_2+\cdots+p_{2m}x_m\\\vdots\\p_{m1}x_1+p_{m2}x_2+\cdots+p_{mm}x_m\end{bmatrix}=\begin{bmatrix}p_1^TX\\p_2^TX\\\vdots\\p_m^TX\end{bmatrix},$$
so each new variable $y_i$ is a linear combination of the original variables. Their variances and covariances are
$$\begin{aligned}Var(y_i)&=E[y_i-E(y_i)]^2\\&=E[(p_i^TX-E(p_i^TX))(p_i^TX-E(p_i^TX))^T]\\&=p_i^TE[(X-E(X))(X-E(X))^T]p_i\\&=p_i^TC_Xp_i,\\Cov(y_i,y_j)&=p_i^TC_Xp_j,\quad i,j=1,2,\cdots,m.\end{aligned}$$
Making $Var(y_i)$ arbitrarily large is trivial: simply scale $p_i$. That is meaningless and not what we want, so we restrict $p_i$ to be a unit vector, i.e. $\Vert p_i\Vert^2=p_i^Tp_i=1$. Our goal is to choose the $p_i$ so that each $Var(y_i)$ is as large as possible while $Cov(y_i,y_j)=0$ for $i\neq j$.
Some preparation first. Since $C_X$ is real symmetric and (assumed) positive definite, its eigenvalues are all positive, and there exists an orthogonal matrix $U$ such that
$$U^TC_XU=D,\qquad D=diag(\lambda_1,\lambda_2,\cdots,\lambda_m),$$
where $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_m>0$, $U=(u_1\,u_2\,\cdots\,u_m)$, and $u_i$ is a unit eigenvector for $\lambda_i$. Hence $C_X$ can be written as
$$C_X=UDU^T.$$
First, we make $Var(y_1)$ as large as possible:
$$Var(y_1)=p_1^TC_Xp_1=p_1^TUDU^Tp_1.$$
Write $z_1=U^Tp_1=(z_{11},z_{12},\cdots,z_{1m})^T$; then
$$\Vert z_1\Vert^2=z_{11}^2+z_{12}^2+\cdots+z_{1m}^2=z_1^Tz_1=p_1^TUU^Tp_1=p_1^Tp_1=1,$$
$$\begin{aligned}Var(y_1)&=z_1^TDz_1\\&=z_{11}^2\lambda_1+z_{12}^2\lambda_2+\cdots+z_{1m}^2\lambda_m\\&\leq z_{11}^2\lambda_1+z_{12}^2\lambda_1+\cdots+z_{1m}^2\lambda_1\\&=\lambda_1.\end{aligned}$$
Taking $z_1=(1,0,\cdots,0)^T$ attains equality, so $Var(y_1)$ achieves its maximum $\lambda_1$, with $p_1=Uz_1=u_1$.
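To make the inequality above concrete, here is a minimal numerical check (an illustration added here, not part of the original derivation): for an arbitrary symmetric positive definite matrix standing in for $C_X$, the top unit eigenvector attains the largest possible value of $p^TC_Xp$ among unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary symmetric positive definite matrix standing in for C_X.
A = rng.standard_normal((4, 4))
C = A @ A.T + 4 * np.eye(4)

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
lam1 = eigvals[-1]           # largest eigenvalue, lambda_1
u1 = eigvecs[:, -1]          # corresponding unit eigenvector, u_1

print("u_1^T C u_1 =", u1 @ C @ u1, " lambda_1 =", lam1)  # equal

# No random unit vector exceeds lambda_1 (up to floating-point error).
for _ in range(5):
    p = rng.standard_normal(4)
    p /= np.linalg.norm(p)
    assert p @ C @ p <= lam1 + 1e-10
print("every random unit p satisfies p^T C p <= lambda_1")
```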
Next, consider $p_2$. It must satisfy
$$\max\ Var(y_2)\qquad s.t.\ Cov(y_1,y_2)=0,$$
that is,
$$\max\ p_2^TC_Xp_2\qquad s.t.\ p_2^TC_Xp_1=0.$$
Again write $z_2=U^Tp_2=(z_{21},z_{22},\cdots,z_{2m})^T$; then
$$p_2^TC_Xp_1=p_2^TUDU^Tp_1=z_2^TDz_1=\lambda_1z_{21}=0,$$
so $z_{21}=0$, and
$$\Vert z_2\Vert^2=0+z_{22}^2+\cdots+z_{2m}^2=z_2^Tz_2=p_2^TUU^Tp_2=p_2^Tp_2=1,$$
$$\begin{aligned}Var(y_2)&=z_{22}^2\lambda_2+z_{23}^2\lambda_3+\cdots+z_{2m}^2\lambda_m\\&\leq z_{22}^2\lambda_2+z_{23}^2\lambda_2+\cdots+z_{2m}^2\lambda_2\\&=\lambda_2.\end{aligned}$$
Taking $z_2=(0,1,0,\cdots,0)^T$ attains equality, so $Var(y_2)$ achieves its maximum $\lambda_2$ while satisfying $Cov(y_1,y_2)=0$, with $p_2=Uz_2=u_2$.
Proceeding in the same way, to find $p_i$ we require
$$\max\ Var(y_i)\qquad s.t.\ Cov(y_j,y_i)=0,\ j=1,2,\cdots,i-1,$$
which gives maximum $Var(y_i)=\lambda_i$, attained at $p_i=u_i$.
In summary, $p_i=u_i$, i.e.
$$P=\begin{bmatrix}p_1^T\\p_2^T\\\vdots\\p_m^T\end{bmatrix}=\begin{bmatrix}u_1^T\\u_2^T\\\vdots\\u_m^T\end{bmatrix}=U^T,\qquad Y=U^TX.$$
At this point
$$\begin{aligned}C_Y&=\begin{bmatrix}Cov(y_1,y_1)&Cov(y_1,y_2)&\cdots&Cov(y_1,y_m)\\Cov(y_2,y_1)&Cov(y_2,y_2)&\cdots&Cov(y_2,y_m)\\\vdots&\vdots&\ddots&\vdots\\Cov(y_m,y_1)&Cov(y_m,y_2)&\cdots&Cov(y_m,y_m)\end{bmatrix}\\&=\begin{bmatrix}\lambda_1&0&\cdots&0\\0&\lambda_2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\lambda_m\end{bmatrix}\\&=D,\end{aligned}$$
i.e. $C_Y$ is diagonal, with the eigenvalues of $C_X$ (in descending order) on its diagonal.
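The conclusion can be checked with a short simulation (a sketch added for illustration; the covariance values below are made up): draw samples of a correlated random vector $X$, apply $Y=U^TX$ with $U$ taken from the eigendecomposition of $C_X$, and observe that the sample covariance of $Y$ is approximately $D$.

```python
import numpy as np

rng = np.random.default_rng(1)

# An example covariance matrix C_X for a 3-dimensional random vector X.
C_X = np.array([[4.0, 2.0, 1.0],
                [2.0, 3.0, 0.5],
                [1.0, 0.5, 2.0]])

# Eigendecomposition C_X = U D U^T, eigenvalues sorted in descending order.
eigvals, U = np.linalg.eigh(C_X)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Draw n samples of X ~ N(0, C_X); each column is one sample.
n = 200_000
X = np.linalg.cholesky(C_X) @ rng.standard_normal((3, n))

# Transform to Y = U^T X and estimate the covariance of Y.
Y = U.T @ X
Yc = Y - Y.mean(axis=1, keepdims=True)
C_Y = Yc @ Yc.T / n

print(np.round(C_Y, 2))      # approximately diag(lambda_1, lambda_2, lambda_3)
print(np.round(eigvals, 2))  # eigenvalues of C_X in descending order
```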
One can also find the $p_i$ by the method of Lagrange multipliers, but it is less intuitive than the approach above; interested readers may consult the references or work it out themselves.
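For reference, a brief sketch of that route (not spelled out in the original): to maximize $p^TC_Xp$ subject to $p^Tp=1$, form the Lagrangian and set its gradient with respect to $p$ to zero,
$$L(p,\lambda)=p^TC_Xp-\lambda(p^Tp-1),\qquad \frac{\partial L}{\partial p}=2C_Xp-2\lambda p=0\ \Longrightarrow\ C_Xp=\lambda p,$$
so every stationary point is an eigenvector of $C_X$, and at such a point the objective equals $p^TC_Xp=\lambda p^Tp=\lambda$; the maximum is therefore attained at an eigenvector of the largest (remaining) eigenvalue.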
2. Maximum Variance Interpretation
The aim of PCA is to find a new set of basis vectors that are mutually orthogonal and such that the variance of the data points projected onto each basis direction is maximized. Denote the orthonormal basis we seek by $\lbrace p_1,p_2,\cdots,p_m\rbrace$, with $p_i^Tp_j=\delta_{ij}$. (What matters here is the direction of each basis vector, not its length, so we take the basis vectors to be unit vectors.)
Let $X$ be an $m\times n$ data matrix: each row is a variable and each column is a data point (sample). In the example from the previous part,
$$X=\begin{bmatrix}173&159\\65&55\\887&853\end{bmatrix},$$
where the rows are the variables height, weight, and chest circumference, each column is one person's data, and $m=3$, $n=2$.
Partition $X$ by columns, $X=(x_1\,x_2\,\cdots\,x_n)$, so $x_j$ is a data point and $p_i^Tx_j$ is the projection of $x_j$ onto the direction $p_i$. Writing $\overline{x}=\frac{1}{n}\sum_{j=1}^nx_j$ for the sample mean, the variance of the projections of the data points onto $p_i$ is
$$\begin{aligned}Var(i)&=\frac{1}{n}\sum_{j=1}^n(p_i^Tx_j-p_i^T\overline{x})^2\\&=\frac{1}{n}\sum_{j=1}^n\big(p_i^T(x_j-\overline{x})\big)^2\\&=\frac{1}{n}\sum_{j=1}^np_i^T(x_j-\overline{x})(x_j-\overline{x})^Tp_i\\&=p_i^T\Big(\frac{1}{n}\sum_{j=1}^n(x_j-\overline{x})(x_j-\overline{x})^T\Big)p_i\\&=p_i^TC_Xp_i.\end{aligned}$$
First we maximize $Var(1)$. This $Var(1)$ has the same form as $Var(y_1)$ in the previous part, so the same argument gives the same result: $p_1=u_1$, the unit eigenvector of the largest eigenvalue $\lambda_1$.
Next, we find $p_2$ so that $Var(2)$ is maximized subject to $p_2^Tp_1=0$. A careful reader will notice that the previous part required $p_2^TC_Xp_1=0$ instead, which looks different. But is it really?
$$\begin{aligned}&\because\ C_Xp_1=\lambda_1p_1\ (\lambda_1>0)\\&\therefore\ p_2^TC_Xp_1=\lambda_1p_2^Tp_1\\&\therefore\ p_2^TC_Xp_1=0\iff p_2^Tp_1=0.\end{aligned}$$
So the two conditions really are the same.
Likewise, the subsequent constraints coincide as well: $p_i^TC_Xp_j=0\iff p_i^Tp_j=0$.
When $p_i=u_i$, the unit eigenvector of $\lambda_i$, the variance of the data projected onto $p_i$ attains its maximum $\lambda_i$, and the basis vectors are mutually orthogonal. So the orthonormal basis that PCA looks for is exactly the set of orthonormal eigenvectors of the covariance matrix $C_X$.
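As a quick sanity check (an illustrative sketch with synthetic data, not from the original text), the snippet below forms the sample covariance $C_X=\frac{1}{n}\sum_j(x_j-\overline{x})(x_j-\overline{x})^T$ from a data matrix whose columns are samples, and verifies that the variances of the projections onto its orthonormal eigenvectors equal the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data matrix: m = 3 variables (rows), n = 1000 samples (columns).
m, n = 3, 1000
X = rng.standard_normal((m, n)) * np.array([[3.0], [2.0], [0.5]])
X[1] += 0.8 * X[0]                      # introduce some correlation

# Sample covariance with the 1/n convention used in the text.
x_bar = X.mean(axis=1, keepdims=True)
C_X = (X - x_bar) @ (X - x_bar).T / n

# Orthonormal eigenvectors, eigenvalues sorted in descending order.
eigvals, U = np.linalg.eigh(C_X)
eigvals, U = eigvals[::-1], U[:, ::-1]

# The variance of the projections onto u_i equals lambda_i.
proj = U.T @ (X - x_bar)                # projections of the centered data
print(np.round(proj.var(axis=1), 3))    # per-direction variances (1/n convention)
print(np.round(eigvals, 3))             # eigenvalues of C_X
```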
3. Minimum Mean Squared Error Interpretation
Partition the data matrix $X_{m\times n}$ by columns, one data point per column:
$$X=(x_1\,x_2\,\cdots\,x_n).$$
Now re-express each data point in an orthonormal basis $\lbrace p_1,p_2,\cdots,p_m\rbrace$:
$$x_i=a_{i1}p_1+a_{i2}p_2+\cdots+a_{im}p_m,\qquad i=1,2,\cdots,n.$$
Since the basis is orthonormal, taking the inner product of both sides with $p_j$ immediately gives $a_{ij}=x_i^Tp_j$.
Consider the $d$-dimensional ($d<m$) subspace $V_d=span\lbrace p_1,p_2,\cdots,p_d\rbrace$. Our aim is to re-express the samples in $V_d$ while keeping the "loss" as small as possible. Suppose each data point is approximated by
$$\widetilde{x}_i=\sum_{j=1}^db_{ij}p_j+\sum_{j=d+1}^mz_jp_j,$$
where the $b_{ij}$ depend on the data point but the $z_j$ do not. Take the loss function to be the mean squared distance between the data points and their approximations,
$$J=\frac{1}{n}\sum_{i=1}^n\Vert x_i-\widetilde{x}_i\Vert^2.$$
To minimize this loss we are free to choose the $b_{ij}$, the $z_j$, and the basis $\lbrace p_j\rbrace$.
First consider the $b_{ij}$:
$$x_i-\widetilde{x}_i=\sum_{j=1}^d(a_{ij}-b_{ij})p_j+\sum_{j=d+1}^m(a_{ij}-z_j)p_j.$$
Since $\lbrace p_1,p_2,\cdots,p_m\rbrace$ is an orthonormal basis,
$$\Vert x_i-\widetilde{x}_i\Vert^2=\sum_{j=1}^d(a_{ij}-b_{ij})^2+\sum_{j=d+1}^m(a_{ij}-z_j)^2,$$
$$J=\frac{1}{n}\sum_{i=1}^n\Vert x_i-\widetilde{x}_i\Vert^2=\frac{1}{n}\sum_{i=1}^n\Big(\sum_{j=1}^d(a_{ij}-b_{ij})^2+\sum_{j=d+1}^m(a_{ij}-z_j)^2\Big).$$
Choosing the $b_{ij}$ to minimize $J$, it is clear that $J$ is smallest when $b_{ij}=a_{ij}=x_i^Tp_j$, in which case
$$J=\frac{1}{n}\sum_{i=1}^n\sum_{j=d+1}^m(a_{ij}-z_j)^2=\frac{1}{n}\sum_{j=d+1}^m\sum_{i=1}^n(a_{ij}-z_j)^2.$$
The same result follows by differentiating with respect to $b_{ij}$ and setting the partial derivatives to zero.
Next consider the $z_j$:
$$\begin{aligned}J&=\frac{1}{n}\sum_{j=d+1}^m\sum_{i=1}^n(z_j^2-2z_ja_{ij}+a_{ij}^2)\\&=\frac{1}{n}\sum_{j=d+1}^m\Big(nz_j^2-2\Big(\sum_{i=1}^na_{ij}\Big)z_j+\sum_{i=1}^na_{ij}^2\Big).\end{aligned}$$
Differentiating with respect to $z_j$ and setting the derivative to zero,
$$\frac{\partial J}{\partial z_j}=2nz_j-2\sum_{i=1}^na_{ij}=0,$$
so
$$z_j=\frac{1}{n}\sum_{i=1}^na_{ij}=\frac{1}{n}\sum_{i=1}^nx_i^Tp_j=\Big(\frac{1}{n}\sum_{i=1}^nx_i^T\Big)p_j=\overline{x}^Tp_j.$$
At this point
$$\begin{aligned}J&=\frac{1}{n}\sum_{i=1}^n\sum_{j=d+1}^m(a_{ij}-z_j)^2\\&=\frac{1}{n}\sum_{i=1}^n\sum_{j=d+1}^m(x_i^Tp_j-\overline{x}^Tp_j)^2\\&=\frac{1}{n}\sum_{i=1}^n\sum_{j=d+1}^m\big((x_i-\overline{x})^Tp_j\big)\big((x_i-\overline{x})^Tp_j\big)\\&=\frac{1}{n}\sum_{j=d+1}^m\sum_{i=1}^np_j^T(x_i-\overline{x})(x_i-\overline{x})^Tp_j\\&=\sum_{j=d+1}^mp_j^T\Big(\frac{1}{n}\sum_{i=1}^n(x_i-\overline{x})(x_i-\overline{x})^T\Big)p_j\\&=\sum_{j=d+1}^mp_j^TC_Xp_j.\end{aligned}$$
Only one task remains: choose $p_1,p_2,\cdots,p_m$ so that $J$ is minimized while $p_i^Tp_j=\delta_{ij}$.
First take $d=m-1$. Using the same technique as in the first part (i.e. the fact that $C_X$ is orthogonally similar to a diagonal matrix), we find that $p_m$ is a unit eigenvector of the smallest eigenvalue $\lambda_m$, and the minimum of $J$ is $\lambda_m$.
Then, taking $d=m-2,m-3,\cdots,1$ in turn, we obtain:
$p_i$ is a unit eigenvector of the eigenvalue $\lambda_i$, $i=1,2,\cdots,m$, and the minimum of $J$ is
$$J_{min}=\sum_{j=d+1}^m\lambda_j.$$
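To close, a small numerical check of this result (an illustrative sketch with synthetic data, not from the original text): reconstruct each sample from its coordinates on the top $d$ eigenvectors, keeping the mean's coordinates $\overline{x}^Tp_j$ on the discarded directions as derived above, and verify that the mean squared error equals $\sum_{j=d+1}^m\lambda_j$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: m = 5 variables (rows), n = 2000 samples (columns); keep d = 2.
m, n, d = 5, 2000, 2
X = np.diag([5.0, 3.0, 2.0, 1.0, 0.5]) @ rng.standard_normal((m, n))
X = rng.standard_normal((m, m)) @ X     # mix the variables so C_X is not diagonal

# Sample covariance with the 1/n convention, eigenvalues in descending order.
x_bar = X.mean(axis=1, keepdims=True)
C_X = (X - x_bar) @ (X - x_bar).T / n
eigvals, U = np.linalg.eigh(C_X)
eigvals, U = eigvals[::-1], U[:, ::-1]

# Reconstruction: the data's own coordinates on p_1..p_d, and the mean's
# coordinates x_bar^T p_j on the remaining directions p_(d+1)..p_m.
P_d, P_rest = U[:, :d], U[:, d:]
X_tilde = P_d @ (P_d.T @ X) + P_rest @ (P_rest.T @ x_bar)

# Mean squared reconstruction error J vs. the sum of the discarded eigenvalues.
J = np.mean(np.sum((X - X_tilde) ** 2, axis=0))
print(J, eigvals[d:].sum())             # the two values agree (up to rounding)
```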
References:
[1] Shlens J. A Tutorial on Principal Component Analysis. arXiv preprint arXiv:1404.1100, 2014.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006.
[3] 周长宇. A Tutorial on Principal Component Analysis (translation). https://blog.csdn.net/zhouchangyu1221/article/details/103949967, 2020-01-22.
[4] 范金城, 梅长林. 数据分析 (Data Analysis) [M]. 2nd ed. Beijing: Science Press, 2018.