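Summary of MLE for the Multivariate Gaussian Distribution, Bayesian Class-Posterior Probability, and the Generative Approach to Linear Discriminant Analysis (LDA)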

Gaussian model

Given a $d$-dimensional random vector (pattern) $x$, that is, random variables $\{x^{(1)}, x^{(2)}, \ldots, x^{(d)}\}$, its Gaussian density is:

\begin{matrix} (1) & q (x; \mu, \Sigma) = \frac{1}{(2 \pi)^{\frac{d}{2}} \det (\Sigma)^{\frac{1}{2}}} \exp \left( - \frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right) \end{matrix}

where $\mu$ is the $d$-dimensional column vector of expectations and $\Sigma$ is the $d \times d$ variance-covariance matrix, that is:

\begin{matrix} (2) & \mu = E [x] = \int x \, q (x; \mu, \Sigma) \, d x \end{matrix}

\begin{matrix} (3) & \Sigma = V [x] = \int (x - \mu) (x - \mu)^{T} q (x; \mu, \Sigma) \, d x \end{matrix}

Assuming the $n$ samples $x_{1}, \ldots, x_{n}$ are i.i.d., the Gaussian log-likelihood is:

\begin{matrix} (4) & \log L (\mu, \Sigma) = - \frac{n d \log 2 \pi}{2} - \frac{n \log (\det (\Sigma))}{2} - \frac{1}{2} \sum_{i = 1}^{n} (x_{i} - \mu)^{T} \Sigma^{-1} (x_{i} - \mu) \end{matrix}
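The log-likelihood can be evaluated numerically as a sanity check. The following is a minimal sketch in NumPy, not code from the original post; the function name `gaussian_log_likelihood` and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_log_likelihood(X, mu, Sigma):
    """Log-likelihood of n i.i.d. rows of X under a Gaussian with mean mu and covariance Sigma."""
    n, d = X.shape
    diff = X - mu                                  # rows are x_i - mu
    # Solve Sigma z = (x_i - mu) instead of forming the inverse explicitly.
    solved = np.linalg.solve(Sigma, diff.T).T      # rows are Sigma^{-1} (x_i - mu)
    quad = np.sum(diff * solved)                   # sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu)
    _, logdet = np.linalg.slogdet(Sigma)           # numerically stable log(det(Sigma))
    return -0.5 * n * d * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

# Toy data (hypothetical): three 2-dimensional samples.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]])
ll = gaussian_log_likelihood(X, np.zeros(2), np.eye(2))
```

With zero mean and identity covariance the quadratic term is just the sum of squared norms of the samples, which makes the result easy to check by hand.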

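Taking the partial derivatives of (4) gives the likelihood equations:

\begin{matrix} (5) & \frac{\partial \log L}{\partial \mu} = - n \Sigma^{-1} \mu + \Sigma^{-1} \sum_{i = 1}^{n} x_{i} \end{matrix}

\begin{matrix} (6) & \frac{\partial \log L}{\partial \Sigma} = - \frac{n}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} \left( \sum_{i = 1}^{n} (x_{i} - \mu) (x_{i} - \mu)^{T} \right) \Sigma^{-1} \end{matrix}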

Setting (5) and (6) to zero gives the maximum likelihood estimators:

\begin{matrix} (7) & {\hat{\mu}}_{M L} = \frac{1}{n} \sum_{i = 1}^{n} x_{i} \end{matrix}

\begin{matrix} (8) & {\hat{\Sigma}}_{M L} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - {\hat{\mu}}_{M L}) (x_{i} - {\hat{\mu}}_{M L})^{T} \end{matrix}

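Here (7) and (8) are the sample mean and the sample covariance matrix. We assume there are enough training samples that ${\hat{\Sigma}}_{M L}$ is invertible.

Note: the sample covariance matrix here is biased (it divides by $n$ rather than $n - 1$); for why, see here.
Note: what does covariance represent? See here.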
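The sample mean and the biased sample covariance can be computed directly. This is a minimal NumPy sketch under the assumption that samples are stored as rows of an array; the helper name `gaussian_mle` and the data are illustrative, not from the original post.

```python
import numpy as np

def gaussian_mle(X):
    """Return the ML estimates (sample mean, biased sample covariance) for rows of X."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                    # sample mean
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / n      # divides by n, so this is the biased estimate
    return mu_hat, Sigma_hat

# Toy data (hypothetical): four 2-dimensional samples.
X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0], [7.0, 2.0]])
mu_hat, Sigma_hat = gaussian_mle(X)

# NumPy's own estimator agrees when asked for the biased (divide-by-n) version.
Sigma_numpy = np.cov(X, rowvar=False, bias=True)
```

Passing `bias=False` (the default) to `np.cov` would instead divide by $n - 1$ and give the unbiased estimate.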

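Now assume that the random variables $\{x^{(1)}, x^{(2)}, \ldots, x^{(d)}\}$ are mutually independent, so that every correlation coefficient is $\rho = 0$. (A zero correlation coefficient is a necessary but not sufficient condition for two variables to be independent. The correlation coefficient only measures the linear relationship between two variables; variables can be related in other ways, and in that case the correlation coefficient is not a suitable measure.) The covariance matrix is then:
$\begin{matrix} (9) & \Sigma = \mathrm{diag} ((\sigma^{(1)})^{2}, \ldots, (\sigma^{(d)})^{2}) \end{matrix}$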
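The remark that zero correlation does not imply independence can be checked on a small discrete example. The sketch below is an illustration chosen for this purpose (it is not from the original post): $x$ is uniform on $\{-1, 0, 1\}$ and $y = x^{2}$, so $y$ is fully determined by $x$ yet their covariance vanishes.

```python
from fractions import Fraction

# Joint support of (x, y) with y = x**2; each point has probability 1/3.
support = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)

mean_x = sum(p * x for x, _ in support)
mean_y = sum(p * y for _, y in support)
cov_xy = sum(p * (x - mean_x) * (y - mean_y) for x, y in support)

# Independence would require P(x=0, y=0) == P(x=0) * P(y=0).
p_joint = p          # P(x=0, y=0): only the point (0, 0) qualifies
p_product = p * p    # P(x=0) = 1/3 and P(y=0) = 1/3
```

The covariance is exactly zero while the joint probability differs from the product of the marginals, so the variables are uncorrelated but dependent.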

This gives the ML estimate of the $j$-th standard deviation:

\begin{matrix} (10) & {\hat{\sigma}}_{M L}^{(j)} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (x_{i}^{(j)} - {\hat{\mu}}_{M L}^{(j)})^{2}} \end{matrix}

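Further assuming that all $d$ variables share the same variance $\sigma^{2}$, the covariance matrix becomes $\Sigma = \sigma^{2} I$, and the ML estimate of $\sigma$ is:
$\begin{matrix} (11) & {\hat{\sigma}}_{M L} = \sqrt{\frac{1}{n d} \sum_{i = 1}^{n} (x_{i} - {\hat{\mu}}_{M L})^{T} (x_{i} - {\hat{\mu}}_{M L})} = \sqrt{\frac{1}{d} \sum_{j = 1}^{d} {({\hat{\sigma}}_{M L}^{(j)})}^{2}} \end{matrix}$
- The factor $\frac{1}{n d}$ works as follows: dividing by the $n$ samples leaves the sum of the $d$ per-variable variances, and since these are all assumed equal, dividing by $d$ gives the estimate of the common variance.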
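The per-variable and shared standard deviations can be computed side by side to confirm that the two forms of the shared estimate agree. This is a minimal NumPy sketch on hypothetical toy data; the variable names are illustrative.

```python
import numpy as np

# Toy data (hypothetical): n = 4 samples, d = 2 variables.
X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0], [7.0, 2.0]])
n, d = X.shape
centered = X - X.mean(axis=0)

# Per-variable ML standard deviations (diagonal covariance model).
sigma_j = np.sqrt((centered ** 2).mean(axis=0))

# Shared ML standard deviation: total squared deviation divided by n * d.
sigma_shared = np.sqrt((centered ** 2).sum() / (n * d))

# Equivalent form: root of the average of the per-variable variances.
sigma_shared_alt = np.sqrt((sigma_j ** 2).mean())
```

Both expressions for the shared estimate give the same number because averaging over samples first and over variables second is the same as dividing the total by $n d$.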

Class-Posterior Probability: $\log p (y | x)$ and Class-Prior Probability: $p (y)$

Assume that the class-conditional probability density $p (x | y)$ is Gaussian and that the samples are i.i.d. The derivation above then gives the MLE of the parameters of $p (x | y)$:

\begin{matrix} (1) & {\hat{\mu}}_{y} = \frac{1}{n_{y}} \sum_{i : y_{i} = y} x_{i} \end{matrix}

\begin{matrix} (2) & {\hat{\Sigma}}_{y} = \frac{1}{n_{y}} \sum_{i : y_{i} = y} (x_{i} - {\hat{\mu}}_{y}) (x_{i} - {\hat{\mu}}_{y})^{T} \end{matrix}

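where $n_{y}$ is the number of training samples that belong to class $y$, and ${\hat{\mu}}_{y}$ and ${\hat{\Sigma}}_{y}$ are the mean and covariance matrix of the training samples that belong to class $y$.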
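The per-class estimates amount to running the Gaussian MLE separately on each class. The following is a minimal NumPy sketch, assuming samples are rows of `X` and labels are integers in `y`; the function name `class_conditional_mle` and the data are hypothetical.

```python
import numpy as np

def class_conditional_mle(X, y):
    """Return {label: (n_y, mu_hat_y, Sigma_hat_y)} estimated separately per class."""
    params = {}
    for label in np.unique(y):
        X_c = X[y == label]                     # samples with y_i == label
        n_c = X_c.shape[0]
        mu_c = X_c.mean(axis=0)
        centered = X_c - mu_c
        Sigma_c = centered.T @ centered / n_c   # biased (divide-by-n_y) covariance
        params[label.item()] = (n_c, mu_c, Sigma_c)
    return params

# Toy data (hypothetical): four samples of class 0, two of class 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],
              [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1])
params = class_conditional_mle(X, y)
```

Class 1 has only two samples here, so its estimated covariance is singular; this is the situation that the "enough training samples" assumption is meant to rule out.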

Then, by Bayes' formula:

\begin{matrix} (3) & p (y | x) = \frac{p (x | y) p (y)}{p (x)} \end{matrix}

we obtain:

\begin{matrix} (4) & \log p (y | x) = \log p (x | y) + \log p (y) - \log p (x) \end{matrix}

where the simplest estimate of the prior of class $y$ is its proportion among the samples:

\begin{matrix} (5) & \hat{p} (y) = \frac{n_{y}}{n} \end{matrix}
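The prior estimate is just a relative frequency. A minimal sketch using only the standard library follows; the function name `class_priors` and the label list are illustrative assumptions.

```python
from collections import Counter

def class_priors(labels):
    """Estimate p(y) as n_y / n from a sequence of class labels."""
    counts = Counter(labels)     # n_y for every class y
    n = len(labels)
    return {label: count / n for label, count in counts.items()}

# Toy labels (hypothetical): five samples of class 0 and three of class 1.
priors = class_priors([0, 0, 0, 1, 1, 0, 1, 0])
```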

Substituting (1), (2), and (5) into (4) gives:

\begin{matrix} (6) & \log \hat{p} (y | x) = - \frac{1}{2} (x - {\hat{\mu}}_{y})^{T} {\hat{\Sigma}}_{y}^{-1} (x - {\hat{\mu}}_{y}) - \frac{1}{2} \log (\det ({\hat{\Sigma}}_{y})) + \log \frac{n_{y}}{n} + C \end{matrix}

where $C$ is a constant that does not depend on $y$.
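Because $C$ is the same for every class, a classifier only needs to compare the remaining terms of the log-posterior. The sketch below scores one point against two classes; the function name `log_posterior_score` and all parameter values are hypothetical, chosen only to illustrate the formula.

```python
import numpy as np

def log_posterior_score(x, mu, Sigma, prior):
    """Class log-posterior up to the class-independent constant C."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(Sigma)         # log(det(Sigma))
    return -0.5 * quad - 0.5 * logdet + np.log(prior)

# Hypothetical per-class parameters: (mean, covariance, prior).
params = {
    0: (np.array([0.0, 0.0]), np.eye(2), 0.5),
    1: (np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.5),
}

x = np.array([2.5, 2.0])
scores = {label: log_posterior_score(x, *p) for label, p in params.items()}
predicted = max(scores, key=scores.get)          # class with the largest score
```

Exponentiating the scores and normalizing them to sum to 1 would recover the posterior probabilities themselves, since the dropped constant cancels in the ratio.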

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant.

LDA explicitly attempts to model the difference between the classes of data. PCA, on the other hand, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.

Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure.

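Suppose the training samples are $\{\vec{x}, y\}$, where $y \in \{0, 1\}$ labels the two classes. LDA assumes that the class-conditional densities $p (\vec{x} | y = 0)$ and $p (\vec{x} | y = 1)$ are both normal, with parameters $({\vec{\mu}}_{0}, \Sigma_{0})$ and $({\vec{\mu}}_{1}, \Sigma_{1})$ respectively. Applying formula (6) of the Bayesian section to each class and subtracting the class-0 expression from the class-1 expression, a point is assigned to class 1 rather than class 0 when:

\begin{matrix} \frac{1}{2} (\vec{x} - {\vec{\mu}}_{0})^{T} \Sigma_{0}^{-1} (\vec{x} - {\vec{\mu}}_{0}) - \frac{1}{2} (\vec{x} - {\vec{\mu}}_{1})^{T} \Sigma_{1}^{-1} (\vec{x} - {\vec{\mu}}_{1}) + \frac{1}{2} \log \frac{\det (\Sigma_{0})}{\det (\Sigma_{1})} + \log \frac{p (y = 1)}{p (y = 0)} > 0 \end{matrix}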

* Without any further assumptions, the resulting classifier is referred to as QDA (quadratic discriminant analysis).
* Definition of quadratic discriminant analysis (QDA):
- QDA is closely related to linear discriminant analysis (LDA), where it is assumed that the measurements from each class are normally distributed. Unlike LDA, however, in QDA there is no assumption that the covariance of each of the classes is identical. When the normality assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test.

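LDA makes two further assumptions:
* Homoscedasticity: In statistics, a sequence or a vector of random variables is homoscedastic if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance.
In other words, the random variables all share the same variance; in the LDA setting this means the class covariances are identical, $\Sigma_{0} = \Sigma_{1} = \Sigma$.
* Full rank: within each class, no random variable can be written as a linear combination of the others, because
- The determinant of a positive semidefinite matrix is non-negative. A covariance matrix is positive semidefinite, so its determinant cannot be negative. If one random variable is a linear combination of the others, the determinant is zero. Requiring full rank therefore means that the random variables within a class are linearly independent. This matters because the determinant appears in the denominator of the normalizing term of the density, which is what makes the density integrate to 1.
- A symmetric matrix $A$ whose entries are all real is Hermitian, so the covariance matrix of every class in the sample is Hermitian. A Hermitian matrix is normal, so it can be unitarily diagonalized, and the entries of the resulting diagonal matrix are all real. This means that the eigenvalues of a Hermitian matrix are real and that eigenvectors belonging to distinct eigenvalues are orthogonal, so an orthogonal basis of $\mathbb{C}^{d}$ (of $\mathbb{R}^{d}$ for a real symmetric matrix) can be chosen from these eigenvectors.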
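Both points about full rank can be observed numerically. The sketch below builds hypothetical data in which the third variable is an exact linear combination of the first two, then inspects the rank, determinant, and eigendecomposition of the sample covariance; the data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Third column is a linear combination of the first two, so the covariance is rank deficient.
X = np.column_stack([a, b, a + 2.0 * b])

centered = X - X.mean(axis=0)
Sigma = centered.T @ centered / X.shape[0]

rank = np.linalg.matrix_rank(Sigma)       # number of linearly independent variables
det = np.linalg.det(Sigma)                # numerically zero for a singular covariance

# eigh is intended for symmetric/Hermitian matrices: it returns real eigenvalues
# in ascending order and orthonormal eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(Sigma)
```

The smallest eigenvalue is zero up to rounding error, which is why the Gaussian density (whose normalizer divides by the square root of the determinant) is undefined for such data.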

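With $\Sigma_{0} = \Sigma_{1} = \Sigma$, we have

\begin{matrix} \vec{x}^{T} \Sigma_{0}^{-1} \vec{x} = \vec{x}^{T} \Sigma_{1}^{-1} \vec{x}, & \vec{x}^{T} \Sigma^{-1} {\vec{\mu}}_{i} = {\vec{\mu}}_{i}^{T} \Sigma^{-1} \vec{x} \end{matrix}

so the terms quadratic in $\vec{x}$ and the log-determinant terms cancel in the difference of the two class log-posteriors, and the rule for choosing class 1 becomes

\vec{w} \cdot \vec{x} > c

where

\begin{matrix} \vec{w} = \Sigma^{-1} ({\vec{\mu}}_{1} - {\vec{\mu}}_{0}), & c = \frac{1}{2} \vec{w} \cdot ({\vec{\mu}}_{0} + {\vec{\mu}}_{1}) - \log \frac{p (y = 1)}{p (y = 0)} \end{matrix}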
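The linear rule $\vec{w} \cdot \vec{x} > c$ can be sketched in a few lines. The code below assumes the standard shared-covariance result $\vec{w} = \Sigma^{-1} ({\vec{\mu}}_{1} - {\vec{\mu}}_{0})$ with threshold $c = \frac{1}{2} \vec{w} \cdot ({\vec{\mu}}_{0} + {\vec{\mu}}_{1}) - \log \frac{p (y = 1)}{p (y = 0)}$; the function name `lda_rule` and all parameter values are hypothetical.

```python
import numpy as np

def lda_rule(mu0, mu1, Sigma, prior0, prior1):
    """Return (w, c) such that class 1 is chosen when w @ x > c."""
    w = np.linalg.solve(Sigma, mu1 - mu0)                   # Sigma^{-1} (mu1 - mu0)
    c = 0.5 * w @ (mu0 + mu1) - np.log(prior1 / prior0)     # threshold including the priors
    return w, c

# Hypothetical parameters with a shared covariance and equal priors.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[2.0, 0.0], [0.0, 1.0]])
w, c = lda_rule(mu0, mu1, Sigma, 0.5, 0.5)

x = np.array([1.5, 1.0])
is_class_1 = bool(w @ x > c)

# Cross-check: the difference of the two quadratic log-posterior scores equals w @ x - c.
def quad(v):
    return v @ np.linalg.solve(Sigma, v)

margin_quadratic = -0.5 * quad(x - mu1) + 0.5 * quad(x - mu0)
margin_linear = w @ x - c
```

The cross-check at the end shows that, once the covariances are shared, the quadratic comparison and the linear threshold give the same margin.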
