PCA

Performing PCA on large sparse matrix by using sklearn

坚强是说给别人听的谎言 Submitted on 2019-11-28 22:57:32

Question: I am trying to apply PCA to a huge sparse matrix; the following link says that sklearn's RandomizedPCA can handle sparse matrices in scipy sparse format: Apply PCA on very large sparse matrix. However, I always get an error. Can someone point out what I am doing wrong? The input matrix X_train contains numbers in float64:

>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy
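For reference, in recent scikit-learn versions RandomizedPCA has been removed; the usual tool for reducing a large scipy sparse matrix is TruncatedSVD, which accepts sparse input directly because it skips mean-centering. A minimal sketch, with an invented shape and density standing in for X_train:

import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# A small random CSR matrix standing in for X_train (shape and density are invented).
X_train = sparse.random(10_000, 50_000, density=1e-4, format="csr",
                        random_state=0, dtype=np.float64)

# TruncatedSVD works on scipy sparse input without densifying the matrix.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X_train)
print(X_reduced.shape)  # (10000, 100)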

PCA first or normalization first?

左心房为你撑大大i Submitted on 2019-11-28 19:25:48

When doing regression or classification, what is the correct (or better) way to preprocess the data?

1. Normalize the data -> PCA -> training
2. PCA -> normalize PCA output -> training
3. Normalize the data -> PCA -> normalize PCA output -> training

Which of the above is more correct, or is there a "standardized" way to preprocess the data? By "normalize" I mean standardization, linear scaling, or some other technique.

Answer (Chris Taylor): You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:

>> C = [1 0.5; 0
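A minimal scikit-learn sketch of option 1 (standardize each feature, then PCA, then train); the dataset and model choices here are my own stand-ins:

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Option 1: standardize -> PCA -> training, wired up as a single pipeline
# so that the scaler and PCA are fit together with the classifier.
model = make_pipeline(StandardScaler(), PCA(n_components=5),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))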

Python scikit learn pca.explained_variance_ratio_ cutoff

依然范特西╮ Submitted on 2019-11-28 17:10:36

Question: Guru, when choosing the number of principal components k, we choose k to be the smallest value such that, for example, 99% of the variance is retained. However, in Python scikit-learn, I am not 100% sure that pca.explained_variance_ratio_ = 0.99 means that "99% of the variance is retained". Could anyone enlighten me? Thanks. The Python scikit-learn PCA manual is here: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

Answer 1: Yes, you are nearly right.
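To make the distinction concrete: explained_variance_ratio_ is a per-component array, and it is its cumulative sum that must reach 0.99. A sketch, using a stand-in dataset:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumulative >= 0.99)) + 1  # smallest k retaining >= 99% of variance
print(k)

# scikit-learn can also do this directly: a float n_components in (0, 1)
# keeps just enough components to explain that fraction of the variance.
pca99 = PCA(n_components=0.99).fit(X)
print(pca99.n_components_)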

Matlab Principal Component Analysis (eigenvalues order)

人走茶凉 Submitted on 2019-11-28 13:04:43

Question: I want to use MATLAB's princomp function, but it returns the eigenvalues in a sorted array, so I can't find out which column corresponds to which eigenvalue. For MATLAB,

m = [1,2,3;4,5,6;7,8,9];
[pc,score,latent] = princomp(m);

is the same as

m = [2,1,3;5,4,6;8,7,9];
[pc,score,latent] = princomp(m);

That is, swapping the first two columns does not change anything. The resulting eigenvalues in latent will be (27,0,0). The information (which eigenvalue corresponds to which
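This behavior is expected: the eigenvalues belong to the principal components, not to the input columns, and the link back to the columns lives in the loadings matrix pc. A numpy sketch of the same experiment (not princomp itself, and with invented data), assuming the usual covariance-eigendecomposition formulation:

import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.3])
m_swapped = m[:, [1, 0, 2]]              # swap the first two columns

def pca_eig(data):
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    order = np.argsort(vals)[::-1]       # sort descending, like princomp's latent
    return vals[order], vecs[:, order]

latent, pc = pca_eig(m)
latent_s, pc_s = pca_eig(m_swapped)
print(latent)      # eigenvalues are unchanged by the column swap...
print(latent_s)
print(pc[:, 0])    # ...while the loading vectors are permuted with the columns
print(pc_s[:, 0])  # (up to sign), which is what ties eigenvalues back to columns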

How to get the 1st Principal Component by PCA using Python?

别说谁变了你拦得住时间么 Submitted on 2019-11-28 11:38:39

Question: I have a set of 2D vectors presented as an n*2 matrix. I wish to get the 1st principal component, i.e. the vector that indicates the direction with the largest variance. I found a rather detailed document on this from Rice University. Based on it, I imported the data and did the following:

import numpy as np
dataMatrix = np.array(aListOfLists)  # Convert a list-of-lists into a numpy array. aListOfLists holds the data points as a regular list-of-lists type matrix.
myPCA = PCA
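A sketch of two standard ways to get the first principal component, with random data standing in for aListOfLists:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dataMatrix = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Via scikit-learn: components_[0] is the unit vector of largest variance.
pca = PCA(n_components=1).fit(dataMatrix)
print(pca.components_[0])

# Equivalent via SVD of the centered data: the first right singular vector.
centered = dataMatrix - dataMatrix.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
print(vt[0])  # same direction, possibly with flipped sign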

[Machine Learning Theory] Dimensionality-Reduction Algorithms PCA and SVD (partial content, to be updated)

人走茶凉 Submitted on 2019-11-28 10:45:47

A few concepts

Orthogonal matrix: in matrix theory, an orthogonal matrix is a square matrix whose entries are real and whose row vectors and column vectors are all orthogonal unit vectors, so that the matrix's transpose is its inverse: $Q^{T}Q = QQ^{T} = I$, i.e. $Q^{T} = Q^{-1}$, where $I$ is the identity matrix. The determinant of an orthogonal matrix is necessarily $+1$ or $-1$, because $\det(Q)^{2} = \det(Q^{T})\det(Q) = \det(Q^{T}Q) = \det(I) = 1$.

Source: https://www.cnblogs.com/likedata/p/11405547.html
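A quick numpy check of both properties (the QR trick for producing a random orthogonal matrix is my own choice of example):

import numpy as np

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # Q from a QR factorization is orthogonal

print(np.allclose(q.T @ q, np.eye(4)))  # Q^T Q = I, hence Q^T = Q^{-1}
print(np.linalg.det(q))                 # determinant comes out as +1 or -1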

Neural Network Training: Data Augmentation Methods

↘锁芯ラ Submitted on 2019-11-28 10:38:05

1. Mirroring + random cropping + affine transforms (rotation).
2. Color transforms: modify the values of the image's RGB channels to change the color distribution and so increase the diversity of the image data. A common mathematical tool for this is PCA, and open-source image PCA augmentation code is available online; a sketch follows below.

Source: CSDN Author: 十八级台风 Link: https://blog.csdn.net/irobot2016/article/details/88210328
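The PCA-based color augmentation usually meant here is the "fancy PCA" trick popularized by the AlexNet paper: shift every pixel along the principal components of the image's RGB values. A minimal numpy sketch; the function name, the alpha_std default, and the [0, 1] value range are my own assumptions:

import numpy as np

def fancy_pca(image, alpha_std=0.1, rng=None):
    # image: H x W x 3 float array, values assumed to lie in [0, 1]
    rng = np.random.default_rng() if rng is None else rng
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)           # 3x3 covariance of the RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)       # columns of eigvecs = principal axes
    alphas = rng.normal(0.0, alpha_std, size=3)  # fresh random magnitudes per image
    shift = eigvecs @ (alphas * eigvals)         # sum_i alpha_i * lambda_i * p_i
    return np.clip(image + shift, 0.0, 1.0)      # the same shift goes to every pixel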

PCA

社会主义新天地 Submitted on 2019-11-28 10:35:05

PCA is essentially a lossy feature-compression process, but we want the loss of precision to be as small as possible, i.e. we want the compression to retain as much of the original information as possible. To achieve this, we want the projected (dimension-reduced) data points to be as spread out as possible, and this spread can be expressed mathematically by the variance. Let the feature after dimensionality reduction be A with mean $\mu$; then we want to maximize Var(A) = $\frac{1}{m}\sum_{i=1}^{m}(a_{i}-\mu)^{2}$.

Source: https://www.cnblogs.com/xcxy-boke/p/11405052.html
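A small numpy sketch of this claim (with invented data): among unit directions, projecting onto the top eigenvector of the covariance matrix gives the largest Var(A):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                 # demean so Var(A) is just a mean of squares

vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
w = vecs[:, -1]                        # eigenvector of the largest eigenvalue

rand_dir = rng.normal(size=2)
rand_dir /= np.linalg.norm(rand_dir)   # an arbitrary competing unit direction

print(np.var(X @ w))                   # Var(A) along the principal direction...
print(np.var(X @ rand_dir))            # ...is at least as large as along any other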

What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?

筅森魡賤 Submitted on 2019-11-28 07:46:54

Suppose there is a matrix B of size 500*1000 double (here, 500 is the number of observations and 1000 is the number of features). sigma is the covariance matrix of B, and D is a diagonal matrix whose diagonal elements are the eigenvalues of sigma. Assume A holds the eigenvectors of the covariance matrix sigma. I have the following questions: I need to select the first k = 800 eigenvectors corresponding to the eigenvalues with the largest magnitude in order to rank the selected features; the final matrix is named Aq. How can I do this in MATLAB? What is the meaning of these
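In numpy terms (a sketch of the selection step rather than MATLAB code), picking the k eigenvectors with the largest-magnitude eigenvalues looks like this; note that with only 500 observations the covariance matrix has rank at most 499, so most of those 800 eigenvalues will be numerically zero:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(500, 1000))           # 500 observations x 1000 features

sigma = np.cov(B, rowvar=False)            # 1000 x 1000 covariance matrix
eigvals, A = np.linalg.eigh(sigma)         # columns of A are the eigenvectors

k = 800
order = np.argsort(np.abs(eigvals))[::-1]  # indices of largest-magnitude eigenvalues
Aq = A[:, order[:k]]                       # keep the corresponding k eigenvectors
print(Aq.shape)                            # (1000, 800)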

4. PCA and Gradient Ascent

て烟熏妆下的殇ゞ Submitted on 2019-11-28 07:10:12

(1) What is PCA?

PCA, principal component analysis, is mainly used to reduce the dimensionality of a dataset. Take the simplest example: I want to predict a person's gender from features such as name, age, hair length, height, weight, and skin fairness (invented off the top of my head). One of these features is the most useless. Which one? Clearly age, because a person's age has nothing to do with their gender. The same goes for the name: it is clearly not decisive, since some boys have names that sound like girls' names (myself, for one) and vice versa, although in the great majority of cases it still allows a guess. Similarly for height: someone who is 180 cm tall is very likely a boy, though of course some girls are also 180 cm, models for example. Picking out, from a sample's features, the n features that best represent the sample or contribute most decisively to the prediction is what principal component analysis does. Why does PCA exist? In real life a sample may easily have hundreds or even thousands of features, but we cannot train on all of them, because many features are useless or contribute very little; our goal is to find the n features with the most decisive effect.

Characteristics of principal component analysis:
- an unsupervised machine-learning algorithm
- mainly used for dimensionality reduction of data
- through dimensionality reduction, features that are easier for humans to understand can be discovered
- other uses: visualization, denoising, and so on

Consider an example with only two features. If we consider only feature 1 and ignore feature 2, then clearly the blue points are mapped from two dimensions down to one. By the same logic, if we consider only feature 2 and ignore feature 1, it will clearly look like this
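The gradient-ascent half of the title refers to finding the first component by iteratively maximizing the projected variance instead of doing an eigendecomposition. A minimal numpy sketch of that idea; the function name, step size, and iteration count are my own choices:

import numpy as np

def first_component(X, eta=0.001, n_iters=10_000):
    # Gradient ascent on the projected variance f(w) = ||X w||^2 / m,
    # keeping w on the unit sphere by renormalizing after every step.
    X = X - X.mean(axis=0)                     # demean the data first
    w = np.random.default_rng(0).normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iters):
        grad = 2.0 / len(X) * (X.T @ (X @ w))  # gradient of f at w
        w = w + eta * grad
        w /= np.linalg.norm(w)                 # project back to unit length
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
print(first_component(X))  # matches the top covariance eigenvector up to sign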