pca

selection of features using PCA

Submitted by 好久不见. on 2019-11-29 09:54:50
Question: I am doing unsupervised classification. For this I have 8 features per image (variance of green, std. dev. of green, mean of red, variance of red, std. dev. of red, mean of hue, variance of hue, std. dev. of hue), and I want to select the 3 most significant features using PCA. I have written the following code for feature selection (the dimension of feature is 179×8):

    mu = mean(feature);                   % 1x8 row of per-feature means, computed once up front
    for c = 1:size(feature,1)
        feature(c,:) = feature(c,:) - mu; % center each image's feature vector
    end
    DataCov = cov(feature);               % 8x8 covariance matrix
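For comparison, the same computation can be sketched in Python with numpy, under the assumption that feature is the 179×8 array from the question; taking the top 3 eigenvectors of the covariance matrix gives the 3 leading principal directions:

    import numpy as np

    Xc = feature - feature.mean(axis=0)   # center once, before the covariance
    C = np.cov(Xc, rowvar=False)          # 8x8 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]     # indices in descending-variance order
    top3 = eigvecs[:, order[:3]]          # 3 leading principal directions (8x3)
    scores = Xc @ top3                    # 179x3 projection of the images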

Andrew Ng's Machine Learning: Dimensionality Reduction (11)

Submitted by 你离开我真会死。 on 2019-11-29 08:17:10
1. Overview
We want enough features (knowledge) to guarantee a well-trained model, but high-dimensional features bring several problems: learning performance drops, since the more knowledge there is, the slower it is to absorb (input) and master (learn); too many features are hard to tell apart, making it difficult to see at a glance what a given feature means; and features become redundant. The usual remedy is to project the high-dimensional features onto a lower-dimensional space. Dimensionality reduction serves two purposes: data compression and data visualization. Examples: 2D → 1D, 3D → 2D.
2. PCA (Principal Component Analysis)
PCA, Principal Component Analysis, is the most widely used technique for dimensionality reduction. PCA extracts the principal components from redundant features, speeding up model training without losing much model quality, and it keeps the projection error of each feature small enough that the information in the original features is preserved as far as possible.
(1) PCA algorithm flow
Goal: reduce the feature dimension from n to k. Steps: 1) Standardize the features, to balance their scales: $x^{(i)}_j = \frac{x^{(i)}_j - \mu_j}{s_j}$, where $\mu_j$ is the mean of feature j and $s_j$ is its standard deviation.
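A minimal numpy sketch of this standardization step (the function name standardize is ours, not from the article):

    import numpy as np

    def standardize(X):
        """Scale each feature (column) to zero mean and unit standard deviation."""
        mu = X.mean(axis=0)   # mu_j: mean of feature j
        s = X.std(axis=0)     # s_j: standard deviation of feature j
        return (X - mu) / s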

How to use eigenvectors obtained through PCA to reproject my data?

Submitted by 只谈情不闲聊 on 2019-11-29 07:42:25
I am using PCA on 100 images. My training data is a 442368×100 double matrix: 442368 features by 100 images. Here is my code for finding the eigenvectors:

    [rows, cols] = size(training);
    train_mean = mean(training, 2);           % mean image (column vector)
    A = training - train_mean*ones(1, cols);  % center every image
    A = A'*A;                                 % 100x100 Gram matrix instead of the huge covariance
    [evec, eval] = eig(A);
    [eval, ind] = sort(-1*diag(eval));        % sort eigenvalues in descending order
    evec = evec(:, ind(1:100));               % eigenvectors, largest eigenvalue first

Now evec is a 100×100 double eigenvector matrix, and I have 100 sorted eigenvectors. Question: now, if I want to transform my testing data using the eigenvectors calculated above, how…
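For reference, the same Gram-matrix trick sketched in Python/numpy, with the projection step the question is heading toward included under the assumption that test holds test images as columns: eigenvectors of A'A live in 100-dimensional image space and must be mapped back through A before they can project 442368-dimensional data.

    import numpy as np

    # training: 442368 x 100 (features x images), as in the question
    train_mean = training.mean(axis=1, keepdims=True)
    A = training - train_mean                 # center each image
    G = A.T @ A                               # 100x100 Gram matrix (the A'*A trick)
    vals, vecs = np.linalg.eigh(G)
    order = np.argsort(vals)[::-1]            # descending eigenvalue order
    vecs = vecs[:, order]
    eigvecs = A @ vecs                        # map back to 442368-dim feature space
    eigvecs /= np.linalg.norm(eigvecs, axis=0)  # renormalize each column
    scores = eigvecs.T @ (test - train_mean)  # project centered test images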

How can scikit-learn perform PCA on sparse data in libsvm format?

Submitted by 落花浮王杯 on 2019-11-29 07:28:09
I am using scikit-learn for a dimensionality reduction task. My training/test data is in the libsvm format, a large sparse matrix with half a million columns. I use the load_svmlight_file function to load the data, but when I apply SparsePCA, scikit-learn throws an exception about invalid input data. How can I fix it? Sparse PCA is an algorithm for finding a sparse decomposition (the components have a sparsity constraint) of dense data. If you want to do vanilla PCA on sparse data you should use sklearn.decomposition.RandomizedPCA instead, which implements a scalable approximate method that works on both…
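Note that RandomizedPCA has since been removed from scikit-learn; in current versions, TruncatedSVD is the usual way to reduce sparse input directly. A minimal sketch, where the filename train.libsvm and the component count are placeholders:

    from sklearn.datasets import load_svmlight_file
    from sklearn.decomposition import TruncatedSVD

    X, y = load_svmlight_file("train.libsvm")  # X comes back as a scipy.sparse CSR matrix
    svd = TruncatedSVD(n_components=100)       # works directly on sparse input, no densifying
    X_reduced = svd.fit_transform(X)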

R function prcomp fails with NA values even though NAs are allowed

Submitted by £可爱£侵袭症+ on 2019-11-29 05:29:55
I am using the function prcomp to calculate the first two principal components. However, my data has some NA values, and the function therefore throws an error. The na.action argument seems not to work, even though it is mentioned in the help file ?prcomp. Here is my example:

    d <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10))
    prcomp(d, center = TRUE, scale = TRUE, na.action = na.omit)
    d$V1[5] <- NA
    d$V2[7] <- NA
    prcomp(d, center = TRUE, scale = TRUE, na.action = na.omit)

I am using the newest R version, 2.15.1 for Mac OS X. Can anybody see the reason why prcomp fails? Here is my new…
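The usual workaround is to drop the incomplete rows before the decomposition, which is what na.omit does to complete cases. The same idea sketched in Python/numpy, with synthetic data mimicking the example above:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))
    X[4, 0] = np.nan                     # mimics d$V1[5] <- NA (R is 1-based)
    X[6, 1] = np.nan                     # mimics d$V2[7] <- NA
    complete = ~np.isnan(X).any(axis=1)  # keep only rows without missing values
    scores = PCA(n_components=2).fit_transform(X[complete])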

Using memmap files for batch processing

Submitted by 痴心易碎 on 2019-11-29 05:06:45
I have a huge dataset on which I wish to perform PCA. I am limited by RAM and by the computational efficiency of PCA, so I switched to iterative PCA. Dataset size: (140000, 3504). The documentation states: "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory." This sounds really good, but I am unsure how to take advantage of it. I tried loading one memmap, hoping the algorithm would access it in chunks, but my RAM blew up. My code below ends up using a lot of RAM:

    ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', …
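The quoted documentation is from sklearn.decomposition.IncrementalPCA. A minimal sketch of the intended pattern, assuming the array has already been written to my_array.mmap with shape (140000, 3504); the component count and batch size are illustrative:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    X = np.memmap('my_array.mmap', dtype=np.float32, mode='r',
                  shape=(140000, 3504))       # opened read-only, nothing loaded yet
                                              # (float32 plays more nicely with scikit-learn than float16)
    ipca = IncrementalPCA(n_components=50)
    for start in range(0, X.shape[0], 1000):  # feed explicit chunks so only one
        ipca.partial_fit(X[start:start + 1000])  # batch is ever pulled into RAM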

Apply PCA on very large sparse matrix

Submitted by 此生再无相见时 on 2019-11-29 03:18:20
I am doing a text classification task with R, and I obtain a document-term matrix of size 22490 by 120,000 (only 4 million non-zero entries, less than 1% of the entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I store the sparse matrix in a file in the "Matrix Market Format", hoping to use some other technique to do the PCA. Could anyone give me some hints about useful libraries (whatever the programming language) that could do PCA on this large-scale matrix with ease, or do a longhand PCA by…
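One common route, sketched here in Python: SciPy reads Matrix Market files directly, and truncated SVD, the LSA-style substitute for PCA that skips centering so the matrix stays sparse, handles this size easily. The filename dtm.mtx and the component count are placeholders:

    from scipy.io import mmread
    from sklearn.decomposition import TruncatedSVD

    X = mmread('dtm.mtx').tocsr()         # 22490 x 120000 sparse document-term matrix
    svd = TruncatedSVD(n_components=100)  # no centering, so sparsity is preserved
    X_reduced = svd.fit_transform(X)      # 22490 x 100 dense result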

Finding the dimension with highest variance using scikit-learn PCA

Submitted by 倾然丶 夕夏残阳落幕 on 2019-11-29 00:44:52
Question: I need to use PCA to identify the dimensions with the highest variance in a certain set of data. I am using scikit-learn's PCA to do it, but I cannot tell from the output of the PCA method which components of my data have the highest variance. Keep in mind that I do not want to eliminate those dimensions, only identify them. My data is organized as a matrix with 150 rows of data, each with 4 dimensions. I am doing as follows:

    pca = sklearn.decomposition.PCA()
    pca.fit(data_matrix)
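A sketch of one way to read this off the fitted object, assuming data_matrix is the 150×4 array: explained_variance_ratio_ gives the variance share of each principal axis, and the loadings in components_ tie each axis back to the original dimensions.

    import numpy as np
    from sklearn.decomposition import PCA

    pca = PCA()
    pca.fit(data_matrix)
    print(pca.explained_variance_ratio_)   # variance share of each principal axis
    # Original dimension contributing most to the highest-variance axis:
    top_dim = np.argmax(np.abs(pca.components_[0]))
    print(top_dim)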

PCA FactoMineR plot data

Submitted by 痴心易碎 on 2019-11-29 00:38:02
I'm running an R script that generates plots of a PCA analysis using FactoMineR. I'd like to output the coordinates behind the generated PCA plots, but I'm having trouble finding the right ones. I found results1$ind$coord and results1$var$coord, but neither looks like the default plot. I found http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf and http://factominer.free.fr/classical-methods/principal-components-analysis.html but neither describes the contents of the variable created by the PCA function.

    library(FactoMineR)
    data1 <- read.table(file=args[1], …

A question about PCA analysis

Submitted by 一曲冷凌霜 on 2019-11-28 23:48:21
Some principal components from R and from Python's scikit-learn PCA come out with opposite signs. When the PCA is computed separately in R and in Python, some of the resulting principal components are sign-flipped. There is nothing wrong with these results; the flipped sign only means that the component has been reflected, and the result is equally valid. PCA essentially looks for an orthogonal direction, and that direction can point either way.

Data format:

    148 41 72 78
    139 34 71 76
    160 49 77 86
    149 36 67 79
    159 45 80 86
    142 31 66 76
    153 43 76 83
    150 43 77 79
    151 42 77 80
    139 31 68 74
    140 29 64 74
    161 47 78 84
    158 49 78 83
    140 33 67 77
    137 31 66 73
    152 35 73 79
    149 47 82 79
    145 35 70 77
    160 47 74 87
    156 44 78 85
    151 42 73 82
    147 38 73 78
    157 39 68 80
    147 30 65 75
    157 48 80 88
    151 36 74 80
    144 36 68 76
    141 30 67 76
    139 32 68 73
    148 38 70 78

Python code for computing the PCA:

    from sklearn.decomposition import PCA
    pca = PCA()
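A sketch of how the sign ambiguity can be neutralized when comparing implementations, assuming the table above has been saved to a hypothetical data.txt: flip each component so that its largest-magnitude loading is positive.

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.loadtxt('data.txt')             # the 30 x 4 table above
    pca = PCA().fit(X)
    comps = pca.components_.copy()
    for i, c in enumerate(comps):
        if c[np.argmax(np.abs(c))] < 0:    # largest-magnitude loading negative?
            comps[i] = -c                  # reflect: same subspace, fixed sign convention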