pca

doing PCA on very large data set in R

喜你入骨 · Submitted on 2019-12-03 09:59:43
Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow. Migrated 7 years ago. I have a very large training set (~2 GB) in a CSV file. The file is too large to read directly into memory (read.csv() brings the computer to a halt), and I would like to reduce the size of the data file using PCA. The problem is that (as far as I can tell) I need to read the file into memory in order to run a PCA algorithm (e.g., princomp()). I have tried the bigmemory package
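The question is about R, but the underlying idea of fitting PCA without loading the whole file can be sketched in Python with scikit-learn's IncrementalPCA, which consumes the data one chunk at a time. The chunk sizes and synthetic data below are placeholders standing in for reading a large CSV in pieces:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Simulate a dataset too large to load at once by processing it in chunks.
rng = np.random.default_rng(0)
n_chunks, chunk_rows, n_features, n_components = 10, 200, 50, 5

ipca = IncrementalPCA(n_components=n_components)

# First pass: fit on each chunk (with a real CSV, you would read
# chunk_rows lines at a time instead of generating them).
for _ in range(n_chunks):
    chunk = rng.normal(size=(chunk_rows, n_features))
    ipca.partial_fit(chunk)

# Second pass: transform chunks and keep only the reduced representation.
reduced = ipca.transform(rng.normal(size=(chunk_rows, n_features)))
print(reduced.shape)
```

Because partial_fit only ever sees one chunk, peak memory is bounded by the chunk size rather than the full dataset.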

principal component analysis (PCA) in R: which function to use?

余生颓废 · Submitted on 2019-12-03 09:59:21
Question: Can anyone explain what the major differences between the prcomp and princomp functions are? Is there any particular reason why I should choose one over the other? In case this is relevant, the type of application I am looking at is a quality control analysis for genomic (expression) data sets. Thank you! Answer 1: There are differences between these two functions with respect to the function parameters (what you can/must pass in when you call the function), the values returned by each, and the numerical
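The key numerical difference between the two R functions is that prcomp uses an SVD of the centered data while princomp eigendecomposes the covariance matrix. A small NumPy sketch (not the R functions themselves) shows the two routes recover the same principal axes up to sign:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)

# prcomp-style: SVD of the centered data matrix (numerically preferred).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vt  # rows are principal axes

# princomp-style: eigendecomposition of the covariance matrix.
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]
pcs_eig = evecs[:, order].T

# The axes agree up to sign: unit vectors with |dot product| = 1.
for a, b in zip(pcs_svd, pcs_eig):
    assert np.allclose(np.abs(a @ b), 1.0)
```

(The divisor, n vs. n-1, differs between the two R functions, which rescales the variances but not the directions.)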

Extracting PCA components with sklearn

馋奶兔 · Submitted on 2019-12-03 09:56:16
I am using sklearn's PCA for dimensionality reduction on a large set of images. Once the PCA is fitted, I would like to see what the components look like. One can do so by looking at the components_ attribute. Not realizing that was available, I did something else instead: each_component = np.eye(total_components) component_im_array = pca.inverse_transform(each_component) for i in range(num_components): component_im = component_im_array[i, :].reshape(height, width) # do something with component_im In other words, I create an image in the PCA space that has all features but one set to 0. By
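The subtlety in the approach described above is that inverse_transform of the identity matrix does not return the raw components: it returns each component plus the fitted mean, since inverse_transform adds mean_ back. A small check (synthetic data in place of the images):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))

pca = PCA(n_components=3).fit(X)

# inverse_transform of the identity in component space gives
# mean_ + components_, not components_ alone.
eye = np.eye(pca.n_components_)
back = pca.inverse_transform(eye)
assert np.allclose(back, pca.mean_ + pca.components_)
```

So the identity-matrix trick and reading components_ directly differ by exactly the mean image.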

How is the complexity of PCA O(min(p^3,n^3))?

放肆的年华 · Submitted on 2019-12-03 08:49:33
Question: I've been reading a paper on Sparse PCA: http://stats.stanford.edu/~imj/WEBLIST/AsYetUnpub/sparse.pdf It states that, if you have n data points, each represented with p features, then the complexity of PCA is O(min(p^3, n^3)). Can someone please explain how/why? Answer 1: Covariance matrix computation is O(p^2 n); its eigenvalue decomposition is O(p^3). So the complexity of PCA is O(p^2 n + p^3). O(min(p^3, n^3)) would imply that you could analyze a two-dimensional dataset of any
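The min(p^3, n^3) figure comes from the Gram-matrix trick: when n < p, the n x n matrix X Xᵀ has the same nonzero eigenvalues as the p x p matrix Xᵀ X, so you can decompose whichever is smaller. A NumPy sketch verifying the spectra agree:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 500          # many more features than samples
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

# Direct route: eigendecompose the p x p scatter matrix, O(p^3).
cov_evals = np.linalg.eigvalsh(Xc.T @ Xc)

# Gram trick: the n x n matrix Xc Xc^T has the same nonzero
# eigenvalues, at O(n^3) cost instead -- hence min(p^3, n^3).
gram_evals = np.linalg.eigvalsh(Xc @ Xc.T)

top = np.sort(cov_evals)[::-1][:n]
assert np.allclose(np.sort(gram_evals)[::-1], top)
```

The eigenvectors of the large matrix are then recovered by multiplying the small matrix's eigenvectors by Xcᵀ, which is how SVD-based PCA exploits the smaller dimension automatically.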

PCA of RGB Image

匿名 (Anonymous, unverified) · Submitted on 2019-12-03 08:48:34
Question: I'm trying to figure out how to use PCA to decorrelate an RGB image in Python. I'm using the code found in the O'Reilly computer vision book: from PIL import Image from numpy import * def pca(X): # Principal Component Analysis # input: X, matrix with training data as flattened arrays in rows # return: projection matrix (with important dimensions first), # variance and mean #get dimensions num_data,dim = X.shape #center data mean_X = X.mean(axis=0) for i in range(num_data): X[i] -= mean_X if dim>100: print 'PCA - compact trick used' M = dot(X
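Decorrelating an RGB image with PCA means treating each pixel as a sample with three channel features and rotating the channels onto the covariance eigenbasis, so the rotated channels have (near-)zero cross-covariance. A self-contained sketch using synthetic correlated channels in place of a real image:

```python
import numpy as np

rng = np.random.default_rng(4)

# Fake "RGB image": three strongly correlated channels, flattened
# to shape (pixels, 3) as if each pixel were a sample.
base = rng.normal(size=(64 * 64,))
rgb = np.stack([base + 0.1 * rng.normal(size=base.shape) for _ in range(3)],
               axis=1)

# Decorrelate: rotate the centered channels onto the covariance eigenbasis.
mean = rgb.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(rgb - mean, rowvar=False))
decorrelated = (rgb - mean) @ evecs

# The off-diagonal covariance of the rotated channels is ~0.
new_cov = np.cov(decorrelated, rowvar=False)
off_diag = new_cov - np.diag(np.diag(new_cov))
assert np.max(np.abs(off_diag)) < 1e-10
```

For a real image you would reshape an (H, W, 3) array to (H*W, 3) first; the rotation itself is identical.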

Getting model attributes from scikit-learn pipeline

匿名 (Anonymous, unverified) · Submitted on 2019-12-03 08:33:39
Question: I typically get PCA loadings like this: pca = PCA(n_components=2) X_t = pca.fit(X).transform(X) loadings = pca.components_ If I run PCA using a scikit-learn pipeline ... from sklearn.pipeline import Pipeline pipeline = Pipeline(steps=[ ('scaling',StandardScaler()), ('pca',PCA(n_components=2)) ]) X_t=pipeline.fit_transform(X) ... is it possible to get the loadings? Simply trying loadings = pipeline.components_ fails: AttributeError: 'Pipeline' object has no attribute 'components_' Thanks! (Also interested in extracting attributes like coef_
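Fitted step attributes live on the individual steps, not on the Pipeline object itself; scikit-learn exposes them through named_steps (or index/name access on the pipeline). Reusing the question's own setup:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))

pipeline = Pipeline(steps=[
    ('scaling', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
X_t = pipeline.fit_transform(X)

# Reach into the fitted step by name to get its attributes.
loadings = pipeline.named_steps['pca'].components_
print(loadings.shape)
```

The same pattern works for any step attribute, e.g. pipeline.named_steps['scaling'].mean_, or coef_ on a final estimator.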

PCA in OpenCV using the new C++ interface

匿名 (Anonymous, unverified) · Submitted on 2019-12-03 07:50:05
Question: As an aside: Apologies if I'm flooding SO with OpenCV questions :p I'm currently trying to port over my old C code to use the new C++ interface and I've got to the point where I'm rebuilding my Eigenfaces face recogniser class. Mat img = imread("1.jpg"); Mat img2 = imread("2.jpg"); FaceDetector* detect = new HaarDetector("haarcascade_frontalface_alt2.xml"); // convert to grey scale Mat g_img, g_img2; cvtColor(img, g_img, CV_BGR2GRAY); cvtColor(img2, g_img2, CV_BGR2GRAY); // find the faces in the images Rect r = detect->getFace(g_img); Mat
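The eigenfaces computation underneath OpenCV's cv::PCA can be sketched without OpenCV: stack flattened grayscale faces as rows, center them, and take the leading right singular vectors. This NumPy sketch uses random arrays as stand-ins for real face images:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for grayscale face images: 10 images of 16x16 pixels,
# flattened so each row is one image.
faces = rng.normal(size=(10, 16 * 16))

# Eigenfaces = principal components of the image rows.
mean_face = faces.mean(axis=0)
centered = faces - mean_face
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:5]                  # top 5 eigenfaces
weights = centered @ eigenfaces.T    # each face as 5 coefficients

print(eigenfaces.shape, weights.shape)
```

In the C++ interface this corresponds to constructing cv::PCA from the row-stacked image matrix and projecting each face to get its coefficient vector.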

What is the fastest way to calculate first two principal components in R?

拈花ヽ惹草 · Submitted on 2019-12-03 07:08:11
Question: I am using princomp in R to perform PCA. My data matrix is huge (10K x 10K, with each value up to 4 decimal points). It takes ~3.5 hours and ~6.5 GB of physical memory on a Xeon 2.27 GHz processor. Since I only want the first two components, is there a faster way to do this? Update: In addition to speed, is there a memory-efficient way to do this? It takes ~2 hours and ~6.3 GB of physical memory to calculate the first two components using svd(,2,). Answer 1: You sometimes get access to so-called
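When only the first two components are wanted, a truncated SVD avoids computing the full decomposition. The question is about R's svd(,2,), but the same idea can be sketched in Python with scipy's svds (a Lanczos-style partial SVD); the matrix here is a small stand-in for the 10K x 10K data:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 300))
Xc = X - X.mean(axis=0)

# Ask for only the 2 leading singular triplets instead of a full
# decomposition -- far cheaper on a 10K x 10K matrix.
U, s, Vt = svds(Xc, k=2)

# svds returns singular values in ascending order; reverse them.
order = np.argsort(s)[::-1]
scores = U[:, order] * s[order]   # scores on the first two PCs

print(scores.shape)
```

Randomized solvers (e.g. scikit-learn's PCA with svd_solver='randomized') offer a similar speed/memory trade-off when only a few components are needed.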

Test significance of clusters on a PCA plot

荒凉一梦 · Submitted on 2019-12-03 05:08:57
Question: Is it possible to test the significance of clustering between 2 known groups on a PCA plot? To test how close they are, or the amount of spread (variance) and the amount of overlap between clusters, etc. Answer 1: You could use a PERMANOVA to partition the Euclidean distance by your groups: data(iris) require(vegan) # PCA iris_c <- scale(iris[ ,1:4]) pca <- rda(iris_c) # plot plot(pca, type = 'n', display = 'sites') cols <- c('red', 'blue', 'green') points(pca, display='sites', col = cols[iris
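The answer uses vegan's PERMANOVA in R; the core idea (a permutation test on group separation) can be sketched in Python with a much simpler statistic, the distance between group centroids in PC space. This is a simplified illustration, not adonis itself:

```python
import numpy as np

rng = np.random.default_rng(8)

# Two known groups in PC space with shifted means -> real separation.
a = rng.normal(loc=0.0, size=(30, 2))
b = rng.normal(loc=2.0, size=(30, 2))
scores = np.vstack([a, b])
labels = np.array([0] * 30 + [1] * 30)

def between_group_stat(points, groups):
    """Distance between group centroids (a simple separation statistic)."""
    c0 = points[groups == 0].mean(axis=0)
    c1 = points[groups == 1].mean(axis=0)
    return np.linalg.norm(c0 - c1)

observed = between_group_stat(scores, labels)

# Permutation test: shuffle the labels and see how often a random
# split separates the points at least as well as the real grouping.
n_perm = 999
count = sum(
    between_group_stat(scores, rng.permutation(labels)) >= observed
    for _ in range(n_perm)
)
p_value = (count + 1) / (n_perm + 1)
print(p_value)
```

PERMANOVA partitions sums of squared distances rather than comparing centroids, but the permutation logic behind its p-value is the same.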

PCA on sklearn - how to interpret pca.components_

|▌冷眼眸甩不掉的悲伤 · Submitted on 2019-12-03 04:32:43
Question: I ran PCA on a data frame with 10 features using this simple code: pca = PCA() fit = pca.fit(dfPca) The result of pca.explained_variance_ratio_ shows: array([ 5.01173322e-01, 2.98421951e-01, 1.00968655e-01, 4.28813755e-02, 2.46887288e-02, 1.40976609e-02, 1.24905823e-02, 3.43255532e-03, 1.84516942e-03, 4.50314168e-16]) I believe that means the first PC explains about 50% of the variance, the second component about 30%, and so on... What I don't understand is the output of pca.components_. If I
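The shape of components_ is the key to reading it: one row per principal component, one column per original feature, so row i holds the weights that combine the 10 features into PC i. A quick check on synthetic 10-feature data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 10))

pca = PCA().fit(X)

# One row per PC, one column per original feature: row i gives the
# weights that combine the 10 features into component i.
assert pca.components_.shape == (10, 10)

# Each component is a unit vector ...
assert np.allclose(np.linalg.norm(pca.components_, axis=1), 1.0)

# ... and the explained-variance ratios sum to 1 when all PCs are kept.
assert np.isclose(pca.explained_variance_ratio_.sum(), 1.0)
```

Scaling each row by the square root of its explained variance turns these direction vectors into the loadings familiar from factor analysis.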