pca

Dimension of data before and after performing PCA

[亡魂溺海] Submitted on 2019-12-03 13:34:40
Question: I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn. After removing labels from the training data, I add each row of the CSV into a list like this:

    for row in csv:
        train_data.append(np.array(np.int64(row)))

I do the same for the test data. I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):

    def preprocess(train_data, test_data, pca_components=100):
        # convert to matrix
        train_data = np.mat(train_data)
        # reduce both
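A minimal sketch of the shape change involved (assuming scikit-learn's PCA; the array sizes here are hypothetical stand-ins for the digit data):

    import numpy as np
    from sklearn.decomposition import PCA

    train_data = np.random.rand(1000, 784)   # n_samples x n_features

    pca = PCA(n_components=100)
    reduced = pca.fit_transform(train_data)

    print(train_data.shape)  # (1000, 784) before PCA
    print(reduced.shape)     # (1000, 100) after PCA: one column per component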

Sklearn PCA: is pca.components_ the loadings?

谁说胖子不能爱 Submitted on 2019-12-03 13:31:55
Is sklearn PCA's pca.components_ the loadings? I am pretty sure it is, but I am trying to follow along with a research paper and I am getting different results from their loadings. I can't find it in the sklearn documentation.

pca.components_ is the orthogonal basis of the space you're projecting the data into. It has shape (n_components, n_features). If you want to keep only the first 3 components (for instance, to do a 3D scatter plot) of a dataset with 100 samples and 50 dimensions (also called features), pca.components_ will have shape (3, 50). I think what you call the "loadings" is the
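A small sketch of the distinction (one common convention defines loadings as the components scaled by the square roots of the explained variances; scikit-learn assumed, data hypothetical):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 50)
    pca = PCA(n_components=3).fit(X)

    print(pca.components_.shape)  # (3, 50): unit-length principal axes

    # loadings under the scaled convention: each entry weights a feature
    # on a component in units that reflect that component's variance
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    print(loadings.shape)         # (50, 3)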

How to use the princomp() function in R when the covariance matrix has zeros?

南笙酒味 Submitted on 2019-12-03 13:00:25
While using the princomp() function in R, the following error is encountered: "covariance matrix is not non-negative definite". I think this is due to some values being zero (actually close to zero, but becoming zero during rounding) in the covariance matrix. Is there a workaround to proceed with PCA when the covariance matrix contains zeros? [FYI: obtaining the covariance matrix is an intermediate step within the princomp() call. A data file to reproduce this error can be downloaded from here - http://tinyurl.com/6rtxrc3] The first strategy might be to decrease the tolerance argument. Looks to me
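As a cross-language aside (Python, the language used for the worked examples in this digest): an SVD-based PCA never forms the covariance matrix explicitly, so a rank-deficient covariance does not stop it — zero eigenvalues simply show up as (near-)zero singular values. A minimal sketch:

    import numpy as np

    # rank-deficient data: the last column duplicates the first,
    # so the covariance matrix is singular (has zero eigenvalues)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    X = np.column_stack([X, X[:, 0]])

    Xc = X - X.mean(axis=0)
    u, s, vt = np.linalg.svd(Xc, full_matrices=False)

    variances = s**2 / (X.shape[0] - 1)  # eigenvalues of the covariance matrix
    print(variances)                     # the last one is ~0; PCA still runs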

What's wrong with my PCA?

[亡魂溺海] Submitted on 2019-12-03 12:41:45
My code:

    from numpy import *

    def pca(orig_data):
        data = array(orig_data)
        data = (data - data.mean(axis=0)) / data.std(axis=0)
        u, s, v = linalg.svd(data)
        print s  # should be s**2 instead!
        print v

    def load_iris(path):
        lines = []
        with open(path) as input_file:
            lines = input_file.readlines()
        data = []
        for line in lines:
            cur_line = line.rstrip().split(',')
            cur_line = cur_line[:-1]
            cur_line = [float(elem) for elem in cur_line]
            data.append(array(cur_line))
        return array(data)

    if __name__ == '__main__':
        data = load_iris('iris.data')
        pca(data)

The iris dataset: http://archive.ics.uci.edu/ml/machine
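For reference, the singular values of standardized data relate to the variances along the principal axes as s**2 / (n - 1), which is likely what the inline comment is pointing at. A quick sketch of the identity (hypothetical data):

    import numpy as np

    X = np.random.rand(150, 4)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)

    u, s, vt = np.linalg.svd(Xs, full_matrices=False)

    # eigenvalues of the covariance matrix, sorted to match the singular values
    eigvals = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))[::-1]
    print(np.allclose(s**2 / (len(Xs) - 1), eigvals))  # True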

PCA multiplot in R

﹥>﹥吖頭↗ Submitted on 2019-12-03 12:34:17
Question: I have a dataset that looks like this:

    India China Brasil Russia SAfrica Kenya States Indonesia States Argentina Chile Netherlands HongKong
    0.0854026763 0.1389383234 0.1244184371 0.0525460881 0.2945586244 0.0404562539 0.0491597968 0 0 0.0618342901 0.0174891774 0.0634064181 0
    0.0519483159 0.0573851759 0.0756806292 0.0207164181 0.0409872092 0.0706355932 0.0664503936 0.0775285039 0.008545575 0.0365674701 0.026595575 0.064280902 0.0338135148
    0 0 0 0 0 0 0 0 0 0 0 0 0
    0.0943708876 0 0 0.0967733329

PCA|factor extraction|CA

前提是你 Submitted on 2019-12-03 12:22:31
PCA: principal component analysis. Build the correlation matrix, find its eigenvalues, and find the eigenvector corresponding to each eigenvalue; together these make up the principal-component expressions. Each expression points to an outcome y, and a line is sought that separates those y values. With 11 variables there are 11 new coordinate axes, and points are distinguished by their distance to the line. The information must be concentrated in the first few principal components; for example, PC1 may account for 3 units of variation. A premise of PCA is that, in the raw data, different x variables must not point to the same y. PCA cannot be used to represent the joint effect of a group of factors. PCA is a kind of factor analysis; different algorithms can be chosen when extracting the eigenvalues. Taking the first and second principal components gives a two-dimensional plot, and changing the axes can express the differences more clearly. The difference between PCA and clustering is that clustering aims to classify the y values, while PCA classifies the features. Correspondence analysis (CA): a chi-square analysis reflects the differences between expected and observed values — that is, the information points, the irregularity between rows and columns. Perform principal component analysis on the chi-square matrix, once on the original matrix and once on its transpose. Comparing PCA with CA: CA allows the raw data to be non-monotonic and does not require normality. PCA also allows the raw data to be non-monotonic, but the final principal components are converted into Euclidean distances, and normality is required.

Source: https://www.cnblogs.com/yuanjingnan/p/11795984.html
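A minimal sketch of the procedure described above (standardize, take the correlation matrix, eigendecompose, and plot PC1 against PC2; the data here is a hypothetical stand-in):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 11))            # 11 variables -> 11 candidate axes

    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize
    R = np.corrcoef(Z, rowvar=False)          # correlation matrix

    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Z @ eigvecs[:, :2]               # project onto PC1 and PC2
    plt.scatter(scores[:, 0], scores[:, 1])
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.show()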

Is it possible to apply PCA to any text classification?

早过忘川 Submitted on 2019-12-03 11:44:30
I'm trying classification with Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (retrieving data from the web as text, then classifying that text: web classification). Now I'm trying to apply PCA to this data, but Python is giving some errors. My code for classification with Naive Bayes:

    from sklearn import PCA
    from sklearn import RandomizedPCA
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vectorizer = CountVectorizer()
    classifer = MultinomialNB(alpha=.01)
    x_train = vectorizer.fit_transform(temizdata)
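Two likely culprits, for what it's worth: the PCA classes live in sklearn.decomposition rather than at the sklearn top level, and PCA does not accept the sparse matrices that CountVectorizer produces — TruncatedSVD (latent semantic analysis) is the usual substitute for text. A hedged sketch (temizdata stands in for the scraped documents):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import GaussianNB

    temizdata = ["some scraped page text", "another page", "more page text here"]
    labels = [0, 1, 0]

    vectorizer = CountVectorizer()
    x_train = vectorizer.fit_transform(temizdata)   # sparse document-term matrix

    svd = TruncatedSVD(n_components=2)              # works directly on sparse input
    x_reduced = svd.fit_transform(x_train)          # dense, may contain negatives

    # MultinomialNB requires non-negative features, so after SVD one would
    # switch to a classifier that tolerates negative values, e.g. GaussianNB
    clf = GaussianNB().fit(x_reduced, labels)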

Using Numpy (np.linalg.svd) for Singular Value Decomposition

╄→尐↘猪︶ㄣ Submitted on 2019-12-03 10:35:35
I'm reading Abdi & Williams (2010), "Principal Component Analysis", and I'm trying to redo the SVD to obtain values for further PCA. The article states the following SVD: X = P D Q^t. I load my data into an np.array X.

    X = np.array(data)
    P, D, Q = np.linalg.svd(X, full_matrices=False)
    D = np.diag(D)

But I do not get the above equality when checking with:

    X_a = np.dot(np.dot(P, D), Q.T)

X_a and X have the same dimensions, but the values are not the same. Am I missing something, or is the functionality of the np.linalg.svd function somehow incompatible with the equation in the paper? TL;DR: numpy's
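The truncated answer presumably goes on to note that np.linalg.svd returns the third factor already transposed — it is Q^t, not Q — so the reconstruction should use it as returned rather than transposing again. A quick sketch:

    import numpy as np

    X = np.random.rand(5, 3)
    P, d, Qt = np.linalg.svd(X, full_matrices=False)  # Qt is already Q^t
    D = np.diag(d)

    print(np.allclose(X, P @ D @ Qt))    # True: X = P D Q^t
    print(np.allclose(X, P @ D @ Qt.T))  # False: transposing again breaks it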

Pass PCA preprocessing arguments to train()

落爺英雄遲暮 Submitted on 2019-12-03 10:32:43
I'm trying to build a predictive model in caret using PCA as pre-processing. The pre-processing would be as follows:

    preProc <- preProcess(IL_train[,-1], method="pca", thresh = 0.8)

Is it possible to pass the thresh argument directly to caret's train() function? I've tried the following, but it doesn't work:

    modelFit_pp <- train(IL_train$diagnosis ~ . , preProcess="pca",
                         thresh = 0.8, method="glm", data=IL_train)

If not, how can I pass the separate preProc results to the train() function? As per the documentation, you specify additional preprocessing arguments with trainControl: ?trainControl ..
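As a cross-language aside (the worked examples in this digest are in Python): scikit-learn expresses the same "keep enough components to reach a variance threshold" idea by passing a float to PCA's n_components inside a Pipeline. This is offered as an analogy only, not as caret's API; the data and names are hypothetical:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    X = np.random.rand(100, 20)
    y = np.random.randint(0, 2, size=100)

    # a float n_components keeps the smallest number of components whose
    # cumulative explained variance exceeds the threshold (here 80%)
    model = make_pipeline(PCA(n_components=0.8), LogisticRegression())
    model.fit(X, y)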

Is there a good library to do nonnegative matrix factorization (NMF) fast?

爷，独闯天下 Submitted on 2019-12-03 10:12:48
Question: I have a sparse matrix whose shape is 570000*3000. I tried nimfa to do NMF (using the default NMF method, and setting max_iter to 65). However, I found nimfa very slow. Has anyone used a faster library to do NMF?

Answer 1: I have used libNMF before. It's written in C and is very fast. There is a paper documenting the algorithm and code. The paper also lists several alternative packages for NMF (in a bunch of different languages), which I have copied here for future reference. The Mathworks [3, 33]
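For a Python-native option, scikit-learn's NMF accepts scipy sparse input directly and may be worth benchmarking against nimfa at this scale; a minimal sketch (shapes shrunk from the 570000x3000 in the question):

    import scipy.sparse as sp
    from sklearn.decomposition import NMF

    # small stand-in for the 570000 x 3000 sparse matrix in the question
    X = sp.random(5000, 300, density=0.01, format='csr', random_state=0)

    model = NMF(n_components=20, init='nndsvd', max_iter=65)
    W = model.fit_transform(X)   # (5000, 20) basis activations
    H = model.components_        # (20, 300) components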