pca

How to find the closest 2 points in a 100-dimensional space with 500,000 points?

Submitted by 梦想与她 on 2019-12-03 03:54:35
Question: I have a database with 500,000 points in a 100-dimensional space, and I want to find the closest 2 points. How do I do it? Update: the space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.

Answer 1: You could try the ANN library, but that only gives reliable results up to 20 dimensions.

Answer 2: There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n log n) time. You can check it out on Google Books. In fact, I
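As an aside, a minimal sketch in Python of one way to answer this kind of query, using scikit-learn rather than the ANN library mentioned in the answer; the array sizes below are small random stand-ins for the real 500,000 x 100 data.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in data: in reality this would be the 500,000 x 100 point matrix.
rng = np.random.default_rng(0)
points = rng.standard_normal((5000, 100))

# For each point, find its 2 nearest neighbors; the first neighbor is the point itself.
nn = NearestNeighbors(n_neighbors=2).fit(points)
dist, idx = nn.kneighbors(points)

# dist[:, 1] is the distance from each point to its closest *other* point.
i = int(np.argmin(dist[:, 1]))
j = int(idx[i, 1])
print("closest pair:", i, j, "distance:", dist[i, 1])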

PCA projection and reconstruction in scikit-learn

Submitted by 早过忘川 on 2019-12-03 03:45:46
Question: I can perform PCA in scikit-learn with the code below (X_train has 279180 rows and 104 columns):

from sklearn.decomposition import PCA
pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)

Now, when I want to project the eigenvectors onto feature space, I must do the following:

""" Projection """
comp = pca.components_                   # 30x104
com_tr = np.transpose(pca.components_)   # 104x30
proj = np.dot(X_train, com_tr)           # 279180x104 * 104x30 = 279180x30

But I am hesitating with this step, because Scikit
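For reference, scikit-learn's transform() subtracts the fitted mean before projecting, so the manual dot product above skips the centering step. A minimal sketch of the equivalence, assuming X_train is an ordinary NumPy array (a small random stand-in is used here):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.standard_normal((1000, 104))   # stand-in for the real 279180x104 matrix

pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)

# transform() centers the data with the fitted mean before projecting onto the components:
proj_manual = (X_train - pca.mean_) @ pca.components_.T      # 1000x30
assert np.allclose(proj_manual, pca.transform(X_train))

# Reconstruction back into the original 104-dimensional space:
X_rec = pca.inverse_transform(X_train_pca)
assert np.allclose(X_rec, X_train_pca @ pca.components_ + pca.mean_)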

PCA Dimensionality Reduction

Submitted by 和自甴很熟 on 2019-12-03 03:39:07
Question: I am trying to perform PCA, reducing 900 dimensions to 10. So far I have:

covariancex = cov(labels);
[V, d] = eigs(covariancex, 40);
pcatrain = (trainingData - repmat(mean(trainingData), 699, 1)) * V;
pcatest = (test - repmat(mean(trainingData), 225, 1)) * V;

where labels is a 1x699 vector of labels for the characters (1-26), trainingData is 699x900 (900-dimensional data for the images of 699 characters), and test is 225x900 (225 characters in 900 dimensions). Basically I want to reduce this down to 225x10, i.e. 10 dimensions, but am kind of stuck at this point.

Answer 1: The covariance is supposed to be computed on your trainingData: X =
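For illustration only (in Python rather than the question's MATLAB), a minimal sketch of the step the answer points to: the covariance is taken over the training data, not the labels, and the test set is projected with the same mean and eigenvectors. The arrays are random stand-ins with the question's shapes.

import numpy as np

rng = np.random.default_rng(0)
training_data = rng.standard_normal((699, 900))   # stand-in for the 699x900 training matrix
test_data = rng.standard_normal((225, 900))       # stand-in for the 225x900 test matrix

# Covariance of the *training data* (not the labels), features along columns.
mu = training_data.mean(axis=0)
cov = np.cov(training_data, rowvar=False)          # 900x900

# Eigenvectors of the covariance matrix; keep the 10 with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(cov)
V = eigvecs[:, np.argsort(eigvals)[::-1][:10]]     # 900x10

pca_train = (training_data - mu) @ V               # 699x10
pca_test = (test_data - mu) @ V                    # 225x10
print(pca_train.shape, pca_test.shape)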

Dimension of data before and after performing PCA

Submitted by 喜你入骨 on 2019-12-03 03:29:53
I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn. After removing labels from the training data, I add each row of the CSV to a list like this:

for row in csv:
    train_data.append(np.array(np.int64(row)))

I do the same for the test data. I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):

def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)
    # reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca
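A minimal sketch of the usual pattern, using plain NumPy arrays as stand-ins for the parsed CSV rows: fit PCA once on the training matrix, transform both sets with it, and compare the shapes before and after.

import numpy as np
from sklearn import decomposition

# Stand-ins; the real data would be the parsed CSV rows (e.g. 42000x784 for the digit images).
train_data = np.random.rand(1000, 784)
test_data = np.random.rand(200, 784)

pca = decomposition.PCA(n_components=100).fit(train_data)
X_train = pca.transform(train_data)
X_test = pca.transform(test_data)   # reuse the PCA fitted on the training data

print(train_data.shape, "->", X_train.shape)   # (1000, 784) -> (1000, 100)
print(test_data.shape, "->", X_test.shape)     # (200, 784)  -> (200, 100)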

R - how to make PCA biplot more readable

Submitted by 佐手、 on 2019-12-03 03:20:00
Question: I have a set of observations with 23 variables. When I use prcomp and biplot to plot the results, I run into several problems:

1. the actual plot only occupies half of the frame (x < 0), but the plot is centered on 0, so half of the space is wasted
2. two variables clearly dominate the results, so all the other arrows are clumped together and I can't read a thing

Regarding 1: I tried setting xlim and/or ylim, but I'm obviously doing something wrong since the plot is all messed up when I do. Regarding 2: Can I just
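Not the thread's R answer, but a small Python illustration of the second problem: when variables sit on very different scales, the largest ones dominate the loadings (and hence the biplot arrows), while standardizing first, the analogue of prcomp(..., scale. = TRUE), spreads them out.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 23))
X[:, 0] *= 100.0          # two artificially dominant variables
X[:, 1] *= 50.0

raw_loadings = PCA(n_components=2).fit(X).components_
scaled_loadings = PCA(n_components=2).fit(StandardScaler().fit_transform(X)).components_

# Raw data: the first two components are almost entirely variables 0 and 1.
print(np.abs(raw_loadings).argmax(axis=1), np.abs(raw_loadings).max(axis=1).round(2))
# Standardized data: no single variable dominates the loadings.
print(np.abs(scaled_loadings).max(axis=1).round(2))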

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Submitted anonymously (unverified) on 2019-12-03 03:05:02
Question: I am reducing the dimensionality of a Spark DataFrame with a PCA model in pyspark (using the spark ml library) as follows:

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

where data is a Spark DataFrame with one column labeled features, which is a DenseVector of 3 dimensions:

data.take(1)
Row(features=DenseVector([0.4536, -0.43218, 0.9876]), label=u'class1')

After fitting, I transform the data:

transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536, -0.43218, 0.9876]), label=u
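A minimal sketch, assuming Spark 2.0+ where pyspark.ml.feature.PCAModel exposes the fitted components as pc and the per-component variance ratios as explainedVariance; the three rows here are made-up stand-ins.

from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(Vectors.dense([0.4536, -0.43218, 0.9876]),),
     (Vectors.dense([0.1, 0.2, 0.3]),),
     (Vectors.dense([1.0, -0.5, 0.25]),)],
    ["features"])

model = PCA(k=3, inputCol="features", outputCol="pca_features").fit(data)

print(model.pc)                  # DenseMatrix: each column is a principal component (eigenvector)
print(model.explainedVariance)   # DenseVector: proportion of variance explained per component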

Weka's PCA is taking too long to run

Submitted anonymously (unverified) on 2019-12-03 02:45:02
Question: I am trying to use Weka for feature selection using the PCA algorithm. My original feature space contains ~9000 attributes, in 2700 samples. I tried to reduce the dimensionality of the data using the following code:

AttributeSelection selector = new AttributeSelection();
PrincipalComponents pca = new PrincipalComponents();
Ranker ranker = new Ranker();
selector.setEvaluator(pca);
selector.setSearch(ranker);
Instances instances = SamplesManager.asWekaInstances(trainSet);
try {
    selector.SelectAttributes(instances);
    return SamplesManager.asSamplesList
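Not the thread's Weka answer, but a rough illustration of why ~9000 attributes is heavy: a full PCA works with a 9000x9000 correlation/covariance matrix, whereas a randomized solver that extracts only the leading components (shown here with scikit-learn) avoids most of that cost. The matrix below is a random stand-in with the question's shape.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(2700, 9000)                 # stand-in for the 2700x9000 sample matrix

# Randomized SVD computes only the leading components instead of a full eigendecomposition.
pca = PCA(n_components=100, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                          # (2700, 100)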

OpenCV PCA Compute in Python

Submitted anonymously (unverified) on 2019-12-03 02:06:01
Question: I'm loading a set of test images via OpenCV (in Python) which are 128x128 in size, reshaping them into vectors (1, 128x128), and putting them all together in a matrix to calculate PCA. I'm using the new cv2 libraries... The code:

import os
import cv2 as cv
import numpy as np

matrix_test = None
for image in os.listdir('path_to_dir'):
    imgraw = cv.imread(os.path.join('path_to_dir', image), 0)
    imgvector = imgraw.reshape(128*128)
    try:
        matrix_test = np.vstack((matrix_test, imgvector))
    except:
        matrix_test = imgvector

# PCA
mean, eigenvectors = cv
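The truncated line is presumably a call to cv2.PCACompute. A minimal sketch of that call, assuming OpenCV 3.x+ Python bindings (the signature differs slightly in older 2.4 releases) and that the stacked image matrix is converted to float32 first:

import numpy as np
import cv2 as cv

# Stand-in for the stacked image matrix: one 128*128 row vector per image, as float32.
matrix_test = np.random.rand(50, 128 * 128).astype(np.float32)

# In 3.x+ the binding is PCACompute(data, mean[, eigenvectors[, maxComponents]]) -> mean, eigenvectors.
mean, eigenvectors = cv.PCACompute(matrix_test, mean=None, maxComponents=10)
print(mean.shape, eigenvectors.shape)   # with these inputs: (1, 16384) and (10, 16384)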

Invalid parameter clf for estimator Pipeline in sklearn

Submitted anonymously (unverified) on 2019-12-03 01:36:02
Question: Could anyone check for problems with the following code? Am I wrong in any of the steps in building my model? I have already added the 'clf__' prefix to the parameters.

clf = RandomForestClassifier()
pca = PCA()
pca_clf = make_pipeline(pca, clf)

kfold = KFold(n_splits=10, random_state=22)

parameters = {'clf__n_estimators': [4, 6, 9],
              'clf__max_features': ['log2', 'sqrt', 'auto'],
              'clf__criterion': ['entropy', 'gini'],
              'clf__max_depth': [2, 3, 5, 10],
              'clf__min_samples_split': [2, 3, 5],
              'clf__min_samples_leaf': [1, 5, 8]}

grid_RF = GridSearchCV(pca_clf, param_grid=parameters,
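The likely cause of the "Invalid parameter clf" error is that make_pipeline names steps after the lowercased class names ('pca', 'randomforestclassifier'), so no step called 'clf' exists. A minimal sketch, with a trimmed parameter grid and made-up data, naming the step explicitly so the 'clf__' prefix matches:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Name the step 'clf' explicitly so parameters prefixed with 'clf__' are valid.
pca_clf = Pipeline([('pca', PCA()), ('clf', RandomForestClassifier())])

parameters = {'clf__n_estimators': [4, 6, 9],       # trimmed grid, same idea as the question
              'clf__max_depth': [2, 3, 5, 10]}

# Newer scikit-learn requires shuffle=True when random_state is set on KFold.
kfold = KFold(n_splits=10, random_state=22, shuffle=True)
grid_RF = GridSearchCV(pca_clf, param_grid=parameters, cv=kfold)

X, y = np.random.rand(100, 20), np.random.randint(0, 2, 100)   # made-up data
grid_RF.fit(X, y)
print(grid_RF.best_params_)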