pca

Python PCA plot using Hotelling's T2 for a confidence interval

有些话、适合烂在心里 submitted on 2019-12-05 15:25:47
I am trying to apply PCA for multivariate analysis and plot the scores of the first two components with a Hotelling's T2 confidence ellipse in Python. I was able to get the scatter plot, and I want to add a 95% confidence ellipse to it. It would be great if anyone knows how this can be done in Python. Sample picture of expected output: [image not included in the excerpt]

This was bugging me, so I adapted an answer from "PCA and Hotelling's T^2 for confidence intervall in R" to Python (using some source code from the ggbiplot R package):

    from sklearn import decomposition
    from sklearn.preprocessing import StandardScaler
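
For reference, a minimal sketch of the usual construction on synthetic data (all names here are my own; the ellipse radius uses the F-distribution form of the Hotelling's T2 limit for two components):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the real data: n samples, p features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))

    scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
    n = scores.shape[0]

    # 95% Hotelling's T2 limit for 2 components (F-distribution form)
    t2_lim = 2 * (n - 1) / (n - 2) * stats.f.ppf(0.95, 2, n - 2)

    # PCA scores are uncorrelated, so the ellipse is axis-aligned:
    # semi-axis along PCk = std(PCk) * sqrt(t2_lim)
    theta = np.linspace(0, 2 * np.pi, 200)
    sx, sy = scores.std(axis=0, ddof=1)
    plt.scatter(scores[:, 0], scores[:, 1], s=12)
    plt.plot(np.sqrt(t2_lim) * sx * np.cos(theta),
             np.sqrt(t2_lim) * sy * np.sin(theta), 'r--',
             label='95% Hotelling T2')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.legend()
    plt.show()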

PySpark PCA: avoiding NotConvergedException

。_饼干妹妹 submitted on 2019-12-05 13:13:05
I'm attempting to reduce a wide dataset (51 features, ~1300 individuals) using PCA through the ml.linalg method, as follows:

1) Named my columns as one list:

    features = indi_prep_df.select([c for c in indi_prep_df.columns if c not in {'indi_nbr','label'}]).columns

2) Imported the necessary libraries:

    from pyspark.ml.feature import PCA as PCAML
    from pyspark.ml.linalg import Vector
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.linalg import DenseVector

3) Collapsed the features to a DenseVector:

    indi_feat = indi_prep_df.rdd.map(lambda x: (x[0], x[-1], DenseVector(x[1:-2]))).toDF([
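
For comparison, a minimal sketch of the DataFrame-based route, which avoids the RDD round-trip entirely; the number of components and the intermediate column names are my own choices, and scaling before PCA sometimes helps numerical behaviour, though it is not a guaranteed fix for NotConvergedException:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import PCA as PCAML, StandardScaler, VectorAssembler

    # Assemble the 51 feature columns into a single vector column
    assembler = VectorAssembler(inputCols=features, outputCol="feat_vec")

    # Centering/scaling before PCA sometimes improves conditioning
    scaler = StandardScaler(inputCol="feat_vec", outputCol="scaled_vec",
                            withMean=True, withStd=True)

    pca = PCAML(k=10, inputCol="scaled_vec", outputCol="pca_vec")

    model = Pipeline(stages=[assembler, scaler, pca]).fit(indi_prep_df)
    reduced = model.transform(indi_prep_df).select("indi_nbr", "label", "pca_vec")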

PCA of RGB Image

南笙酒味 submitted on 2019-12-05 11:36:56
I'm trying to figure out how to use PCA to decorrelate an RGB image in Python. I'm using the code found in the O'Reilly computer vision book:

    from PIL import Image
    from numpy import *

    def pca(X):
        # Principal Component Analysis
        # input: X, matrix with training data as flattened arrays in rows
        # return: projection matrix (with important dimensions first),
        #         variance and mean

        # get dimensions
        num_data, dim = X.shape

        # center data
        mean_X = X.mean(axis=0)
        for i in range(num_data):
            X[i] -= mean_X

        if dim > 100:
            print('PCA - compact trick used')
            M = dot(X, X.T)  # covariance matrix
            e, EV = linalg.eigh(M)
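
As a complementary illustration (my own sketch, not the book's code): decorrelating the three colour channels directly with sklearn, treating every pixel as a sample with R, G, B as its features; the file name is hypothetical:

    import numpy as np
    from PIL import Image
    from sklearn.decomposition import PCA

    img = np.asarray(Image.open('photo.jpg'), dtype=float)  # shape (H, W, 3)
    h, w, _ = img.shape

    pixels = img.reshape(-1, 3)                 # one row per pixel, columns R, G, B
    decorrelated = PCA(n_components=3).fit_transform(pixels)
    img_pca = decorrelated.reshape(h, w, 3)     # channel-decorrelated image

    # Off-diagonal correlations of the new channels are ~0
    print(np.round(np.corrcoef(decorrelated.T), 3))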

R: ggfortify: “Objects of type prcomp not supported by autoplot”

拈花ヽ惹草 submitted on 2019-12-05 10:09:57
I am trying to use ggfortify to visualize the results of a PCA I did using prcomp. Sample code:

    iris.pca <- iris[c(1, 2, 3, 4)]
    autoplot(prcomp(iris.pca))

    Error: Objects of type prcomp not supported by autoplot. Please use qplot() or ggplot() instead.

What is odd is that autoplot is specifically designed to handle the results of prcomp; ggplot and qplot can't handle objects like this. I'm running R version 3.2 and just downloaded ggfortify off GitHub this morning. Can anyone explain this message?

I'm guessing that you didn't load the required libraries; the code below:

    library(devtools)
    install

A Comprehensive Overview of Hyperspectral Remote Sensing Imagery

て烟熏妆下的殇ゞ submitted on 2019-12-05 07:35:02
Preface

This material brings together the basics of hyperspectral remote-sensing imagery: concepts and definitions, analysis and processing, and classification and recognition. Part 1 introduces the general principles of hyperspectral imagery; Part 2 discusses noise in hyperspectral images; Part 3 covers data redundancy and dimensionality reduction as the remedy for it; Part 4 covers the mixed-pixel problem, with an introduction to spectral unmixing; Parts 5 and 6 describe the characteristics, workflow, and common algorithms of supervised and unsupervised classification of hyperspectral images, respectively.

1. Basic introduction

Hyperspectral remote sensing combines imaging with spectroscopy to acquire multidimensional information: it simultaneously captures the two-dimensional spatial geometry of a target and its one-dimensional spectral information, yielding continuous, narrow-band image data at high spectral resolution. Hyperspectral images differ from high-spatial-resolution images and from multispectral images.

Advantages of hyperspectral data for recognition:

- High spectral resolution and many bands: a nearly continuous spectral curve can be obtained for each ground object, and specific bands can be selected or extracted to highlight target features;
- At the same spatial resolution, the spectral coverage is wider, so more of an object's responses to electromagnetic radiation can be detected;
- The large number of bands makes cross-calibration between bands convenient;
- Quantitative, continuous spectral curves make it possible to bring physical models of object spectra into image classification;
- The data carry rich radiometric, spatial, and spectral information, a comprehensive vehicle for many kinds of information.

Difficulties of hyperspectral data for recognition:

- Large data volume: an image contains tens to hundreds of bands, hundreds of times the data of a single-band remote-sensing image; the data are also highly redundant, and mishandling the redundancy can actually hurt classification accuracy;

Is it good to normalize/standardize data that has a large number of zero-valued features?

有些话、适合烂在心里 submitted on 2019-12-05 07:33:19
I have data with around 60 features, most of which are zero most of the time; in my training data only 2-3 columns may have values (to be precise, it's perf log data). However, my test data will have values in some other columns. I've done normalization/standardization (tried both separately) and fed the result to PCA/SVD (tried both separately). I used these features to fit my model, but it is giving very inaccurate results. Whereas, if I skip the normalization/standardization step and directly feed
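
One frequent pitfall here is fitting the scaler on training and test data separately: the statistics must come from the training set alone and be reused on the test set. A minimal sketch (synthetic mostly-zero data; MaxAbsScaler is one option worth trying, since unlike StandardScaler it keeps zeros at zero):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MaxAbsScaler

    rng = np.random.default_rng(0)
    X_train = rng.random((200, 60)) * (rng.random((200, 60)) < 0.05)  # mostly zeros
    X_test = rng.random((50, 60)) * (rng.random((50, 60)) < 0.05)

    # Fit scaler + PCA on training data only, then reuse them on test data
    pipe = make_pipeline(MaxAbsScaler(), PCA(n_components=10))
    Z_train = pipe.fit_transform(X_train)
    Z_test = pipe.transform(X_test)  # never fit_transform on test data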

How can I use PCA/SVD in Python for feature selection AND identification?

浪尽此生 submitted on 2019-12-05 06:59:48
I'm following "Principal component analysis in Python" to use PCA in Python, but am struggling to determine which features to choose (i.e. which of my columns/features carry the most variance). When I use scipy.linalg.svd, it automatically sorts my singular values, so I can't tell which columns they belong to. Example code:

    import numpy as np
    from scipy.linalg import svd

    M = [[1, 1, 1, 1, 1, 1],
         [3, 3, 3, 3, 3, 3],
         [2, 2, 2, 2, 2, 2],
         [9, 9, 9, 9, 9, 9]]
    M = np.transpose(np.array(M))
    U
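
Singular values belong to directions (linear combinations of columns), not to individual columns, but the rows of V^T give the loadings, i.e. how strongly each original column contributes to each direction. A minimal sketch of ranking columns by their loading on the first component (data and names are my own):

    import numpy as np
    from scipy.linalg import svd

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 6)) * np.array([1, 5, 1, 1, 3, 1])  # cols 1 and 4 vary most

    Xc = X - X.mean(axis=0)          # centre columns before SVD/PCA
    U, s, Vt = svd(Xc, full_matrices=False)

    # Vt[k] holds the loadings of principal direction k on the original columns
    top_pc = np.abs(Vt[0])
    print("column ranking for PC1:", np.argsort(top_pc)[::-1])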

classification: PCA and logistic regression using sklearn

六眼飞鱼酱① submitted on 2019-12-05 05:39:01
Step 0: Problem description

I have a classification problem, i.e. I want to predict a binary target based on a collection of numerical features, using logistic regression after running a principal component analysis (PCA). I have two datasets, df_train and df_valid (training set and validation set respectively), as pandas data frames containing the features and the target. As a first step, I used the pandas get_dummies function to transform all the categorical variables into booleans. For example, I would have:

    n_train = 10
    np.random.seed(0)
    df_train = pd.DataFrame({"f1":np.random.random(n
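
One common way to wire these steps together is an sklearn Pipeline, so that the scaler and PCA fitted on the training set are reused unchanged on the validation set. A minimal sketch with synthetic frames standing in for df_train and df_valid:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    df_train = pd.DataFrame(rng.normal(size=(100, 5)),
                            columns=[f"f{i}" for i in range(5)])
    y_train = rng.integers(0, 2, size=100)
    df_valid = pd.DataFrame(rng.normal(size=(30, 5)), columns=df_train.columns)
    y_valid = rng.integers(0, 2, size=30)

    clf = Pipeline([
        ("scale", StandardScaler()),     # PCA is scale-sensitive
        ("pca", PCA(n_components=3)),
        ("logreg", LogisticRegression()),
    ])
    clf.fit(df_train, y_train)
    print("validation accuracy:", clf.score(df_valid, y_valid))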

PCA: Principal Component Analysis (Maximum Projection Variance)

江枫思渺然 submitted on 2019-12-05 01:09:46
Introduction to PCA: extract from n-dimensional data the m vectors that best represent the data set; that is, reduce the dimensionality of the data (n -> m) and extract features.

Goal: find a vector \(\mu\) such that the variance of the projections of the n points onto it is maximized (the more spread out the projected data are, the less redundant information the vectors share with one another, which is what achieves the dimensionality reduction).

Setup:

The full data set: \[A = (x_1, x_2, \cdots, x_n)\]

The covariance of \(X\): \[C = \mathrm{Cov}(X) = \frac{1}{n}\sum_{i=1}^n (x_i-\overline{x})(x_i-\overline{x})^T\]

The vector \(\mu\): \[|\mu| = 1 \Rightarrow \mu^T\mu = 1\]

Proof: the projection of \(x_i\) onto \(\mu\) is \[(x_i-\overline{x})^T\mu\] Since the \((x_i-\overline{x})\) have mean 0, the variance \(J\) of the projections is \[J = \frac{1}{n}\sum_{i=1}^n \left((x_i-\overline{x})^T\mu\right)^2\] And because each squared term is a scalar (equal to its own transpose), \(J\) can be rewritten as \[J = \frac{1}{n}\sum_{i=1}^n \mu^T(x_i-\overline{x})(x_i-\overline{x})^T\mu = \mu^T C\mu\]
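
To finish the argument (the excerpt breaks off at this point; the remaining step is the standard one): maximize \(J = \mu^T C\mu\) subject to the constraint \(\mu^T\mu = 1\) using a Lagrange multiplier \(\lambda\): \[L(\mu,\lambda) = \mu^T C\mu - \lambda(\mu^T\mu - 1), \qquad \frac{\partial L}{\partial \mu} = 2C\mu - 2\lambda\mu = 0 \;\Rightarrow\; C\mu = \lambda\mu\] So any stationary \(\mu\) is an eigenvector of \(C\), and since \(J = \mu^T C\mu = \lambda\mu^T\mu = \lambda\), the variance of a projection equals the corresponding eigenvalue; the maximizing direction is the eigenvector of \(C\) with the largest eigenvalue, and the first m principal components are the top m eigenvectors.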

Using Principal Components Analysis (PCA) on binary data

不问归期 submitted on 2019-12-04 20:42:32
I am using PCA on binary attributes to reduce the dimensionality (number of attributes) of my problem. The initial dimensionality was 592, and after PCA it is 497. I used PCA before, on numeric attributes in another problem, and it managed to reduce the dimensionality to a greater extent (half of the initial dimensions). I believe that binary attributes decrease the power of PCA, but I do not know why. Could you please explain why PCA does not work as well as on numeric data? Thank you.

The principal components of 0/1 data can fall off slowly or rapidly, and the PCs of continuous data too — it
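
A quick way to check this on your own matrices is the cumulative explained-variance curve: how fast it rises depends on the correlation structure of the attributes rather than on binariness as such. A minimal sketch on synthetic data (independent columns in both cases, so neither compresses well):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_cont = rng.normal(size=(500, 50))                    # continuous features
    X_bin = (rng.random((500, 50)) < 0.5).astype(float)    # independent 0/1 features

    for name, X in [("continuous", X_cont), ("binary", X_bin)]:
        ratios = PCA().fit(X).explained_variance_ratio_
        k95 = int(np.searchsorted(np.cumsum(ratios), 0.95)) + 1
        print(f"{name}: {k95} components to keep 95% of the variance")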