PCA

PCA algorithm implementation

北城以北 提交于 2019-12-04 07:11:13
If you are unfamiliar with PCA basics, first read this post: https://www.cnblogs.com/lliuye/p/9156763.html. The implementation is as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Load the data
data = np.genfromtxt("data.csv", delimiter=",")
x_data = data[:, 0]
y_data = data[:, 1]
plt.scatter(x_data, y_data)
plt.show()
print(x_data.shape)

# Center the data
def zeroMean(dataMat):
    # Mean of each column, i.e. the mean of each feature
    meanVal = np.mean(dataMat, axis=0)
    newData = dataMat - meanVal
    return newData, meanVal

newData, meanVal = zeroMean(data)
print(newData.shape)

# np.cov computes the covariance matrix. rowvar=0 means each row of the input
# is one sample; a non-zero rowvar means each column is one sample.
covMat = np.cov(newData, rowvar=0)
```
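The excerpt cuts off at the covariance step. A minimal sketch of the remaining steps (eigendecomposition, sorting, projection), using synthetic data in place of the post's data.csv, which isn't available here:

```python
import numpy as np

# Synthetic 2-D data standing in for the post's data.csv (an assumption)
rng = np.random.RandomState(0)
data = rng.randn(100, 2) @ np.array([[2.0, 0.0], [1.0, 0.5]])

# Center the data, as in zeroMean()
meanVal = np.mean(data, axis=0)
newData = data - meanVal

# Covariance matrix: rowvar=0 means one sample per row
covMat = np.cov(newData, rowvar=0)

# Eigendecomposition (eigh suits the symmetric covariance matrix),
# then sort eigenvectors by descending eigenvalue
eigVals, eigVects = np.linalg.eigh(covMat)
order = np.argsort(eigVals)[::-1]

# Keep the top k eigenvectors and project the centered data onto them
k = 1
P = eigVects[:, order[:k]]
lowDData = newData @ P
print(lowDData.shape)  # (100, 1)
```

To reconstruct an approximation of the original data, `lowDData @ P.T + meanVal` reverses the projection.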

Change loadings (arrows) length in PCA plot using ggplot2/ggfortify?

…衆ロ難τιáo~ 提交于 2019-12-04 07:01:29
I have been struggling with rescaling the loadings (arrows) length in a ggplot2/ggfortify PCA. I have looked around extensively for an answer to this, and the only information I have found either codes new biplot functions or refers to entirely different PCA packages (ggbiplot, factoextra), neither of which addresses the question I would like answered: is it possible to scale/change the size of PCA loadings in ggfortify? Below is the code I have to plot a PCA using stock R functions, as well as the code to plot a PCA using autoplot/ggfortify. You'll notice in the stock R plots I can scale

TypeError in grid search

让人想犯罪 __ 提交于 2019-12-04 05:32:47
Question: I used to write a loop to find the best parameters for my model, which kept introducing coding errors, so I decided to use GridSearchCV. I am trying to find the best parameters for PCA in my model (the only parameter I want to grid search on). In this model, after normalization I want to combine the original features with the PCA-reduced features and then apply a linear SVM. Then I save the whole model to predict my input on. I have an error in the line where I try to fit the data so
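A runnable sketch of what the question describes (the dataset and all names here are assumptions, not the asker's code): normalization, a FeatureUnion that concatenates the untouched features with the PCA-reduced ones, a linear SVM, and a grid search over only PCA's n_components:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# FunctionTransformer() with no function is the identity: it passes the
# normalized features through unchanged, alongside the PCA branch
features = FeatureUnion([
    ("identity", FunctionTransformer()),
    ("pca", PCA()),
])
model = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("svm", LinearSVC(dual=False)),
])

# Grid over the PCA parameter only; note the step__substep__param naming
param_grid = {"features__pca__n_components": [1, 2, 3]}
search = GridSearchCV(model, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The double-underscore path (`features__pca__n_components`) is how GridSearchCV reaches a parameter nested inside a Pipeline/FeatureUnion; a TypeError at fit time often means the grid keys don't match the pipeline's step names.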

04_Dimensionality Reduction

有些话、适合烂在心里 提交于 2019-12-04 04:28:27
04 Dimensionality Reduction

Dimensionality reduction: reducing the number of features.
- Feature selection
- Principal component analysis

Feature selection:

Why select features?
- Redundancy: some features are highly correlated with each other, which wastes computation.
- Noise: some features distort the computed result.

What is feature selection? Definition: feature selection simply picks a subset of all extracted features to use as the training-set features. The feature values may or may not change during selection, but the selected feature dimension is always smaller than before, since we keep only part of the features.

Main methods:
- Filter: VarianceThreshold (filtering by variance)
- Embedded: regularization, decision trees
- Wrapper

The VarianceThreshold module:

```python
from sklearn.feature_selection import VarianceThreshold

def var():
    """
    Feature selection: remove low-variance features
    :return: None
    """
    var = VarianceThreshold(threshold=0.0)  # choose the threshold to fit your data
    data = var.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
    print(data)
    return None

if __name__ == '__main__':
    var()
```

Principal Component Analysis (PCA)
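A companion sketch for the PCA section that follows, applying scikit-learn's PCA to the same toy data used in the VarianceThreshold example:

```python
from sklearn.decomposition import PCA

# Reduce the 3x4 toy data to 2 principal components; n_components can also
# be a float in (0, 1), meaning "keep this fraction of the variance"
pca = PCA(n_components=2)
data = pca.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
print(data.shape)  # (3, 2)
```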

R - 'princomp' can only be used with more units than variables

ε祈祈猫儿з 提交于 2019-12-04 03:21:24
I am using R software (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying kmeans clustering and plotting on a graph: "'princomp' can only be used with more units than variables". I then created a test doc of 10 rows and 10 columns which plots fine, but when I add an extra column I get the error again. Why is this? I need to be able to plot my cluster. When I view my data set after performing kmeans on it I can see the extra results column which shows which clusters they belong to. Is there anything I
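The error comes from `princomp`, which eigendecomposes the p x p covariance matrix and requires more units (rows) than variables (columns); R's `prcomp` is SVD-based and has no such restriction. The same SVD route, sketched in numpy with shapes matching the question (the data itself is an assumption):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 800)          # more variables than units, as in the question
Xc = X - X.mean(axis=0)          # center each column

# Thin SVD: no 800x800 covariance matrix is ever formed
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                   # principal component scores
# After centering, at most n-1 = 199 components carry any variance
print(scores.shape)              # (200, 200)
```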

Is this the right way of projecting the training set into the eigenspace? MATLAB

梦想与她 提交于 2019-12-04 02:00:26
Question: I have computed PCA using the following:

```matlab
function [signals,V] = pca2(data)
[M,N] = size(data);
data = reshape(data, M*N,1);
% subtract off the mean for each dimension
mn = mean(data,2);
data = bsxfun(@minus, data, mean(data,1));
% construct the matrix Y
Y = data'*data / (M*N-1);
[V D] = eigs(Y, 10); % reduce to 10 dimensions
% project the original data
signals = data * V;
```

My question is: is "signals" the projection of the training set into the eigenspace? I saw in "Amir Hossein" code that
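For comparison, the conventional projection in numpy (a sketch with assumed shapes, not the asker's image pipeline): rows are observations, the top eigenvectors of the covariance matrix span the eigenspace, and the projected "signals" are the centered data times those eigenvectors:

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(50, 12)                  # 50 observations, 12 features (assumed)
centered = data - data.mean(axis=0)
cov = centered.T @ centered / (len(data) - 1)

eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: ascending eigenvalues
V = eigvecs[:, ::-1][:, :10]              # top-10 eigenvectors, like eigs(Y, 10)
signals = centered @ V                    # projection into the eigenspace
print(signals.shape)                      # (50, 10)
```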

How can I use PCA/SVD in Python for feature selection AND identification?

六眼飞鱼酱① 提交于 2019-12-03 21:18:33
I'm following "Principal component analysis in Python" to use PCA under Python, but am struggling with determining which features to choose (i.e. which of my columns/features have the best variance). When I use scipy.linalg.svd, it automatically sorts my singular values, so I can't tell which column they belong to. Example code:

```python
import numpy as np
from scipy.linalg import svd

M = [[1, 1, 1, 1, 1, 1],
     [3, 3, 3, 3, 3, 3],
     [2, 2, 2, 2, 2, 2],
     [9, 9, 9, 9, 9, 9]]
M = np.transpose(np.array(M))
U, s, Vt = svd(M, full_matrices=False)
print(s)
```

Is there a different way to go about this without the
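The sorting itself isn't the obstacle: the rows of `Vt` stay aligned with the sorted singular values, so `|Vt[i, j]|` is the weight of original column j in component i. A sketch on the question's own matrix (note the data is not centered here, so this is raw SVD rather than PCA proper):

```python
import numpy as np

M = np.array([[1, 1, 1, 1, 1, 1],
              [3, 3, 3, 3, 3, 3],
              [2, 2, 2, 2, 2, 2],
              [9, 9, 9, 9, 9, 9]], dtype=float).T   # 6 samples x 4 columns

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Which original column contributes most to the top component?
# abs() guards against SVD's arbitrary sign flips.
top_feature = int(np.argmax(np.abs(Vt[0])))
print(top_feature)  # 3 (the all-9s column dominates)
```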

The PCA dimensionality-reduction algorithm

倾然丶 夕夏残阳落幕 提交于 2019-12-03 20:41:24
To summarize the steps of the PCA algorithm, given m rows of n-dimensional data:

1) Arrange the original data by columns into an n x m matrix X.
2) Zero-mean each row of X (each row is one attribute) by subtracting that row's mean.
3) Compute the covariance matrix.
4) Compute its eigenvalues and the corresponding eigenvectors.
5) Stack the eigenvectors as rows, ordered top to bottom by decreasing eigenvalue, and take the first k rows to form the matrix P.
6) Y = PX is the data reduced to k dimensions.

The main advantages of PCA:
- It measures information purely by variance and is unaffected by anything outside the data set.
- The principal components are mutually orthogonal, which removes interactions between components of the original data.
- It is simple to compute; the main operation is an eigendecomposition, which is easy to implement.

The main drawbacks of PCA:
- The meaning of each principal-component dimension is somewhat fuzzy and less interpretable than the original features.
- Low-variance, non-principal components may still carry important information about sample differences, and discarding them can hurt later processing.

Source: https://www.cnblogs.com/zjuhaohaoxuexi/p/11808120.html
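The six steps above, sketched in numpy with the post's convention (X is n x m with one sample per column, and Y = PX):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 100)                     # step 1: n=5 features, m=100 samples

X = X - X.mean(axis=1, keepdims=True)     # step 2: zero-mean each row
C = X @ X.T / X.shape[1]                  # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # step 4: eigenvalues/eigenvectors
k = 2
P = eigvecs[:, ::-1][:, :k].T             # step 5: top-k eigenvectors as rows
Y = P @ X                                 # step 6: k x m reduced data
print(Y.shape)                            # (2, 100)
```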

PCA analysis using Correlation Matrix as input in R

删除回忆录丶 提交于 2019-12-03 20:24:05
Now I have a 7000*7000 correlation matrix and I have to do PCA on this in R. I used CorPCA <- princomp(covmat=xCor), where xCor is the correlation matrix, but it comes out "covariance matrix is not non-negative definite". Is it because I have some negative correlations in that matrix? I am wondering which built-in function in R I can use to get the PCA result. Sudeep Juvekar: "not non-negative definite" does not mean the covariance matrix has negative correlations. It's the linear-algebra equivalent of trying to take the square root of a negative number! You can't tell by looking at a few values of the
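`princomp(covmat=...)` essentially eigendecomposes the supplied matrix. A numpy sketch of the same computation that also clips the small negative eigenvalues a merely near-PSD correlation matrix can have (the 7000x7000 matrix is replaced here by a small assumed one):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(6, 20)
xCor = np.corrcoef(A)                     # a valid 6x6 correlation matrix

eigvals, eigvecs = np.linalg.eigh(xCor)
eigvals = np.clip(eigvals, 0.0, None)     # repair rounding-induced negatives
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order]              # principal axes (like princomp loadings)
sdev = np.sqrt(eigvals[order])            # component standard deviations
print(sdev.shape)                         # (6,)
```

If the real 7000x7000 matrix is genuinely indefinite (e.g. assembled from pairwise-complete correlations), it must first be repaired to the nearest PSD matrix; clipping eigenvalues as above is the simplest such repair.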

Configuring biplot in Matlab to distinguish in scatter

心不动则不痛 提交于 2019-12-03 16:22:38
My original data is a 195x22 record set containing vocal measurements of people having Parkinson's disease or not. In a 195x1 vector, I have a status which is either 1/0. Now, I have performed a PCA and I do a biplot, which turns out well. The problem is that I can't tell which dots in my scatter plot originate from a sick or a healthy person (I can't link them with status). I would like my scatter plot to have a red dot if healthy (status=0) and a green one if sick (status=1). How would I do that? My biplot code is:

```matlab
biplot(coeff(:,1:2), ...
    'Scores', score(:,1:2), ...
    'VarLabels', Labels, ...
```
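This isn't MATLAB's biplot, but the same coloring idea sketched in Python (only the shapes match the question; the values are assumptions): compute PCA scores, then color each point red for status 0 (healthy) and green for status 1 (sick):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                     # headless backend for this sketch
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
X = rng.randn(195, 22)                    # stand-in for the 195x22 record set
status = rng.randint(0, 2, 195)           # stand-in for the 195x1 status vector

# PCA scores via SVD of the centered data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
score = U * s

# One color per point, keyed on status
colors = np.where(status == 1, "green", "red")
plt.scatter(score[:, 0], score[:, 1], c=colors)
plt.savefig("biplot_scores.png")
print(score.shape)                        # (195, 22)
```

In MATLAB itself, the analogous move is two `scatter` calls over `score(status==0, 1:2)` and `score(status==1, 1:2)` with `hold on`, since `biplot` does not take per-point colors directly.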