pca

MATLAB Example: PCA Dimensionality Reduction

Submitted by 梦想与她 on 2019-12-01 01:47:20
A MATLAB program for the canopy clustering algorithm — 凯鲁嘎吉 - 博客园 http://www.cnblogs.com/kailugaji/

1. Introduction to the canopy clustering algorithm

Canopy clustering is a simple, fast, and reasonably accurate method for grouping objects into clusters. Each object is represented as a point in a multidimensional feature space. The algorithm uses a fast approximate distance metric and two distance thresholds T1 > T2. The basic procedure: start with a set of points, remove one at random, and create a canopy containing that point; then iterate over the remaining points. For each point, if its distance to the first point is less than T1, add it to the canopy. If, in addition, that distance is less than T2, remove the point from the candidate set entirely — points that close to the canopy center are excluded from all further processing and cannot become centers of other canopies. The algorithm loops until the initial set is empty, accumulating a collection of canopies, each containing one or more points; a single point may belong to more than one canopy.

Canopy can be used for clustering on its own, but its results are most valuable as input to more expensive clustering methods: it is more useful as a data-preprocessing step than as a standalone clusterer. Canopy clustering is often used as the initial step for stricter clustering techniques such as k-means. After the canopies are built, those containing few data points can be deleted, since they tend to contain outliers.

The steps of the Canopy algorithm are as follows: (1) Put all the data into a list and choose two distance thresholds T1, T2 with T1 > T2
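The original post continues with the MATLAB implementation; as a minimal illustrative sketch of the procedure described above (Python/NumPy, with plain Euclidean distance standing in for the "fast approximate" metric, and threshold values chosen arbitrarily):

```python
import numpy as np

def canopy(points, t1, t2, seed=0):
    """Group points into canopies using loose/tight thresholds t1 > t2."""
    assert t1 > t2, "T1 must be larger than T2"
    rng = np.random.default_rng(seed)
    candidates = list(range(len(points)))
    canopies = []
    while candidates:
        # pick a random remaining point as the next canopy center
        center = candidates[rng.integers(len(candidates))]
        dists = np.linalg.norm(points[candidates] - points[center], axis=1)
        # every candidate within T1 joins this canopy
        canopies.append([candidates[i] for i in np.flatnonzero(dists < t1)])
        # candidates within T2 (including the center) are removed for good
        candidates = [candidates[i] for i in np.flatnonzero(dists >= t2)]
    return canopies

X = np.random.default_rng(1).normal(size=(50, 2))
print(canopy(X, t1=1.5, t2=0.5))
```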

Matlab: how to find which variables from a dataset could be discarded using PCA in MATLAB?

Submitted by 寵の児 on 2019-12-01 01:43:06
Question: I am using PCA to find out which variables in my dataset are redundant due to being highly correlated with other variables. I am using the MATLAB function princomp on data previously normalized using zscore:

```matlab
[coeff, PC, eigenvalues] = princomp(zscore(x))
```

I know that the eigenvalues tell me how much of the dataset's variance each principal component explains, and that coeff tells me how much of the i-th original variable goes into the j-th principal component (where i indexes rows and j indexes columns). So I assumed that
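The question is MATLAB-specific, but the same quantities can be inspected in Python as a rough analogue; a hedged sketch with stand-in data (note that scikit-learn's components_ is the transpose of MATLAB's coeff):

```python
import numpy as np
from scipy.stats import zscore
from sklearn.decomposition import PCA

x = np.random.default_rng(0).normal(size=(100, 5))   # stand-in data
pca = PCA().fit(zscore(x))
# analogue of eigenvalues: variance carried by each principal component
print(pca.explained_variance_)
# analogue of coeff (transposed): row j holds the weights of the
# original variables in the j-th principal component
print(pca.components_)
```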

Sparse PCA (Sparse Principal Component Analysis)

Submitted by 旧城冷巷雨未停 on 2019-12-01 01:25:42
Sparse PCA (Sparse Principal Component Analysis) — 2016-12-06 16:58:38, by qilin2016. Original post (CC 4.0 BY-SA): https://blog.csdn.net/zhoudi2010/article/details/53489319

The original SPCA paper: H. Zou (2006), Sparse principal component analysis. For PCA background, see The Elements of Statistical Learning, Chapter 14. For the basic idea of principal component analysis and its application in R, see 稀疏主成分分析与R应用. On sparse methods in statistical learning, see Statistical Learning with Sparsity: The Lasso and Generalizations. A good set of notes: http://www.cs.utexas.edu/~rashish/sparse_pca.pdf

Let us look at the algorithm directly. Initialize A to V[, 1:k], i.e., the loading vectors of the first k principal components. For a given A = [α1, …, αk]
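The excerpt cuts off at the start of Zou's alternating algorithm. As a loosely related, hedged illustration of what sparse loadings look like in practice — scikit-learn's SparsePCA uses an ℓ1-penalized dictionary-learning formulation, not exactly Zou's elastic-net SPCA:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

X = np.random.default_rng(0).normal(size=(200, 10))  # stand-in data
X -= X.mean(axis=0)                                  # center the data first
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)
# unlike ordinary PCA loadings, many entries are exactly zero
print(spca.components_)
```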

Why did PCA reduce the performance of Logistic Regression?

Submitted by 佐手、 on 2019-11-30 23:45:10
I performed logistic regression on a binary classification problem with data of dimensions 50000 x 370 and got an accuracy of about 90%. But when I applied PCA before the logistic regression, my accuracy dropped to 10%. I was very shocked to see this result. Can anybody explain what could have gone wrong?

Answer: There is no guarantee that PCA will help, or that it will not harm the learning process. In particular, if you use PCA to reduce the number of dimensions, you are removing information from your data, so anything can happen: if the removed data was redundant, you will probably get better scores; if it was an important
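A small synthetic experiment makes the answer concrete; everything here (data size, number of components) is made up for illustration — whether PCA helps depends entirely on whether the discarded directions carry class information:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=370,
                           n_informative=20, random_state=0)

plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reduced = make_pipeline(StandardScaler(), PCA(n_components=10),
                        LogisticRegression(max_iter=1000))

# compare cross-validated accuracy with and without the PCA step
print("plain    :", cross_val_score(plain, X, y, cv=3).mean())
print("with PCA :", cross_val_score(reduced, X, y, cv=3).mean())
```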

psych: principal - loadings components

Submitted by 为君一笑 on 2019-11-30 21:44:51
My question concerns the principal() function in the psych package.

```r
set.seed(0)
x <- replicate(8, rnorm(10))
pca.x <- principal(x, nf=4, rotate="varimax")
```

I know that if I want to see the loadings table, I can use loading.x <- loadings(pca.x), and then I get the following results:

```r
> loading.x

Loadings:
     RC1    RC3    RC4    RC2
[1,] -0.892 -0.205  0.123
[2,]  0.154  0.158  0.909
[3,] -0.660  0.255 -0.249  0.392
[4,] -0.352  0.412  0.614 -0.481
[5,]  0.950 -0.208  0.117
[6,] -0.302  0.111  0.860
[7,]  0.852 -0.195 -0.358
[8,] -0.109  0.903  0.265

               RC1   RC3   RC4   RC2
SS loadings    2.323 1.934 1.373 1.342
Proportion Var 0
```
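For Python readers, a loosely analogous varimax-rotated loadings table can be produced with scikit-learn; note this is a hedged analogue only — FactorAnalysis fits a factor-analysis model, not the PCA-based solution psych::principal uses, so the numbers will differ:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

x = np.random.default_rng(0).normal(size=(10, 8))   # same shape as the R example
fa = FactorAnalysis(n_components=4, rotation="varimax").fit(x)
loadings = fa.components_.T                          # variables x factors, like the R table
print(np.round(loadings, 3))
```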

SVM Visualization in MATLAB

Submitted by 北战南征 on 2019-11-30 20:03:58
How do I visualize the SVM classification once I perform SVM training in MATLAB? So far, I have only trained the SVM with:

```matlab
% Labels are -1 or 1
groundTruth = Ytrain;
d = xtrain;
model = svmtrain(groundTruth, d);
```

Answer (Lordalcol): If you are using LIBSVM, you can plot the classification results:

```matlab
% Labels are -1 or 1
groundTruth = Ytrain;
d = xtrain;
figure
% plot training data
hold on;
pos = find(groundTruth==1);
scatter(d(pos,1), d(pos,2), 'r')
pos = find(groundTruth==-1);
scatter(d(pos,1), d(pos,2), 'b')
% now plot support vectors
hold on;
sv = full(model.SVs);
plot(sv(:,1), sv(:,2), 'ko');
% now plot
```

Adding principal components as variables to a data frame

Submitted by 泪湿孤枕 on 2019-11-30 19:00:58
I am working with a dataset of 10000 data points and 100 variables in R. Unfortunately, the variables I have do not describe the data well. I carried out a PCA analysis using prcomp(), and the first 3 PCs seem to account for most of the variability in the data. As far as I understand, a principal component is a combination of the original variables; it therefore takes a value for each data point and can be treated as a new variable. Would I be able to add these principal components as 3 new variables to my data? I would need them for further analysis. A reproducible
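In R, the $x matrix returned by prcomp holds exactly these per-observation scores, so they can be cbind-ed onto the data frame. As a hedged Python sketch of the same idea (stand-in data and column names):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(np.random.default_rng(0).normal(size=(100, 10)),
                  columns=[f"v{i}" for i in range(10)])   # stand-in data
scores = PCA(n_components=3).fit_transform(df)            # one row of scores per point
df[["PC1", "PC2", "PC3"]] = scores                        # add them as new variables
print(df.head())
```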

"Python Data Analysis and Data Mining" — Data Preprocessing

Submitted by 心不动则不痛 on 2019-11-30 18:25:18
These are reading notes on "Python Data Analysis and Data Mining".

In plain terms: the earlier data-analysis stage revealed the defects in the data, and what we do now is preprocess it. This covers four parts: data cleaning, data integration, data transformation, and data reduction. (The overall workflow is shown in a figure in the original post.)

1. Data cleaning

1) Missing-value handling: delete the records, impute the values, or do nothing. Doing nothing feels unsatisfying, and deleting records wastes data, so imputation is used most often. The book focuses on Lagrange interpolation and Newton interpolation and gives code for both. (The detailed derivations appear as screenshots in the original post.)

a. Lagrange interpolation
b. Newton interpolation

Because SciPy provides a ready-made Lagrange interpolation function, that approach is easier to implement and more widely used, whereas Newton interpolation has to be coded by hand. Note that the two give identical results — the same polynomial of the same degree with the same coefficients — just expressed in different forms.

Without further ado, here is the tested Python code:

```python
import pandas as pd
from scipy.interpolate import lagrange  # import the Lagrange interpolation function
import sys
sys.__stdout__ = sys.stdout

inputfile = 'catering_sale.xls'   # path to the sales data
outputfile = 'tmp/sales.xls'      # output path
data = pd.read_excel(inputfile, index_col=u'日期')  # read in the data
```
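The excerpt stops before the interpolation itself. A minimal sketch of this kind of column-wise Lagrange imputation (the window size k and the helper name are my own, and it assumes a plain integer index):

```python
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

def fill_lagrange(s, k=5):
    """Fill NaNs in Series s by Lagrange interpolation over up to 2k neighbours."""
    s = s.copy()
    for i in np.flatnonzero(s.isna().to_numpy()):
        window = s.iloc[max(0, i - k): i + k + 1].dropna()
        # fit the polynomial on the positions of the known neighbours
        poly = lagrange(window.index.to_numpy(dtype=float),
                        window.to_numpy(dtype=float))
        s.iloc[i] = poly(i)
    return s

sales = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, np.nan, 12.0])
print(fill_lagrange(sales))
```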

Implementing the PCA Algorithm Step by Step in Python

Submitted by 拥有回忆 on 2019-11-30 16:57:45
Implementing the PCA algorithm step by step in Python. Tags: PCA, Python.

This post is a translation, for learning purposes only, of http://sebastianraschka.com/Articles/2014_pca_step_by_step.html — "Implementing a Principal Component Analysis (PCA) – in Python, step by step", Apr 13, 2014, by Sebastian Raschka.

Introduction

The main purpose of principal component analysis (PCA) is to reduce the dimensionality of data by discovering patterns in it, with the guiding principle of minimizing information loss. The desired result is a mapping of the original feature space onto a lower-dimensional subspace that still represents the useful information in the data well. In pattern classification, we want dimensionality reduction to extract the feature subset that best expresses the data, cutting computation time and reducing parameter-estimation error.

Principal component analysis (PCA) vs. multiple discriminant analysis (MDA)

PCA and MDA are both linear transformation methods, and the two are closely related. In PCA we look for the components of maximum variance in the dataset; in MDA we are more interested in the directions that maximize the scatter between classes. In one sentence: with PCA we project the whole dataset (without class labels) onto a subspace, while with MDA we try to find the subspace that best separates the classes. Roughly speaking, PCA finds the axes of greatest variance (within a class
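The excerpt ends before the article's code; a minimal NumPy sketch of the step-by-step procedure the article builds up to (center, covariance, eigendecomposition, project onto the top-k eigenvectors), with stand-in data:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(40, 3))  # stand-in data, samples x features
k = 2

Xc = X - X.mean(axis=0)                    # 1. center each feature
cov = np.cov(Xc, rowvar=False)             # 2. covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigendecomposition (cov is symmetric)
order = np.argsort(eigvals)[::-1]          # 4. sort by decreasing eigenvalue
W = eigvecs[:, order[:k]]                  #    and keep the top-k eigenvectors
X_pca = Xc @ W                             # 5. project onto the k-dim subspace
print(X_pca.shape)                         # (40, 2)
```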