pca

C++ - framework for computing PCA (other than armadillo)

梦想的初衷 submitted on 2019-12-07 22:37:54
Question: I have a large dataset of around 200,000 data points, each containing 132 features, so my dataset is 200000 x 132. I have done all the computations using the Armadillo framework. However, when I tried to run a PCA analysis I received a memory error, and I don't know whether it is caused by my RAM (8 GB) or by a limitation of the framework itself. The error I receive is: requested size is too large. Can you recommend another framework
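Aside: if the goal is simply to get PCA running on a matrix of this size, scikit-learn's IncrementalPCA decomposes the data in mini-batches, so the full 200000 x 132 matrix never has to be held in one decomposition. A minimal sketch, with a smaller simulated matrix standing in for the real data:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Simulated stand-in for the 200000 x 132 dataset (smaller here for speed).
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 132))

# IncrementalPCA processes the data in mini-batches, so peak memory is
# bounded by the batch size rather than by the full matrix.
ipca = IncrementalPCA(n_components=10, batch_size=1000)
for start in range(0, X.shape[0], 1000):
    ipca.partial_fit(X[start:start + 1000])

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (5000, 10)
```

With the real data, the batches could equally well be read from disk one at a time, so the 8 GB of RAM is never a constraint.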

How to calculate the volume of the intersection of ellipses in r

↘锁芯ラ submitted on 2019-12-07 12:43:41
Question: I was wondering how to calculate the intersection between two ellipses, e.g. the volume of the intersection between versicolor and virginica as illustrated in this graph, which is plotted using the following MWE based on this tutorial: data(iris) log.ir <- log(iris[, 1:4]) ir.species <- iris[, 5] ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE) library(ggbiplot) g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, groups = ir.species, ellipse = TRUE, circle = TRUE) g <- g + scale_color
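For the area of overlap between two ellipses, one simple approach that sidesteps the geometry entirely is Monte Carlo sampling. The sketch below uses two hypothetical ellipses standing in for the versicolor/virginica confidence ellipses; the centers, axis lengths, and rotation are made-up values, not the ones ggbiplot computes:

```python
import numpy as np

def in_ellipse(pts, center, a, b, theta):
    """True where points fall inside an ellipse with semi-axes (a, b) rotated by theta."""
    c, s = np.cos(theta), np.sin(theta)
    d = pts - center
    x = d[:, 0] * c + d[:, 1] * s
    y = -d[:, 0] * s + d[:, 1] * c
    return (x / a) ** 2 + (y / b) ** 2 <= 1.0

rng = np.random.default_rng(0)
# Hypothetical ellipses; real parameters would come from the fitted contours.
e1 = (np.array([0.0, 0.0]), 2.0, 1.0, 0.0)
e2 = (np.array([1.0, 0.0]), 2.0, 1.0, np.pi / 6)

# Sample uniformly over a bounding box covering both ellipses; the fraction
# of samples inside both, times the box area, estimates the overlap area.
lo, hi = np.array([-3.0, -3.0]), np.array([4.0, 3.0])
pts = rng.uniform(lo, hi, size=(200_000, 2))
inside = in_ellipse(pts, *e1) & in_ellipse(pts, *e2)
area = inside.mean() * np.prod(hi - lo)
print(round(area, 2))  # rough Monte Carlo estimate of the overlap area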

How to find most contributing features to PCA?

你离开我真会死。 submitted on 2019-12-07 05:56:57
Question: I am running PCA on my data (~250 features) and see that all points are clustered in 3 blobs. Is it possible to see which of the 250 features contribute most to the outcome? If so, how? (I am using the scikit-learn implementation.) Answer 1: Let's see what Wikipedia says: PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate
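In scikit-learn terms, the answer is usually to inspect `pca.components_`: each row holds the weight (loading) of every original feature on one component, so the largest absolute weights identify the most contributing features. A small sketch on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)

# Loadings: absolute weight of each original feature on each component,
# shape (n_components, n_features).
loadings = np.abs(pca.components_)
top = loadings[0].argsort()[::-1]  # features ranked by |weight| on PC1
print(top[:3])                     # indices of the 3 biggest contributors
```

With 250 features the same ranking applies unchanged; mapping the indices back to column names gives the interpretable answer.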

R: ggfortify: “Objects of type prcomp not supported by autoplot”

五迷三道 submitted on 2019-12-07 04:38:34
Question: I am trying to use ggfortify to visualize the results of a PCA I did using prcomp. Sample code: iris.pca <- iris[c(1, 2, 3, 4)] autoplot(prcomp(iris.pca)) Error: Objects of type prcomp not supported by autoplot. Please use qplot() or ggplot() instead. What is odd is that autoplot is specifically designed to handle the results of prcomp; ggplot and qplot can't handle objects like this. I'm running R version 3.2 and just downloaded ggfortify off GitHub this morning. Can anyone explain this

classification: PCA and logistic regression using sklearn

浪子不回头ぞ submitted on 2019-12-07 00:48:36
Question: Step 0: Problem description. I have a classification problem, i.e. I want to predict a binary target based on a collection of numerical features, using logistic regression after running a Principal Components Analysis (PCA). I have 2 datasets, df_train and df_valid (training set and validation set respectively), as pandas data frames containing the features and the target. As a first step, I used the get_dummies pandas function to transform all the categorical variables into booleans. For
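A common way to wire these steps together in scikit-learn is a Pipeline, which guarantees that the scaler and the PCA are fit on the training set only and merely applied to the validation set. A sketch with synthetic data standing in for df_train / df_valid:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for df_train / df_valid.
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Fitting scaler and PCA inside a Pipeline ensures the validation set is
# transformed with statistics learned from the training set only.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
acc = clf.score(X_valid, y_valid)
print(acc)
```

With real data frames, `clf.fit(df_train[features], df_train[target])` works the same way, and the whole pipeline can be cross-validated as a single estimator.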

Eigenvalue Decomposition and Singular Value Decomposition: Meaning and Applications

ぃ、小莉子 submitted on 2019-12-06 22:03:10
The geometric meaning of eigenvalues and eigenvectors. What is matrix multiplication? Don't just tell me it is "the rows of the first matrix times the columns of the second"; those who know a little more might add that "the number of columns of the first matrix must equal the number of rows of the second", but all of that is surface. What matrix multiplication really means is transformation. The very first things we learn in linear algebra are row and column operations, and that is the heart of the subject: matrix multiplication is a linear transformation. Taking one set of vectors A as the object, the effect of B is mainly to change A in the following ways:

1. Scaling

clf;
A = [0, 1, 1, 0, 0;...
     1, 1, 0, 0, 1];   % original space
B = [3 0; 0 2];        % linear transformation matrix
plot(A(1,:), A(2,:), '-*'); hold on
grid on; axis([0 3 0 3]); gtext('before transformation');
Y = B * A;
plot(Y(1,:), Y(2,:), '-r*');
grid on; axis([0 3 0 3]); gtext('after transformation');

As the figure shows, the y direction is stretched by a factor of 2 and the x direction by a factor of 3; this is the work of B = [3 0; 0 2], and 3 and 2 are the scaling ratios. Note that here, apart from the diagonal elements, which hold the per-dimension scaling factors, all off-diagonal elements of B are 0; as we will see next, nonzero off-diagonal elements produce shear and rotation effects.

2. Shear

clf;
A = [0, 1, 1, 0, 0;...
     1, 1, 0
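The scaling claim can be checked numerically: for a diagonal matrix such as B, the eigenvalues are exactly the per-axis stretch factors and the eigenvectors are the coordinate axes. A small NumPy sketch of the B from the MATLAB example:

```python
import numpy as np

# The scaling matrix from the example above.
B = np.array([[3.0, 0.0],
              [0.0, 2.0]])

# Eigendecomposition: for a diagonal matrix the eigenvalues are exactly
# the per-axis stretch factors.
vals, vecs = np.linalg.eig(B)
print(sorted(vals.tolist()))  # [2.0, 3.0]

# Applying B to a vector stretches its x component by 3 and its y by 2.
v = np.array([1.0, 1.0])
print(B @ v)                  # [3. 2.]
```

This is the picture that carries over to PCA: the covariance matrix plays the role of B, and its eigenvectors/eigenvalues give the directions and magnitudes of stretch in the data.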

Using Principal Components Analysis (PCA) on binary data

懵懂的女人 submitted on 2019-12-06 17:05:47
Question: I am using PCA on binary attributes to reduce the dimensionality (number of attributes) of my problem. The initial dimensionality was 592, and after PCA it is 497. I used PCA before on numeric attributes in another problem, and there it reduced the dimensionality to a much greater extent (to half the initial dimensions). I believe that binary attributes reduce the power of PCA, but I do not know why. Could you please explain why PCA does not work as well here as it does on numeric data? Thank you. Answer 1:
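One way to see the effect numerically is to binarize correlated numeric data and compare how many components each version needs to reach a given share of the variance: thresholding to 0/1 discards magnitude information, which tends to spread variance over more components. The sketch below uses synthetic data, not the 592-attribute problem from the question:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d, k = 2000, 60, 5

# Correlated numeric data: a rank-k latent structure plus small noise.
latent = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
numeric = latent + 0.1 * rng.standard_normal((n, d))
# Thresholding the same data to 0/1 throws away magnitude information.
binary = (numeric > 0).astype(float)

def components_for(X, frac=0.9):
    """Number of components needed to explain `frac` of the variance."""
    ratios = PCA().fit(X).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), frac) + 1)

print(components_for(numeric), components_for(binary))
```

On this synthetic example the binarized version needs at least as many components as the numeric one to cover 90% of the variance, which mirrors the weaker reduction observed in the question.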

Python 3 (Part 5): Unsupervised Learning

拟墨画扇 submitted on 2019-12-06 16:45:46
Unsupervised Learning

Contents:
1 About machine learning
2 Standard datasets in the sklearn library and its basic functionality
2.1 Standard datasets
2.2 Basic functionality of the sklearn library
3 About unsupervised learning
4 The K-means method and its applications
5 The DBSCAN method and its applications
6 The PCA method and its applications
7 The NMF method, with an example
8 Clustering-based "image segmentation"

1 About machine learning

Machine learning is the means by which artificial intelligence is realized; its main research topic is how to learn from data or experience in order to improve the performance of specific algorithms. It is a multidisciplinary field, drawing on probability theory, statistics, algorithmic complexity theory, and other subjects, and it is widely applied to web search, spam filtering, recommender systems, ad placement, credit scoring, fraud detection, stock trading, medical diagnosis, and more.

Categories of machine learning:
Supervised learning: learn a function from a given dataset so that when new data arrives, the result can be predicted from this function; the training set is usually labeled by hand.
Unsupervised learning: in contrast to supervised learning, there are no human-provided labels.
Reinforcement learning: learn which actions yield the best reward through observation; every action affects the environment, and the learner judges by observing its surroundings.
Semi-supervised learning:
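As a concrete taste of sections 4 and 6 of the outline, the following sketch chains the two methods on the iris dataset: PCA for dimensionality reduction, then K-means clustering on the projected points:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Section 6: PCA reduces the 4 iris features to 2 components.
X2 = PCA(n_components=2).fit_transform(X)

# Section 4: K-means groups the projected points into 3 clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
print(X2.shape, len(set(labels)))  # (150, 2) 3
```

Both steps are unsupervised: neither PCA nor K-means ever sees the species labels.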

Obtain unstandardized factor scores from factor analysis in R

﹥>﹥吖頭↗ submitted on 2019-12-06 16:19:51
I'm conducting a factor analysis of several variables in R using factanal() (but am open to using other packages). I want to determine each case's factor score, but I want the factor scores to be unstandardized and on the original metric of the input variables. When I run the factor analysis and obtain the factor scores, they are standardized to a normal distribution with mean = 0 and SD = 1, and are not on the original metric of the input variables. How can I obtain unstandardized factor scores that have the same metric as the input variables? Ideally, this would mean a similar mean, sd, range, and
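There is no single canonical "unstandardized factor score", but one common workaround is to rescale the standardized scores to the mean and spread of a weighted composite of the raw variables. The sketch below illustrates the idea with scikit-learn's FactorAnalysis and the iris data; it is an approximation of the approach, not a factanal() equivalent:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data  # stand-in for the original-metric variables

fa = FactorAnalysis(n_components=1, random_state=0)
z = fa.fit_transform(X)  # factor scores, centered around mean 0

# Workaround: rescale the standardized scores to match the mean and
# standard deviation of a loading-weighted composite of the raw variables.
w = np.abs(fa.components_[0])
w = w / w.sum()
composite = X @ w  # weighted sum, on the original metric
scores = z[:, 0] * composite.std() + composite.mean()
print(round(scores.mean(), 2))
```

The same two lines of rescaling translate directly to R, applied to the scores returned by factanal(..., scores = "regression").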

Does partial_fit run in parallel in sklearn.decomposition.IncrementalPCA?

冷暖自知 submitted on 2019-12-06 15:46:12
I've followed Imanol Luengo's answer to build a partial fit and transform for sklearn.decomposition.IncrementalPCA. But for some reason, it looks like (from htop) it uses all CPU cores at maximum. I could find neither an n_jobs parameter nor anything related to multiprocessing. My question is: if this is the default behavior of these functions, how can I set the number of CPUs, and where can I find information about it? If not, I am obviously doing something wrong in an earlier section of my code. PS: I need to limit the number of CPU cores because using all cores on a server causes a lot of trouble
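The parallelism here usually comes from the BLAS/OpenMP libraries underneath NumPy rather than from scikit-learn itself, which is why there is no n_jobs parameter on IncrementalPCA. Two ways to cap it, sketched below: environment variables set before NumPy is imported, or threadpoolctl (installed as a scikit-learn dependency) at runtime:

```python
import os
# Setting these BEFORE NumPy is imported caps the BLAS/OpenMP thread pools
# that IncrementalPCA's matrix operations fan out to.
os.environ.setdefault("OMP_NUM_THREADS", "2")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "2")
os.environ.setdefault("MKL_NUM_THREADS", "2")

import numpy as np
from sklearn.decomposition import IncrementalPCA
from threadpoolctl import threadpool_limits  # ships as a sklearn dependency

X = np.random.default_rng(0).standard_normal((2000, 50))
ipca = IncrementalPCA(n_components=5, batch_size=500)

# Alternatively, cap the pools for just this region at runtime.
with threadpool_limits(limits=2):
    for start in range(0, X.shape[0], 500):
        ipca.partial_fit(X[start:start + 500])

print(ipca.components_.shape)  # (5, 50)
```

The environment-variable route affects the whole process; `threadpool_limits` scopes the cap to one block of code, which is usually the kinder option on a shared server.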