PCA

Principal component analysis (PCA) of time series data: spatial and temporal pattern

Suppose I have yearly precipitation data for 100 stations from 1951 to 1980. In some papers, people apply PCA to the time series and then plot a map of the spatial loadings (with values from -1 to 1) as well as the time series of the PCs. For example, figure 6 in https://publicaciones.unirioja.es/ojs/index.php/cig/article/view/2931/2696 shows the spatial distribution of the PCs. I am using the prcomp function in R and I wonder how I can do the same thing. In other words, how can I extract the "spatial pattern" and "temporal pattern" from the results of prcomp? Thanks.

set.seed(1234)
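The R answer above is cut off, so as a rough illustration (a minimal sketch using scikit-learn on synthetic data rather than the original prcomp answer) of where the two patterns live: with rows as years and columns as stations, the loadings give the spatial pattern and the scores give the temporal pattern.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1234)
n_years, n_stations = 30, 100            # 1951-1980, 100 stations
precip = rng.gamma(shape=2.0, scale=300.0, size=(n_years, n_stations))

pca = PCA(n_components=2)
scores = pca.fit_transform(precip)       # (30, 2): temporal pattern, one series per PC
spatial_loadings = pca.components_       # (2, 100): spatial pattern, one value per station
print(spatial_loadings.shape, scores.shape)

In prcomp terms, the loadings correspond to prcomp(x)$rotation and the scores to prcomp(x)$x.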

Machine learning: implementing the PCA algorithm (PCA with Sklearn)

1. PCA with Sklearn

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

# project the 64-dimensional digits onto 2 principal components and plot by class
pca = PCA(n_components=2)
pca.fit(X)
X_reduction = pca.transform(X)
for i in range(10):
    plt.scatter(X_reduction[y == i, 0], X_reduction[y == i, 1], alpha=0.8)
plt.show()
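A small follow-up (not part of the original post, and it assumes the pca fitted in the snippet above): before relying on the 2-D picture it is worth checking how much variance the two components actually retain.

print(pca.explained_variance_ratio_)         # variance ratio of each of the 2 components
print(pca.explained_variance_ratio_.sum())   # total variance kept by the 2-D projection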

Test significance of clusters on a PCA plot

Is it possible to test the significance of clustering between 2 known groups on a PCA plot? That is, to test how close they are, the amount of spread (variance), the amount of overlap between clusters, and so on.

You could use a PERMANOVA to partition the Euclidean distance by your groups:

data(iris)
require(vegan)

# PCA
iris_c <- scale(iris[, 1:4])
pca <- rda(iris_c)

# plot
plot(pca, type = 'n', display = 'sites')
cols <- c('red', 'blue', 'green')
points(pca, display = 'sites', col = cols[iris$Species], pch = 16)
ordihull(pca, groups = iris$Species)
ordispider(pca, groups = iris$Species, label = TRUE)

#
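The vegan call that actually runs the PERMANOVA is cut off above. As a rough, language-swapped sketch of the same idea (Python with scikit-learn instead of vegan): permute the group labels and compare a pseudo-F statistic built from within-group versus total sums of squared Euclidean distances.

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import scale

def pseudo_f(X, groups):
    # between-group vs. within-group sums of squares, PERMANOVA-style pseudo-F
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    within_ss = sum(((X[groups == g] - X[groups == g].mean(axis=0)) ** 2).sum()
                    for g in np.unique(groups))
    k, n = len(np.unique(groups)), len(X)
    return ((total_ss - within_ss) / (k - 1)) / (within_ss / (n - k))

iris = datasets.load_iris()
X = scale(iris.data)
y = iris.target

rng = np.random.default_rng(0)
observed = pseudo_f(X, y)
perms = [pseudo_f(X, rng.permutation(y)) for _ in range(999)]
p_value = (1 + sum(f >= observed for f in perms)) / (1 + len(perms))
print(observed, p_value)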

How to find the closest 2 points in a 100 dimensional space with 500,000 points?

I have a database with 500,000 points in a 100-dimensional space, and I want to find the closest 2 points. How do I do it? Update: the space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.

You could try the ANN library, but that only gives reliable results up to 20 dimensions.

Nikita Rybak: There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n*log n) time. You can check it out on Google Books. In fact, I suggest it for everyone, as the way they apply the divide-and-conquer technique to this problem is very
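Neither answer above is shown in full. As a hedged illustration of one practical route (scikit-learn's NearestNeighbors rather than the ANN library or the divide-and-conquer algorithm mentioned above, and a smaller synthetic data set so the demo finishes quickly): query every point for its nearest other point, then take the overall minimum.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
points = rng.random((50_000, 100))           # stand-in for the 500,000-point database

# exact trees degrade in very high dimensions, so brute force is a realistic choice at d=100
nn = NearestNeighbors(n_neighbors=2, algorithm='brute').fit(points)
dist, idx = nn.kneighbors(points)            # column 0 is each point itself

closest = np.argmin(dist[:, 1])
print("closest pair:", closest, idx[closest, 1], "distance:", dist[closest, 1])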

R - how to make PCA biplot more readable

I have a set of observations with 23 variables. When I use prcomp and biplot to plot the results I run into several problems:

1. the actual plot only occupies half of the frame (x < 0), but the plot is centered on 0, so half of the space is wasted
2. two variables clearly dominate the results, so all the other arrows are clumped together and I can't read a thing

Regarding 1, I tried setting xlim and/or ylim, but I'm obviously doing something wrong since the plot is all messed up when I do. Regarding 2, can I somehow place the arrow labels further apart so that I can read them? Or maybe I could just plot the
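The answer is cut off above. As an illustrative sketch of the two usual fixes, using matplotlib and scikit-learn rather than R's biplot (iris stands in for the 23-variable data): take the axis limits from the data so no half-frame is wasted, and annotate each arrow with an offset text label so the variable names do not pile up on the arrow tips.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

iris = load_iris()
X = scale(iris.data)
pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)
loadings = pca.components_.T                    # one row per original variable

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)
for (x, y), name in zip(loadings * 3, iris.feature_names):   # scale arrows to taste
    ax.arrow(0, 0, x, y, color='red', head_width=0.05)
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

# limits follow the data instead of being forced symmetric around 0
ax.set_xlim(scores[:, 0].min() - 0.5, scores[:, 0].max() + 0.5)
ax.set_ylim(scores[:, 1].min() - 0.5, scores[:, 1].max() + 0.5)
plt.show()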

How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n, where m is the number of features and n the number of samples. Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it this way:

from sklearn.decomposition import PCA

nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)

Now I get a new matrix X_new with a shape of n x nf. Is it possible to know which features have been discarded, and which have been retained? Thanks.

The features that your
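The answer above is truncated, but the short version is that PCA does not keep or discard original features: each component is a weighted mix of all of them. A small sketch (synthetic data standing in for the real n x m matrix) of how to read the loadings to see which original features dominate:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # stand-in for the real data
pca = PCA(n_components=10).fit(X)

loadings = np.abs(pca.components_)             # (n_components, m_features)
top_feature_per_component = loadings.argmax(axis=1)   # most influential feature per PC
overall_importance = loadings.sum(axis=0)              # rough per-feature contribution
print(top_feature_per_component)
print(np.argsort(overall_importance)[::-1][:10])       # 10 most influential features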

PCA projection and reconstruction in scikit-learn

I can perform PCA in scikit-learn with the code below (X_train has 279180 rows and 104 columns):

from sklearn.decomposition import PCA
pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)

Now, when I want to project the eigenvectors onto feature space, I do the following:

# Projection
comp = pca.components_                  # 30x104
com_tr = np.transpose(pca.components_)  # 104x30
proj = np.dot(X_train, com_tr)          # 279180x104 * 104x30 = 279180x30

But I am hesitant about this step, because the scikit-learn documentation says:

components_: array, [n_components, n_features]
Principal axes in feature space, representing
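A sketch of why the manual projection above may not match pca.transform: scikit-learn centres the data by pca.mean_ before projecting, so either subtract the mean yourself or just use transform and inverse_transform (synthetic data stands in for X_train here).

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 104))

pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)                    # (1000, 30)

manual_proj = (X_train - pca.mean_) @ pca.components_.T     # same as pca.transform(X_train)
print(np.allclose(manual_proj, X_train_pca))                # True

X_reconstructed = pca.inverse_transform(X_train_pca)        # back to (1000, 104)
manual_rec = X_train_pca @ pca.components_ + pca.mean_
print(np.allclose(X_reconstructed, manual_rec))             # True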

Obtain eigen values and vectors from sklearn PCA

How can I get the eigenvalues and eigenvectors of the PCA application?

from sklearn.decomposition import PCA
clf = PCA(0.98, whiten=True)      # keep 98% of the variance
X_train = clf.fit_transform(X_train)
X_test = clf.transform(X_test)

I can't find it in the docs, and I am not able to reconcile the different results here.

Edit:

def pca_code(data):
    # raw implementation
    var_per = .98
    data -= np.mean(data, axis=0)
    data /= np.std(data, axis=0)
    cov_mat = np.cov(data, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_mat)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:, idx]
    evals = evals[idx]
    variance_retained = np.cumsum(evals) / np.sum(evals)
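A sketch (synthetic data, not the original poster's X_train) relating the scikit-learn attributes to the raw eigendecomposition: pca.explained_variance_ holds the eigenvalues and pca.components_ the eigenvectors (as rows, up to sign), provided both computations see the same data. Note that the raw implementation above standardises the data while the PCA(0.98, whiten=True) call does not, which is one reason the results can differ.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X_std = StandardScaler().fit_transform(X)          # standardise, as in pca_code above

pca = PCA().fit(X_std)
evals, evecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(evals)[::-1]

print(np.allclose(pca.explained_variance_, evals[order]))                  # eigenvalues
print(np.allclose(np.abs(pca.components_), np.abs(evecs[:, order].T)))     # eigenvectors (up to sign)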

Data reduction

Data reduction

Complex data analysis and mining on large data sets takes a long time. Data reduction produces a new, smaller data set that preserves the integrity of the original data, so analysis and mining on the reduced data set is more efficient.

Why data reduction matters:

- it reduces the impact of invalid or erroneous data on modelling and improves modelling accuracy
- a small but representative data set greatly shortens the time needed for data mining
- it lowers the cost of storing the data

Attribute reduction

Attribute reduction reduces the number of data dimensions, either by merging attributes into new ones or by directly deleting irrelevant attributes (dimensions), which improves the efficiency of data mining and lowers the computational cost. The goal of attribute reduction is to find the smallest attribute subset such that the probability distribution of the new data subset is as close as possible to that of the original data set. Common attribute reduction methods:

- Principal component analysis: explains most of the variability in the original data with fewer variables, i.e. transforms many highly correlated variables into variables that are mutually independent or uncorrelated.
- Decision tree induction: classifies the initial data with a decision tree; any attribute that does not appear in the tree can be regarded as irrelevant, so deleting those attributes from the initial set yields a better attribute subset. Example initial attribute set: {A1, A2, A3, A4, A5, A6}.
- Attribute merging: merges several old attributes into a new attribute. Example: initial attribute set {A1, A2, A3, A4, B1, B2, B3, C}; {A1, A2, A3, A4} → A, {B1, B2, B3} → B; reduced attribute set: {A, B, C}.
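As an illustration of the principal component analysis method above, a minimal scikit-learn sketch on synthetic data: eight correlated attributes are replaced by the few uncorrelated components needed to keep 95% of the variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# 8 observed attributes that are noisy mixtures of 3 underlying factors
X = base @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(300, 8))

pca = PCA(n_components=0.95)          # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)            # e.g. (300, 8) -> (300, 3)
print(pca.explained_variance_ratio_.cumsum())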

A hands-on introduction to feature engineering: comprehensive notes, with code download

🙊 Feature engineering is said to be the most important and most time-consuming part of building machine learning models, and it touches on a very wide range of topics. Experienced practitioners handle it with ease, but newcomers tend to pick up the material in bits and pieces without any system. This article is a set of notes written after reading the highly rated feature engineering book 📚 《特征工程入门与实践》; it records the material in a relatively systematic way, together with runnable, reproducible code, and I hope it is useful to fellow practitioners.

Figure: this book is strongly recommended

🚗 Contents

🔍 Feature understanding
🔋 Feature enhancement
🔨 Feature construction
✅ Feature selection
💫 Feature transformation
📖 Feature learning

You can start with the mind map.

🔍 01 Feature understanding

When we get hold of a data set, the first step is to understand it. We can usually approach this from the angles below. (Note: this section uses two data sets, Salary_Ranges_by_Job_Classification and GlobalLandTemperaturesByCity.)

1. Distinguish structured from unstructured data
Data stored in tabular form is structured data; unstructured data is simply a pile of data, such as text, messages, or logs.

2. Distinguish quantitative from qualitative data
Quantitative data: numeric values used to measure the quantity of something.
Qualitative data: categories used to describe the nature of something.
Having separated quantitative and qualitative data, we can subdivide further into nominal, ordinal, interval, and ratio data
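A minimal pandas sketch of step 2 (the DataFrame below is a made-up stand-in, not the Salary_Ranges_by_Job_Classification data used in the book): dtypes and distinct-value counts are a quick first pass at separating quantitative from qualitative columns.

import pandas as pd

df = pd.DataFrame({
    "grade": ["A", "B", "A", "C"],                            # qualitative (nominal/ordinal)
    "biweekly_high_rate": [2000.0, 1500.0, 1800.0, 1200.0],   # quantitative (ratio)
    "union_code": [990, 21, 990, 351],                        # numeric-looking but really categorical
})

print(df.dtypes)      # object columns are usually qualitative
print(df.nunique())   # few distinct values hints at a category
df["union_code"] = df["union_code"].astype("category")        # treat as qualitative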