k-means

Inconsistent results with KMeans between Apache Spark and scikit_learn

大憨熊 posted on 2019-12-05 17:31:45
I am performing clustering on a dataset using PySpark. To find the number of clusters, I ran the clustering over a range of values of k (2, 20) and computed the WSSE (within-cluster sum of squares) for each value of k. This is where I found something unusual. As I understand it, the WSSE should decrease monotonically as the number of clusters increases, but the results I got say otherwise. I'm displaying the WSSE for the first few values of k only.

Results from Spark:

For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
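For reference, the quantity in question and the reason it is expected to shrink can be written out explicitly. This is a general property of the k-means objective rather than anything specific to Spark or scikit-learn; the symbols C_j (the set of points in cluster j) and \mu_j (its centroid) are notation introduced here for illustration.

```latex
\mathrm{WSSE}(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x

\min_{C_1, \dots, C_{k+1}} \mathrm{WSSE}(k+1) \;\le\; \min_{C_1, \dots, C_k} \mathrm{WSSE}(k)
```

The inequality holds because, starting from the best clustering with k clusters, any single point can be moved into a new cluster of its own: its term drops to zero, and the cluster it left does not get worse once its centroid is recomputed. Note that this statement is about the best possible clustering for each k. Lloyd-style k-means iterations start from randomized initial centers and stop at a local optimum, so the WSSE reported by one run with k+1 clusters is not guaranteed to be below the WSSE reported by another run with k clusters.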

Spherical k-means implementation in Python

十年热恋 posted on 2019-12-05 16:51:22
Question: I've been using scipy's k-means for quite some time now, and I'm pretty happy with how it works in terms of usability and efficiency. However, I now want to explore different k-means variants; more specifically, I'd like to apply spherical k-means to some of my problems. Do you know of any good Python implementation of spherical k-means (i.e. similar to scipy's k-means)? If not, how hard would it be to modify scipy's source code to make its k-means algorithm spherical? Thank you.

Answer 1:
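The answer text is not part of this excerpt. Independently of it, the core idea of spherical k-means can be sketched in a few lines: project every vector onto the unit sphere, assign by largest cosine similarity instead of smallest Euclidean distance, and re-normalize each centroid after averaging. The sketch below is plain Python for illustration only; the function names, the random initialization from data points, and the fixed iteration count are assumptions, and this is not scipy's implementation.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(points, k, iterations=20, seed=0):
    """Cluster points by direction. Returns (unit-length centers, labels)."""
    rng = random.Random(seed)
    data = [normalize(p) for p in points]
    # Assumed initialization: k distinct data points chosen at random.
    centers = [list(c) for c in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assignment step: pick the center with the largest cosine similarity.
        labels = [max(range(k), key=lambda j: dot(p, centers[j])) for p in data]
        # Update step: sum the members, then project back onto the unit sphere.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = normalize([sum(col) for col in zip(*members)])
    return centers, labels
```

Because only directions matter, points such as (1, 0) and (5, -0.2) end up in the same cluster even though their lengths differ considerably.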

Machine Learning - Algorithms - Clustering - The K-MEANS Algorithm

坚强是说给别人听的谎言 posted on 2019-12-05 13:36:20
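Clustering algorithms

Overview
- Unsupervised problem: there are no labels at hand.
- Clustering: group similar things together.
- Difficulties: how to evaluate the result, and how to tune the parameters.

Basic concepts
- Number of clusters to obtain: the value K has to be specified.
- Centroid: the mean, i.e. the average of the vectors taken dimension by dimension.
- Distance measure: Euclidean distance and cosine similarity are commonly used (standardize the data first).
- Optimization objective: for every cluster, make the sum of the distances from its points to the centroid as small as possible, which yields the optimal clustering.

K-MEANS algorithm workflow
- (a) The initial plot.
- (b) Once K has been specified, K points are initialized in the plot as random centroids. Here K is 2, giving a red point and a blue point.
- (c) For every point in the plot, compute its distance to the red point and to the blue point, and assign it to whichever is closer.
- (d) Recompute the centroids of the red and blue regions.
- (e) Using the recomputed centroids, go through all points again and reassign them by comparing their distances to the two new centroids.
- (f) Update the centroids again in the same way.
Keep updating like this until no sample point changes its assignment any more; at that point the partitioning is complete.

Advantages
- Simple and fast; suitable for ordinary datasets.

Disadvantages
- The value of K is hard to determine.
- The complexity grows linearly with the number of samples.
- Clusters of arbitrary shape are hard to find (illustrated by a figure in the source).

Source: https://www.cnblogs.com/shijieli/p/11925823.html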
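Steps (b) through (f) of the workflow can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code from the source article: the function names are made up for this example, the initial centroids are drawn at random from the data, squared Euclidean distance is used, and the loop stops once no point changes cluster.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid of a group: the average taken dimension by dimension."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # (b) pick K random data points as the initial centroids
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # (c)/(e) assign every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the partition is stable
        labels = new_labels
        # (d)/(f) recompute each centroid from its current members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels
```

With two well-separated groups such as (0, 0), (1, 0), (2, 0) and (100, 0), (101, 0), (102, 0), calling kmeans(points, k=2) separates the groups and returns their means as centroids. On less clear-cut data the result depends on the random initialization.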

Trouble with scipy kmeans and kmeans2 clustering in Python

混江龙づ霸主 posted on 2019-12-05 12:33:20
I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points. I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs kmeans2. Can you explain why this is? My code is below. First I load my data and plot the coordinates. It all looks correct.

    import pandas as pd, numpy as np, matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans, kmeans2, whiten

    df = pd.read_csv('data.csv')
    df.head()
    coordinates = df.as_matrix(columns=['lon', 'lat'])
    plt.figure(figsize=(10, 6), dpi=100)
    plt.scatter

k-means in practice: RFM customer value segmentation

不问归期 posted on 2019-12-05 12:03:49
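The top ten data mining algorithms

Basic concepts
Import the dataset into a MySQL database. There are 940 independent consumption records in total.

The K-Means algorithm
K-Means is a clustering algorithm. You can think of it this way: in the end, I want to divide the objects into K classes. Suppose every class has a "center point", an opinion leader so to speak, which is the core of that class. Now I have a new point to classify. All I need to do is compute the distance between this new point and each of the K center points; the point joins the class of whichever center is nearest.

Import the modules:

    import pandas as pd
    import numpy as np
    from sklearn.cluster import KMeans
    import pymysql

Connect to the database:

    conn = pymysql.connect(host='localhost',user='root',password='123',db='db2',port=3306)
    rfm = pd.read_sql('select * from consumption_data',con=conn)
    conn.close()

View the details:

    rfm.info()
    rfm.head()
    # Select the three RFM columns
    new_rfm = rfm.loc[:,['R','F','M']]
    # Call the KMeans algorithm to cluster, set to 8 classes
    clf = KMeans(n_clusters=8,random_state=0)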

Implementation of k-means clustering algorithm

独自空忆成欢 posted on 2019-12-05 10:11:44
Question: In my program, I'm taking k=2 for the k-means algorithm, i.e. I want only 2 clusters. I have implemented it in a very simple and straightforward way, but I still can't understand why my program gets into an infinite loop. Can anyone please point out where I'm making a mistake? For simplicity, I have hard-coded the input in the program itself. Here is my code:

    import java.io.*;
    import java.lang.*;

    class Kmean {
        public static void main(String args[]) {
            int N=9;
            int arr[]={2,4,10,12,3,20,30,11,25};
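The posted Java code is cut off, so the cause of the infinite loop cannot be determined from this excerpt. As a point of comparison, here is a minimal sketch (in Python rather than Java) of a one-dimensional, two-cluster loop that is certain to terminate: it stops as soon as the two clusters stop changing, and it also has a hard iteration cap. Starting from the first two input values as initial means is an assumption made for this example, as are the function and variable names.

```python
def kmeans_1d_two_clusters(values, max_iter=100):
    """Split numbers into two clusters. Returns (mean1, mean2, cluster1, cluster2)."""
    # Assumed initialization: the first two values serve as the initial means.
    m1, m2 = float(values[0]), float(values[1])
    previous = None
    cluster1, cluster2 = [], []
    for _ in range(max_iter):  # hard cap so the loop can never run forever
        # Assign each value to the nearer mean (ties go to cluster 1).
        cluster1 = [v for v in values if abs(v - m1) <= abs(v - m2)]
        cluster2 = [v for v in values if abs(v - m1) > abs(v - m2)]
        if (cluster1, cluster2) == previous:
            break  # assignments are unchanged, so the means will not move
        previous = (cluster1, cluster2)
        # Recompute the means with floating-point division.
        if cluster1:
            m1 = sum(cluster1) / len(cluster1)
        if cluster2:
            m2 = sum(cluster2) / len(cluster2)
    return m1, m2, cluster1, cluster2


arr = [2, 4, 10, 12, 3, 20, 30, 11, 25]  # the input values from the question
m1, m2, c1, c2 = kmeans_1d_two_clusters(arr)
print(m1, m2)  # 7.0 25.0
print(c1, c2)  # [2, 4, 10, 12, 3, 11] [20, 30, 25]
```

Comparing cluster membership rather than testing the means for equality avoids a stopping test that depends on rounding behavior.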

MNIST | Handwritten digit (0-9) recognition based on k-means and KNN

匆匆过客 posted on 2019-12-05 09:03:48
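MNIST | Handwritten digit (0-9) recognition based on k-means and KNN

1 Background; 2 Algorithm principle; 3 Code implementation; 3.1 File layout; 3.2 Core code; 4 Experiments and result analysis; 5 Afterword

Abstract: This experiment builds on the experiments "kaggle | Voice gender recognition based on k-means and KNN", "MNIST | Handwritten digit (0-9) recognition based on a naive Bayes classifier", and "Algorithms | k-means clustering", and applies k-means clustering and KNN recognition to the handwritten digit recognition problem. For background on the MNIST dataset and on kmeans+KNN, you may want to read my three posts above first. The code for this experiment is again in MATLAB.

Keywords: handwritten digit recognition; k-means; KNN; MATLAB; machine learning

1 Background
In the post before my previous one, I mentioned that I would apply the kmeans clustering algorithm to concrete problems such as voice gender recognition and 0-9 handwritten digit recognition. The voice gender recognition experiment was completed on November 2, and now it is time to deliver on the 0-9 handwritten digit recognition. Since this post follows on from several of my earlier posts, and the principles and usage of the MNIST dataset, kmeans, and KNN have already been covered there, they are not explained again here.

2 Algorithm principle
The approach of this experiment can be summarized as follows:
S1: During training, cluster the training data for each of the digits 0-9 into k clusters, giving 10k cluster centers in total (10 × k);
S2: During validation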

OpenCV4Android Kmean doesn't work as expected

空扰寡人 posted on 2019-12-05 06:43:03
Question: This code should give a centers Mat with 3 rows and clusterCount columns:

    Mat reshaped_image = imageMat.reshape(1, imageMat.cols()*imageMat.rows());
    Mat reshaped_image32f = new Mat();
    reshaped_image.convertTo(reshaped_image32f, CvType.CV_32F, 1.0 / 255.0);
    Mat labels = new Mat();
    TermCriteria criteria = new TermCriteria(TermCriteria.COUNT, 100, 1);
    Mat centers = new Mat();
    int clusterCount = 5, attempts = 1;
    Core.kmeans(reshaped_image32f, clusterCount, labels, criteria, attempts, Core

K means finding elbow when the elbow plot is a smooth curve

て烟熏妆下的殇ゞ posted on 2019-12-05 05:35:33
I am trying to plot the elbow of k-means using the code below:

    load CSDmat %mydata
    for k = 2:20
        opts = statset('MaxIter', 500, 'Display', 'off');
        [IDX1,C1,sumd1,D1] = kmeans(CSDmat,k,'Replicates',5,'options',opts,'distance','correlation'); % kmeans matlab
        [yy,ii] = min(D1'); %% assign points to nearest center
        distort = 0;
        distort_across = 0;
        clear clusts;
        for nn=1:k
            I = find(ii==nn); %% indices of points in cluster nn
            J = find(ii~=nn); %% indices of points not in cluster nn
            clusts{nn} = I; %% save into clusts cell array
            if (length(I)>0)
                mu(nn,:) = mean(CSDmat(I,:)); %% update mean
                %% Compute
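When the curve is smooth and no elbow stands out visually, one common heuristic (not taken from the post above, and only one of several possible choices) is to rescale both axes to [0, 1], draw a straight line between the first and last points of the curve, and pick the k whose point lies farthest from that line. The sketch below is in Python rather than MATLAB; the function name is hypothetical, and it assumes the distortion values decrease overall from the first k to the last.

```python
import math


def elbow_by_chord_distance(ks, distortions):
    """Return the k whose (k, distortion) point is farthest from the end-to-end chord.

    Assumes distortions[0] > distortions[-1] and at least two distinct k values.
    """
    x_span = ks[-1] - ks[0]
    y_span = distortions[0] - distortions[-1]
    # Rescale both axes to [0, 1] so that neither one dominates the distance.
    xs = [(k - ks[0]) / x_span for k in ks]
    ys = [(d - distortions[-1]) / y_span for d in distortions]
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. x + y - 1 = 0.
    distances = [abs(x + y - 1) / math.sqrt(2) for x, y in zip(xs, ys)]
    best = max(range(len(ks)), key=lambda i: distances[i])
    return ks[best]


# Made-up distortion values for k = 2..8, for illustration only.
ks = list(range(2, 9))
distortions = [100, 60, 35, 30, 27, 25, 24]
print(elbow_by_chord_distance(ks, distortions))  # 4
```

If the curve is close to a straight line, all distances are near zero and the selected k carries little meaning, so it is worth inspecting the distances themselves before trusting the result.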

Extract black objects from color background

眉间皱痕 posted on 2019-12-05 05:24:48
It is easy for human eyes to tell black from other colors. But how about computers? I printed some color blocks on normal A4 paper. Since three kinds of ink (cyan, magenta, and yellow) are used to compose a color image, I set the color of each block to C=20%, C=30%, C=40%, and C=50%, with the other two colors at 0. That is the first column of my source image. So far, no black (K of CMYK) ink is supposed to be printed. After that, I set the color of each dot to K=100%, with the remaining colors at 0, to print black dots. You may feel my image is weird and awful. In fact, the image is magnified 30 times and how the ink