k-means

ML Deliberate Practice, Week 6: K-means

|▌冷眼眸甩不掉的悲伤 Submitted on 2019-12-20 16:52:42
The k-means algorithm is one of the classic unsupervised learning methods, and also one of the simplest. It requires choosing a distance metric to measure how far apart data points are; in this article we use Euclidean distance.

I. The k-means clustering algorithm

1. Support functions

import numpy as np

def loadDataSet(fileName):
    """
    Load the data.
    Parameters:
        fileName - file name
    Returns:
        dataMat - data matrix
    """
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # convert to float
        dataMat.append(fltLine)
    return np.array(dataMat)

def distEclud(vecA, vecB):
    """
    Compute the Euclidean distance.
    Parameters:
        vecA, vecB - feature vectors of two data points
    Returns:
        the Euclidean distance
    """
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))
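The post breaks off after the support functions; a minimal sketch of the main k-means loop built on top of them might look like the following (my completion in the post's style, not the original author's code; randCent is a hypothetical helper for random centroid initialization).

def randCent(dataSet, k):
    """Pick k random centroids within the range of each feature."""
    n = dataSet.shape[1]
    centroids = np.zeros((k, n))
    for j in range(n):
        minJ, maxJ = dataSet[:, j].min(), dataSet[:, j].max()
        centroids[:, j] = minJ + (maxJ - minJ) * np.random.rand(k)
    return centroids

def kMeans(dataSet, k):
    """Plain Lloyd iteration: assign points, then move centroids, until stable."""
    m = dataSet.shape[0]
    centroids = randCent(dataSet, k)
    assignments = np.zeros(m, dtype=int)
    changed = True
    while changed:
        changed = False
        for i in range(m):  # assign each point to its nearest centroid
            dists = [distEclud(centroids[j], dataSet[i]) for j in range(k)]
            nearest = int(np.argmin(dists))
            if assignments[i] != nearest:
                assignments[i], changed = nearest, True
        for j in range(k):  # move each centroid to the mean of its assigned points
            pts = dataSet[assignments == j]
            if len(pts) > 0:
                centroids[j] = pts.mean(axis=0)
    return centroids, assignments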

How would I implement k-means with TensorFlow?

流过昼夜 Submitted on 2019-12-20 08:48:06
Question: The intro tutorial, which uses the built-in gradient descent optimizer, makes a lot of sense. However, k-means isn't just something I can plug into gradient descent. It seems like I'd have to write my own sort of optimizer, but I'm not quite sure how to do that given the TensorFlow primitives. What approach should I take? Answer 1: (Note: you can now get a more polished version of this code as a gist on GitHub.) You can definitely do it, but you need to define your own optimization criteria (for k
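One way to make this concrete without the linked gist (a sketch of a single Lloyd iteration in TensorFlow 2 eager mode; the names points, centroids, and kmeans_step are illustrative, not from the answer):

import tensorflow as tf

def kmeans_step(points, centroids):
    # points: (n, d), centroids: (k, d)
    # Squared Euclidean distance from every point to every centroid: shape (n, k).
    dists = tf.reduce_sum(tf.square(points[:, None, :] - centroids[None, :, :]), axis=-1)
    # Nearest centroid per point: shape (n,).
    assignments = tf.argmin(dists, axis=1)
    # Recompute each centroid as the mean of its assigned points
    # (an empty cluster would come back as zeros in this simple sketch).
    new_centroids = tf.math.unsorted_segment_mean(points, assignments, tf.shape(centroids)[0])
    return new_centroids, assignments

points = tf.random.normal((500, 2))
centroids = tf.gather(points, tf.random.shuffle(tf.range(500))[:3])  # random init, k = 3
for _ in range(10):
    centroids, assignments = kmeans_step(points, centroids)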

What makes the distance measure in k-medoid “better” than k-means?

落花浮王杯 Submitted on 2019-12-20 08:10:54
Question: I am reading about the difference between k-means clustering and k-medoids clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoids algorithm, instead of the more familiar sum of squared Euclidean distances used to evaluate variance in k-means. And apparently this different distance metric somehow reduces noise and outliers. I have seen this claim, but I have yet to see any good reasoning as to the mathematics behind it.
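A small numerical illustration of the intuition (my own, not from the thread): a single outlier drags the mean arbitrarily far, while the medoid must be an actual data point and so stays put.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one outlier
mean = x.mean()  # 22.0, pulled strongly toward the outlier
# The medoid is the data point minimizing the sum of distances to all others.
pairwise = np.abs(x[:, None] - x[None, :])
medoid = x[pairwise.sum(axis=1).argmin()]  # 3.0, barely affected by the outlier
print(mean, medoid)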

How to select which columns are good for visualisation in k-Means clustering algorithm?

吃可爱长大的小学妹 Submitted on 2019-12-20 06:42:07
Question: I am trying to understand which columns of a CSV file should be taken into consideration when applying k-means. In the link below, only annual income and spending score (from the Mall_Customers.csv file) are used as columns for visualisation, and not age. https://www.kaggle.com/shrutimechlearn/step-by-step-kmeans-explained-in-detail Please help. Answer 1: They have 3 features that they can use to cluster. Usually they will just take the Euclidean distance of all the features to get the
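As a sketch of what the linked kernel does (assuming the usual Mall_Customers.csv column names, which are not quoted in the question): clustering on exactly two columns keeps the result directly plottable in 2-D.

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]  # two columns -> easy 2-D scatter plot
df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

With three or more features the clustering still works; visualisation then needs a projection such as PCA down to two dimensions.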

How to reduce memory usage within Prado's k-means framework used on big data in R?

|▌冷眼眸甩不掉的悲伤 Submitted on 2019-12-20 06:17:44
Question: I am trying to validate Prado's k-means framework for clustering trading strategies based on their returns correlation matrix, as found in his paper, using R for a large number of strategies, say 1000. He tries to find the optimal k and the optimal initialization for k-means using two for loops over all possible k's and a number of initializations, i.e., k goes from 2 to N-1, where N is the number of strategies. The issue is that running k-means that many times, and especially with that many clusters, is memory
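To make the structure of the search concrete (a rough sketch in Python/scikit-learn rather than Prado's R code; returns_corr, the correlation-to-distance transform, and the silhouette criterion are assumptions based on the paper, not the asker's script):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def search_k(returns_corr, n_init=10, seed=0):
    # Prado-style observation matrix: distances derived from correlations.
    X = np.sqrt(0.5 * (1.0 - returns_corr))
    best_model, best_score = None, -1.0
    rng = np.random.RandomState(seed)
    for k in range(2, X.shape[0]):  # k from 2 to N-1
        for _ in range(n_init):  # several random initializations per k
            km = KMeans(n_clusters=k, n_init=1, random_state=rng.randint(1 << 30)).fit(X)
            score = silhouette_score(X, km.labels_)
            if score > best_score:
                best_model, best_score = km, score
    return best_model, best_score

Keeping only the best model so far, rather than all N-2 fitted models, is one obvious way to bound memory in a loop like this.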

Pyspark - ValueError: could not convert string to float / invalid literal for float()

一曲冷凌霜 Submitted on 2019-12-20 04:38:19
Question: I am trying to use data from a Spark dataframe as the input for my k-means model. However, I keep getting errors (see the section after the code). My Spark dataframe looks like this (and has around 1M rows):

ID   col1  col2  Latitude  Longitude
13   ...   ...   22.2      13.5
62   ...   ...   21.4      13.8
24   ...   ...   21.8      14.1
71   ...   ...   28.9      18.0
...  ...   ...   ....      ....

Here is my code:

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

df = spark.read.csv("file.csv")
spark_rdd = df.rdd.map
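One likely cause and fix (an assumption based on the error message, not the accepted answer): spark.read.csv reads every column as a string unless a schema is supplied or inferred, so the numeric columns must be real floats before they are assembled into feature vectors.

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# inferSchema=True makes Latitude/Longitude numeric instead of string.
df = spark.read.csv("file.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["Latitude", "Longitude"], outputCol="features")
features_df = assembler.transform(df)
model = KMeans(k=4, seed=1).fit(features_df.select("features"))  # k=4 is illustrative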

Error in package klaR kmodes: Error: Column index must be at most 5 if positive, not 6

筅森魡賤 Submitted on 2019-12-20 03:52:12
Question: Applying the klaR kmodes algorithm to the dataset below:

> summary(raw)
  CREDIT_LIMIT        CP          gender     IE_CHILD_NB  IE_TOT_DEP_NB  TOTAL_INCOME     IE_HOUSE_CHARGE   maritial
 >2000    :  612   11500 : 145   MM:  5435   0:7432       0:1446         >2000    :3524   >2000    :    2   D   : 1195
 0-500    :10458   11100 :  90   MR: 12983   1:4119       1:3748         0-500    :1503   0-500    :17146   M   :10507
 1000-1500: 2912   08830 :  71               2:5787       2:3386         1000-1500:6649   1000-1500:   44   MISS: 1446
 1500-2000: 2254   11406 :  68               3: 947       3:3740         1500-2000:4116   1500-2000:    5   Ot  : 1043
 500-1000 : 2182   35018

Spark::KMeans calls takeSample() twice?

Deadly Submitted on 2019-12-20 03:45:09
Question: I have a lot of data, and I have experimented with partitions of cardinality [20k, 200k+]. I call it like this:

from pyspark.mllib.clustering import KMeans, KMeansModel

C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)

and I see that initRandom() calls takeSample() once. Then the takeSample() implementation doesn't seem to call itself or something like that, so I would

K-Medoids / K-Means Algorithm. Data point with equal distances to two or more cluster representatives

…衆ロ難τιáo~ Submitted on 2019-12-19 11:24:13
Question: I have been researching and studying partition-based clustering algorithms like k-means and k-medoids. I have learned that k-medoids is more robust to outliers than k-means. However, I am curious what happens if, during the assignment of data points, two or more cluster representatives are at the same distance from a data point. Which cluster will you assign the data point to? Will the assignment of the data point to a cluster greatly affect the clustering results? Answer 1: To prevent
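A small illustration of one common tie-breaking convention (mine, not from the answer): with NumPy's argmin, a tie is resolved deterministically in favour of the lower cluster index.

import numpy as np

point = np.array([0.0, 0.0])
centroids = np.array([[1.0, 0.0], [-1.0, 0.0]])  # both at distance 1 from the point
dists = np.linalg.norm(centroids - point, axis=1)
print(np.argmin(dists))  # 0 -- the first of the tied centroids wins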

Spark KMeans clustering: get the number of samples assigned to a cluster

◇◆丶佛笑我妖孽 Submitted on 2019-12-19 09:09:16
Question: I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center. So I will run k-means clustering training on this set and select the cluster with the highest number of vectors assigned to it. Therefore I need to know the number of vectors assigned to each cluster after training (i.e., KMeans.run(...)). But I cannot find a way to retrieve this information from the KMeansModel result. I probably need to run predict on all training
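One workable approach along the lines the asker suggests (a sketch, not the accepted answer; vectors_rdd is an assumed RDD of the training vectors):

from pyspark.mllib.clustering import KMeans

model = KMeans.train(vectors_rdd, k=10, maxIterations=20)  # k=10 is illustrative
# Predict a cluster for every training vector and count occurrences per cluster.
counts = model.predict(vectors_rdd).countByValue()
best_cluster = max(counts, key=counts.get)  # the cluster with the most assigned vectors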