k-means

The K-Means Algorithm

Submitted by 有些话、适合烂在心里 on 2020-01-01 09:50:26
I. The idea of clustering

A clustering algorithm automatically partitions a set of unlabeled data into several groups; it is an unsupervised learning method. The method must ensure that data in the same group share similar features, as shown in the figure below: based on the distance between samples, i.e. their similarity (affinity), the samples that are more alike and less different are grouped into one class (cluster), eventually forming several clusters, such that samples within the same cluster are highly similar while different clusters differ strongly.

II. The k-means clustering algorithm

Key concepts:
- K value: the number of clusters to produce
- Centroid: the mean vector of each cluster, i.e. the per-dimension average of its members
- Distance measure: commonly Euclidean distance or cosine similarity (standardize the data first)

Algorithm flow:
1. First fix a value k, i.e. the number of clusters we want the dataset to be grouped into.
2. Randomly select k data points from the dataset as the initial centroids.
3. For each point in the dataset, compute its distance (e.g. Euclidean distance) to every centroid, and assign it to the set of whichever centroid is nearest.
4. Once all points have been assigned, there are k sets; then recompute the centroid of each set.
5. If the distance between each newly computed centroid and the previous one falls below a chosen threshold (meaning the recomputed centroids have barely moved and are stabilizing, i.e. converging), we can consider the clustering to have reached the desired result, and the algorithm terminates.
6. If the new centroids have moved a lot relative to the old ones, iterate steps 3-5.

III. Mathematical intuition

The heuristic K-Means uses is simple and can be illustrated with the following sequence of figures: figure (a) shows the initial dataset, with k=2 assumed. In figure (b), we randomly select the class centroids for the two clusters, namely the red centroid and the blue centroid
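The six numbered steps above can be sketched in plain Python (an illustrative toy implementation, not an optimized one; function and parameter names are my own):

```python
import random

def kmeans(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Toy k-means following the steps above: random init, assign to
    nearest centroid, recompute centroids, stop when they barely move."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # step 2: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 3: assign to nearest centroid
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):  # step 4: recompute each centroid
            if cluster:
                new_centroids.append(tuple(sum(xs) / len(xs) for xs in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1)) ** 0.5
                    for c0, c1 in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:                   # step 5: converged, stop
            break
    return centroids, clusters                  # step 6 is the loop itself
```

On two well-separated blobs this settles into one cluster per blob within a few iterations, regardless of which points the random initialization picks.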

Outlier detection with k-means algorithm

Submitted by 僤鯓⒐⒋嵵緔 on 2020-01-01 03:03:48
Question: I am hoping you can help me with my problem. I am trying to detect outliers using the k-means algorithm. First I run the algorithm and choose as possible outliers those objects that have a large distance to their cluster center. Instead of using the absolute distance I want to use the relative distance, i.e. the ratio of the object's absolute distance to the cluster center and the average distance of all objects in the cluster to their cluster center. The code for outlier detection
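The relative-distance idea described in the question can be sketched as follows (a minimal outline; the function name, the 2.0 threshold, and the input format are my own assumptions, not from the original post):

```python
import math

def relative_distance_outliers(points, labels, centroids, ratio_threshold=2.0):
    """Flag points whose distance to their own cluster center exceeds
    ratio_threshold times the cluster's average center distance."""
    # per-cluster sum and count of distances to the cluster center
    sums, counts = {}, {}
    for p, lbl in zip(points, labels):
        sums[lbl] = sums.get(lbl, 0.0) + math.dist(p, centroids[lbl])
        counts[lbl] = counts.get(lbl, 0) + 1
    avg = {lbl: sums[lbl] / counts[lbl] for lbl in sums}
    # an index is an outlier candidate if its relative distance is large
    return [i for i, (p, lbl) in enumerate(zip(points, labels))
            if avg[lbl] > 0 and math.dist(p, centroids[lbl]) / avg[lbl] > ratio_threshold]
```

A point sitting, say, at 3x the cluster's mean center distance gets flagged with the default threshold, while points near the center (ratio below 1) never are.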

How do I visualize data points of tf-idf vectors for kmeans clustering?

Submitted by 只谈情不闲聊 on 2019-12-31 10:00:28
Question: I have a list of documents and the tf-idf score for each unique word in the entire corpus. How do I visualize that on a 2-d plot to give me a gauge of how many clusters I will need to run k-means? Here is my code:

sentence_list = ["Hi how are you", "Good morning" ...]
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
num_samples, num_features = vectorized.shape
print "num_samples: %d, num_features: %d" % (num
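One common way to get such a 2-D view is to project the tf-idf matrix onto its top two singular directions (LSA/PCA-style) and scatter-plot the result. A minimal sketch with plain NumPy, assuming the sparse matrix from the question has been densified with .toarray() (the function name is my own):

```python
import numpy as np

def project_2d(tfidf_matrix):
    """Project a (documents x terms) tf-idf matrix onto its top-2
    singular directions, giving 2-D coordinates for a scatter plot."""
    X = np.asarray(tfidf_matrix, dtype=float)
    X = X - X.mean(axis=0)                  # center so SVD acts like PCA
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * s[:2]                 # one (x, y) row per document

# With the question's variables, hypothetical usage would be:
#   coords = project_2d(vectorized.toarray())
#   plt.scatter(coords[:, 0], coords[:, 1])
```

Visually separated groups in this plot give a rough prior on how many clusters to ask k-means for; for large vocabularies, sklearn's TruncatedSVD avoids densifying the sparse matrix.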

Online k-means clustering

Submitted by 丶灬走出姿态 on 2019-12-31 09:12:52
Question: Is there an online version of the k-means clustering algorithm? By online I mean that every data point is processed serially, one at a time as it enters the system, hence saving computing time when used in real time. I have written one myself with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master's thesis. Also, does anyone have advice on other online clustering algorithms? (lmgtfy failed ;))

Answer 1: Yes there is. Google
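The usual "standardized" reference here is sequential (MacQueen-style) k-means, where each incoming point moves its nearest center by a shrinking 1/n step. A minimal sketch (class and method names are my own):

```python
class OnlineKMeans:
    """Sequential k-means: each point updates its nearest center by a
    1/n step, so the stream can be processed one point at a time."""

    def __init__(self, init_centers):
        self.centers = [list(c) for c in init_centers]
        self.counts = [0] * len(init_centers)

    def update(self, x):
        # find the nearest center by squared Euclidean distance
        d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centers]
        i = d2.index(min(d2))
        self.counts[i] += 1
        eta = 1.0 / self.counts[i]          # shrinking learning rate
        # move the winning center a fraction eta toward the new point
        self.centers[i] = [c + eta * (a - c) for c, a in zip(self.centers[i], x)]
        return i                            # cluster label assigned to x
```

With the 1/n step size, each center is always exactly the running mean of the points assigned to it so far, which is what makes this the online counterpart of batch k-means.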

Using a smoother with the L Method to determine the number of K-Means clusters

Submitted by 霸气de小男生 on 2019-12-31 09:02:21
Question: Has anyone tried applying a smoother to the evaluation metric before using the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or did it allow a lower number of k-means trials and hence a much greater increase in speed? Which smoothing algorithm/method did you use? The "L-method" is detailed in: Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan. This calculates the evaluation metric
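As a sketch of the smoothing step being asked about: a simple centered moving average applied to the evaluation-metric curve before the knee-finding step (the window size here is an arbitrary choice of mine, and other smoothers such as a Savitzky-Golay filter would slot in the same way):

```python
def moving_average(values, window=3):
    """Centered moving average with shrinking windows at the edges,
    used to smooth an evaluation-metric curve before knee-finding."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out
```

Smoothing damps single-k spikes in the metric so the L-method's two fitted lines meet at a more stable knee, at the cost of slightly blurring a sharp true knee.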

TypeError: object of type 'map' has no len() Python3

Submitted by 廉价感情. on 2019-12-31 05:33:27
Question: I'm trying to implement the KMeans algorithm using Pyspark; it gives me the above error in the last line of the while loop. It works fine outside the loop, but after I created the loop it gave me this error. How do I fix this?

# Find K Means of Loudacre device status locations
#
# Input data: file(s) with device status data (delimited by '|')
# including latitude (13th field) and longitude (14th field) of device locations
# (lat,lon of 0,0 indicates unknown location)
# NOTE: Copy to pyspark using
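For context on this error: in Python 3, map returns a lazy iterator that has no length, so anything calling len() on it raises exactly this TypeError; the common fix is to materialize it with list(...). A minimal reproduction, independent of the Spark code (which is truncated here):

```python
# In Python 3, map() returns a lazy map object, not a list.
coords = map(float, ["1.5", "2.5", "3.5"])

try:
    len(coords)                 # this is what triggers the error
except TypeError as e:
    print(e)                    # object of type 'map' has no len()

# Fix: materialize the iterator before anything needs its length.
coords = list(map(float, ["1.5", "2.5", "3.5"]))
print(len(coords))              # → 3
```

In Spark code specifically, an alternative to list() is to keep the computation inside RDD operations (e.g. rdd.map(...).count()) rather than applying Python's built-in map to driver-side data.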

Error initializing SparkContext: A master URL must be set in your configuration

Submitted by 时光毁灭记忆、已成空白 on 2019-12-30 18:07:23
Question: I used this code. My error is:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/03 20:39:24 INFO SparkContext: Running Spark version 2.1.0
17/02/03 20:39:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/03 20:39:25 WARN SparkConf: Detected deprecated memory fraction settings: [spark.storage.memoryFraction]. As of Spark 1.6, execution and storage memory management are
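This error means the SparkConf was built without a master URL. One common fix, sketched here for PySpark since other posts on this page use it (the question's own code is not shown, and the app name below is a placeholder), is to set the master before creating the context:

```python
from pyspark import SparkConf, SparkContext

# A master URL must be set before the SparkContext is created.
conf = (SparkConf()
        .setAppName("KMeansExample")    # placeholder app name
        .setMaster("local[*]"))         # run locally using all available cores
sc = SparkContext(conf=conf)
```

Equivalently, the master can be supplied outside the code with spark-submit --master local[*] (or a cluster URL such as spark://host:7077), which is preferable when the same program must run both locally and on a cluster.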

How to assign a new observation to existing KMeans clusters based on nearest cluster centroid logic in python?

Submitted by 时光怂恿深爱的人放手 on 2019-12-30 11:17:08
Question: I used the code below to create k-means clusters using scikit-learn.

kmean = KMeans(n_clusters=nclusters, n_jobs=-1, random_state=2376, max_iter=1000,
               n_init=1000, algorithm='full', init='k-means++')
kmean_fit = kmean.fit(clus_data)

I also saved the centroids using kmean_fit.cluster_centers_. I then pickled the K-means object:

filename = pickle_path + '\\' + '_kmean_fit.sav'
pickle.dump(kmean_fit, open(filename, 'wb'))

So that I can load the same kmeans pickle object and apply it to new data when
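Nearest-centroid assignment is exactly what scikit-learn's KMeans.predict(new_data) does with the fitted centers, so the unpickled kmean_fit can be used directly. For illustration, the same logic with plain NumPy (the function name is my own):

```python
import numpy as np

def assign_to_nearest(centroids, new_points):
    """Give each new observation the label of its closest saved center,
    mirroring what a fitted KMeans model's predict() does."""
    C = np.asarray(centroids, dtype=float)   # (k, d) saved cluster centers
    X = np.asarray(new_points, dtype=float)  # (n, d) new observations
    # squared Euclidean distance from every point to every centroid
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                 # (n,) nearest-centroid labels
```

With the pickled model this reduces to labels = kmean_fit.predict(new_data); the NumPy version is only needed when just cluster_centers_ was saved rather than the whole estimator.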