k-means

Cluster unseen points using Spectral Clustering

Submitted on 2019-12-06 01:23:40
I am using spectral clustering to cluster my data, and the implementation seems to work properly. However, I have one problem: I have a set of unseen points (not present in the training set) and would like to cluster them based on the centroids derived by k-means (step 5 in the paper). Since the k-means runs on the k eigenvectors, the centroids are low-dimensional. Does anyone know of a method that maps an unseen point to that low-dimensional space and computes the distance between the projected point and the centroids derived by k-means in step 5? Late answer, but
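A common out-of-sample trick for this is the Nyström extension: compute the unseen point's affinities to the training points, then project that affinity vector onto the stored eigenvectors, rescaled by the eigenvalues. A minimal sketch, where the function names, the RBF affinity, and `gamma` are my own assumptions (not from the paper):

```python
import numpy as np

def project_unseen(x_new, X_train, eigvecs, eigvals, gamma=1.0):
    """Nystroem extension: approximate the spectral embedding of an
    unseen point from its affinities to the training points."""
    # RBF affinity between the new point and every training point.
    a = np.exp(-gamma * np.sum((X_train - x_new) ** 2, axis=1))
    # u_k = (a . v_k) / lambda_k for each kept eigenpair.
    return (a @ eigvecs) / eigvals

def assign_cluster(x_new, X_train, eigvecs, eigvals, centroids, gamma=1.0):
    """Embed the unseen point, then pick the nearest k-means centroid
    (the low-dimensional centroids from step 5)."""
    u = project_unseen(x_new, X_train, eigvecs, eigvals, gamma)
    return int(np.argmin(np.linalg.norm(centroids - u, axis=1)))
```

Here `eigvecs`/`eigvals` are the top-k eigenpairs of the training affinity matrix, and `centroids` are the k-means centers computed on the rows of `eigvecs`.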

My own K-means algorithm in R

Submitted by 假装没事ソ on 2019-12-06 00:29:28
I am a beginner at R programming and I am doing this exercise as an intro to programming. I have made my own k-means implementation in R, but have been stuck for a while at one point: I need to make a consensus, i.e. the algorithm should iterate until it finds the optimal center of each cluster. This is the raw algorithm without iteration; it just takes k random data points from the whole data set as centers:

Centroid_test = data[sample(nrow(data), k), ]
x = Centroid_test
y = data
m = apply(data, 1, function(data) (apply(Centroid_test, 1, function(Centroid_test, y) dist(rbind
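The missing piece is the iteration of Lloyd's algorithm: alternate between assigning each point to its nearest center and recomputing each center as the mean of its members, until the centers stop moving. A sketch of that loop, written in Python rather than R purely for illustration:

```python
import numpy as np

def kmeans(data, k, max_iter=100, tol=1e-6, seed=0):
    """Plain Lloyd's algorithm: start from k random rows of the data,
    then iterate assignment and centroid update until convergence."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.linalg.norm(new - centroids) < tol:
            break
        centroids = new
    return centroids, labels
```

The same structure carries over to R directly: wrap the assignment/update pair in a `repeat` loop and break when the centers change by less than a tolerance.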

Sklearn.KMeans : how to avoid Memory or Value Error?

Submitted by 懵懂的女人 on 2019-12-06 00:11:39
Question: I'm working on an image classification problem and I'm creating a bag-of-words model. To do that, I extracted the SIFT descriptors of all my images and I have to use the KMeans algorithm to find the centers to use as my bag of words. Here is the data I have:

Number of images: 1584
Number of SIFT descriptors (vectors of 32 elements): 571685
Number of centers: 15840

So I ran a KMeans algorithm to compute my centers:

dico = pickle.load(open('./dico.bin', 'rb'))  # np.shape(dico) = (571685, 32)
k =
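With this many descriptors and centers, one way to sidestep the memory blow-up is `sklearn.cluster.MiniBatchKMeans`, which fits on small random batches instead of holding the full problem in memory at once. A sketch on synthetic stand-in data (the matrix shape and `k` are scaled down purely for illustration; they are not the asker's real sizes):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for the 571685 x 32 SIFT descriptor matrix.
dico = np.random.RandomState(0).rand(5000, 32)

k = 64  # far fewer centers than 15840, for illustration only
km = MiniBatchKMeans(n_clusters=k, batch_size=1024, random_state=0)
km.fit(dico)

centers = km.cluster_centers_  # shape (k, 32): the visual vocabulary
```

Mini-batch k-means trades a small amount of clustering quality for a large reduction in memory and runtime, which is usually an acceptable trade for building a bag-of-words vocabulary.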

Can K-means be used to help in pixel-value based separation of an image?

Submitted by 纵饮孤独 on 2019-12-05 23:15:20
Question: I'm trying to separate a grey-level image based on pixel value: say pixels from 0 to 60 in one bin, 60 to 120 in another, 120 to 180, and so on up to 255. The ranges are roughly equispaced in this case. By using k-means clustering, however, would it be possible to get more realistic measures of what my pixel-value ranges should be? The aim is to group similar pixels together and not waste bins where there is a lower concentration of pixels. EDITS (to include obtained results): k-means with no
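Yes: running k-means on the 1-D pixel intensities yields data-driven bin edges, for instance the midpoints between neighbouring cluster centers. A small sketch on synthetic intensities (the three intensity distributions are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic grey-level pixels: three concentrations of intensity.
rng = np.random.RandomState(0)
img = np.concatenate([rng.normal(30, 5, 500),
                      rng.normal(100, 5, 300),
                      rng.normal(200, 5, 200)]).clip(0, 255)

# Cluster the 1-D intensities; centers fall where pixels concentrate.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(img.reshape(-1, 1))
centers = np.sort(km.cluster_centers_.ravel())

# Data-driven bin edges: midpoints between neighbouring centers,
# replacing the fixed 0-60-120-180-255 grid.
edges = (centers[:-1] + centers[1:]) / 2
```

Unlike the equispaced grid, the edges now adapt to where the histogram actually has mass, so sparsely populated intensity ranges do not consume whole bins.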

How to label k-means clusters in r

Submitted by 天涯浪子 on 2019-12-05 20:41:11
The wikibook on k-means clustering ( http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means ) gives an example cluster analysis. Can the code be amended so that a label is generated for each cluster? The graph below does not indicate what is being compared: there are three clusters, but what are the names of each cluster? Here is the code that generates the graph:

# import data (assume that all data in "data.txt" is stored as comma separated values)
x <- read.csv("data.txt", header=TRUE, row.names=1)
# run K-Means
km <- kmeans(x, 3, 15)
# print components of km
print(km)
#
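One general recipe is to describe each centroid and then map the integer cluster labels to those descriptions for use in a plot legend. Sketched here in Python on toy data (the data frame and the naming scheme are my own assumptions, not from the wikibook; in R the same idea applies to `km$cluster` and `km$centers`):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the CSV in the wikibook example: rows are items,
# columns are measured attributes.
X = pd.DataFrame({"height": [1, 2, 10, 11, 20, 21],
                  "weight": [1, 2, 10, 11, 20, 21]})

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Build a human-readable name per cluster from its centroid, then
# map the integer labels onto those names.
names = {i: "cluster %d (mean height %.1f)" % (i, c[0])
         for i, c in enumerate(km.cluster_centers_)}
labelled = X.assign(cluster=[names[l] for l in km.labels_])
```

The `cluster` column can then drive the legend of the scatter plot instead of anonymous cluster numbers.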

Microsoft SQL and R, stored procedure and k-means

Submitted by 五迷三道 on 2019-12-05 20:35:44
I am new here and hope to help and be helped. I am working with the new Microsoft SQL Server Management Studio (2016), using its new features that allow integration with R. My goal is to create a stored procedure that performs k-means clustering on an x and a y column. The problem is that I am stuck in the middle, because I am not able to adapt the online documentation to my case. Here is the script:

CREATE TABLE [dbo].[ModelTable] (
    column_name1 varchar(8000)
);
CREATE TABLE [dbo].[ResultTable] (
    column_name1 varchar(8000),
    column_name2 varchar(8000),
    column_name3 varchar

User Purchase Behavior Analysis (K-means)

Submitted by 冷暖自知 on 2019-12-05 20:23:46
Data source: Kaggle

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

# read the data: four tables
prior = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
aisles = pd.read_csv("aisles.csv")

# merge the four tables
_mg = pd.merge(prior, products, on="product_id")
_mg = pd.merge(_mg, orders, on="order_id")
mt = pd.merge(_mg, aisles, on="aisle_id")
mt.head()  # columns include order_id, product_id, add_to_cart_order
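The imports suggest the usual continuation of this pipeline: cross-tabulate users against aisles, compress with PCA, cluster with KMeans, and score with silhouette. A sketch on a tiny synthetic stand-in for the merged Instacart table (row counts and column names here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Tiny synthetic stand-in for the merged table: one row per purchased
# product, with the user and the aisle it came from.
rng = np.random.RandomState(0)
mt = pd.DataFrame({"user_id": rng.randint(0, 40, 2000),
                   "aisle": rng.choice(list("abcdefgh"), 2000)})

# Users x aisles purchase-count matrix.
cross = pd.crosstab(mt["user_id"], mt["aisle"])

# Compress the aisle space, cluster the users, score cluster quality.
reduced = PCA(n_components=4, random_state=0).fit_transform(cross)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(reduced)
score = silhouette_score(reduced, km.labels_)
```

The silhouette score (between -1 and 1) can then be compared across different choices of the number of clusters.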

Python scikit-learn KMeans is being killed (9) while computing silhouette score

Submitted by 末鹿安然 on 2019-12-05 19:41:04
I'm currently working on an image dataset (250,000 images, so just as many feature vectors, each composed of 132 features) and trying to use the KMeans class provided by sklearn. I run it on Mac OS X 10.10, Python 2.7 and sklearn 0.15.2, and after a while I only obtain a "Killed: 9" error when running these command lines:

nb_cls = int(raw_input("Number of clusters chosen :"))
clusterer = sklearn.cluster.KMeans(n_clusters=nb_cls)
clusters_labels = clusterer.fit_predict(X)
silhouette = sklearn.metrics.silhouette_score(X, clusters_labels)
print "n clusters =", nb_cls, "/
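The kill most likely comes from `silhouette_score`, which materialises an n x n pairwise-distance matrix: for 250,000 points that is on the order of hundreds of gigabytes. Its `sample_size` argument scores a random subsample instead, bounding the memory. A sketch on smaller synthetic data (the shapes here are stand-ins, not the asker's real dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for the 250 000 x 132 feature matrix.
X = np.random.RandomState(0).rand(10000, 10)

km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Score only a 2000-point random subsample: the pairwise-distance
# matrix is then 2000 x 2000 instead of n x n.
score = silhouette_score(X, labels, sample_size=2000, random_state=0)
```

The subsampled score is an estimate, but it is usually stable enough to compare different cluster counts.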

Clustering Time Series Data of Different Length

Submitted by 天涯浪子 on 2019-12-05 18:38:33
I have time-series data with series of different lengths. I want to cluster based on DTW distance but could not find any library for it: sklearn gives a straight error, while tslearn's k-means gave the wrong answer. My problem goes away if I pad with zeros, but I am not sure it is correct to zero-pad time-series data when clustering. Suggestions about other clustering techniques for time-series data are welcome.

max_length = 0
for i in train_1:
    if(len(i) > max_length):
        max_length = len(i)
print(max_length)
train_1 = sequence.pad_sequences(train_1, maxlen=max_length)
km3 = TimeSeriesKMeans(n
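Note that zero-padding can change DTW distances, since the padded tail still has to be matched against real values, which may explain the wrong answers. DTW itself handles unequal lengths natively; a minimal pure-Python implementation for 1-D series, usable as the distance in any custom clustering loop:

```python
import numpy as np

def dtw(a, b):
    """Dynamic-time-warping distance between two 1-D series of any
    lengths, so no zero-padding is needed."""
    n, m = len(a), len(b)
    # D[i, j] = cost of best alignment of a[:i] with b[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three neighbouring alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0, because the repeated 2 aligns to the single 2 at no cost, even though the series differ in length.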

Spark MLlib / K-Means intuition

Submitted by 寵の児 on 2019-12-05 17:50:26
I'm very new to machine learning algorithms and Spark. I'm following the Twitter Streaming Language Classifier found here: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html Specifically this code: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala Except that I'm trying to run it in batch mode on some tweets it pulls out of Cassandra, in this case 200 tweets in total. As the example shows, I am using this object for
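For intuition, the featurization step in that example (MLlib's HashingTF) is just the hashing trick: each token increments a hashed slot of a fixed-size count vector, which k-means can then cluster without ever building a vocabulary. A rough pure-Python imitation (the function name and the toy tweets are my own stand-ins, not from the Databricks code):

```python
import numpy as np
from sklearn.cluster import KMeans

def hashing_tf(text, num_features=1000):
    """Rough stand-in for Spark MLlib's HashingTF: hash each token into
    a fixed-size count vector, so no vocabulary has to be kept."""
    vec = np.zeros(num_features)
    for token in text.lower().split():
        vec[hash(token) % num_features] += 1
    return vec

# Toy bilingual "tweets"; clustering the hashed vectors groups them
# by shared vocabulary, which is the language signal the example uses.
tweets = ["hola amigos como estan", "bonjour mes amis",
          "hola hola amigos", "bonjour bonjour mes amis"]
X = np.array([hashing_tf(t) for t in tweets])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

The same idea scales to the 200 tweets from Cassandra: featurize each tweet, stack into a matrix, and fit k-means on the result.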