Today i\'m trying to learn something about K-means. I Have understand the algorithm and i know how it works. Now i\'m looking for the right k... I found the elbow criterion as a
This answer is inspired by what OmPrakash has written. This contains code to plot both the SSE and Silhouette Score. What I've given is a general code snippet you can follow through in all cases of unsupervised learning where you don't have the labels and want to know what's the optimal number of cluster. There are 2 criterion. 1) Sum of Square errors (SSE) and Silhouette Score. You can follow OmPrakash's answer for the explanation. He's done a good job at that.
Assume your dataset is a data frame df1. Here I have used a different dataset just to show how we can use both the criterion to help decide optimal number of cluster. Here I think 6 is the correct number of cluster. Then
range_n_clusters = [2, 3, 4, 5, 6,7,8]
elbow = []
ss = []
for n_clusters in range_n_clusters:
#iterating through cluster sizes
clusterer = KMeans(n_clusters = n_clusters, random_state=42)
cluster_labels = clusterer.fit_predict(df1)
#Finding the average silhouette score
silhouette_avg = silhouette_score(df1, cluster_labels)
ss.append(silhouette_avg)
print("For n_clusters =", n_clusters,"The average silhouette_score is :", silhouette_avg)`
#Finding the average SSE"
elbow.append(clusterer.inertia_) # Inertia: Sum of distances of samples to their closest cluster center
fig = plt.figure(figsize=(14,7))
fig.add_subplot(121)
plt.plot(range_n_clusters, elbow,'b-',label='Sum of squared error')
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.legend()
fig.add_subplot(122)
plt.plot(range_n_clusters, ss,'b-',label='Silhouette Score')
plt.xlabel("Number of cluster")
plt.ylabel("Silhouette Score")
plt.legend()
plt.show()