Python K-means clustering on document [closed]

Question

Python code:

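from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

subject1 = ['data mining', 'web mining', 'electronic engineering',
            'cloud computing', 'Smart Biomaterials', 'Mathematical modeling']
subject2 = ['Computer Science', 'Engineering', 'Biology']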

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                               min_df=0.2, stop_words='english',
                               use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(subject1)
print(tfidf_matrix)
km = KMeans(n_clusters=3)
km.fit(tfidf_matrix)
cen = km.cluster_centers_
label = km.labels_

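# Group the subject1 entries by the cluster k-means assigned them to
for cluster_id in range(km.n_clusters):
    members = [s for s, lab in zip(subject1, label) if lab == cluster_id]
    print(cluster_id, members)  # should be 'computer science: web mining, data mining, cloud computing'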

subject1 refers to specific areas and subject2 refers to general areas. I am trying to cluster subject1 into three clusters with K-means so that the clusters match subject2. I don't know what I am missing.

Answer 1:

It is not really clear what you want to achieve here. In order to use the k-means algorithm, you need to be clear about two basic questions:

What is your input data? The k-means algorithm usually works on only one set of data objects, while each object can be defined by multiple attributes. So you need to decide whether to cluster only subject1 or to integrate information from subject2, e.g. by adding attributes to the items from subject1.
What is your distance measure? The crucial part of k-means is finding the nearest centroids, which requires a meaningful distance measure for your data. This might be a simple character-based distance or a more specialized measure based on your data's features. The important thing is that your distance measure represents the aspects of your data that make two items similar.
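Both questions can be checked against the data from the question. The following is a minimal sketch, assuming scikit-learn is installed; the variable names `strict`, `default` and `related` are illustrative and not part of the original code. It first shows which terms survive the question's `min_df`/`max_df` settings, then lists the pairs of subjects that have any cosine similarity under plain word-level TF-IDF.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subject1 = ['data mining', 'web mining', 'electronic engineering',
            'cloud computing', 'Smart Biomaterials', 'Mathematical modeling']

# Input data: with six documents, min_df=0.2 drops every term that
# occurs in only one document, so very little vocabulary is left.
strict = TfidfVectorizer(max_df=0.8, min_df=0.2, stop_words='english')
strict.fit(subject1)
strict_terms = sorted(strict.vocabulary_)
print(strict_terms)

# Distance measure: default word-level TF-IDF keeps all terms.
# List the pairs whose cosine similarity is non-zero.
default = TfidfVectorizer()
X = default.fit_transform(subject1)
sim = cosine_similarity(X)
related = [(subject1[i], subject1[j])
           for i in range(len(subject1))
           for j in range(i + 1, len(subject1))
           if sim[i, j] > 0]
print(related)
```

Only 'mining' survives the strict filter, and 'data mining' / 'web mining' is the only pair that shares a word. Every other pair is equally far apart, so a word-based distance gives k-means nothing to group the remaining items by.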

If you want to assign certain labels to your clusters (subject2?), this would be done after running the regular k-means algorithm, e.g. by inspecting the resulting clusters.
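Inspecting the clusters after fitting could look like the sketch below. It assumes scikit-learn and plain word-level TF-IDF; `n_init` and `random_state` are fixed only to make the run repeatable, and the names `clusters` and `top_terms` are illustrative. For each cluster it prints the members and the highest-weighted centroid terms, which is the information one would use to pick a label by hand.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

subject1 = ['data mining', 'web mining', 'electronic engineering',
            'cloud computing', 'Smart Biomaterials', 'Mathematical modeling']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(subject1)

# Fixed seed and n_init are assumptions made for repeatability only.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Collect the members of each cluster.
clusters = defaultdict(list)
for name, cluster_id in zip(subject1, km.labels_):
    clusters[int(cluster_id)].append(name)

# Terms ordered by column index, so terms[j] matches centroid column j.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for cluster_id, members in sorted(clusters.items()):
    centroid = km.cluster_centers_[cluster_id]
    top_terms = [terms[j] for j in centroid.argsort()[::-1][:3]]
    print(cluster_id, members, top_terms)
```

The cluster ids themselves are arbitrary; mapping them to names such as those in subject2 remains a manual or separate step.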

This is a very general guideline of how to approach the application of this algorithm. If you provide more detailed information on what you have and what you want to achieve, we might be able to give better assistance.

Answer 2:

What you actually seem to be looking for is topic clustering or semantic mining, i.e. grouping word pairs or word groups by a common general area. So instead you may want to look into NLP areas such as topic modeling and semantic similarity.
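The shape of a similarity-based assignment can be sketched as follows: represent both lists in the same vector space and give each specific subject the general area it is most similar to. Note the swapped-in technique: character n-gram TF-IDF is used here as a purely lexical stand-in, not a semantic measure. A real semantic approach would replace the vectorizer with word or sentence embeddings from a model of your choice. The name `nearest` is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subject1 = ['data mining', 'web mining', 'electronic engineering',
            'cloud computing', 'Smart Biomaterials', 'Mathematical modeling']
subject2 = ['Computer Science', 'Engineering', 'Biology']

# Stand-in representation: character n-grams (lexical, NOT semantic).
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
vectorizer.fit(subject1 + subject2)
specific = vectorizer.transform(subject1)
general = vectorizer.transform(subject2)

# sim[i, j] is the similarity of subject1[i] to subject2[j].
sim = cosine_similarity(specific, general)

# argmax falls back to the first label when every similarity is zero.
nearest = {name: subject2[row.argmax()] for name, row in zip(subject1, sim)}
for name, area in nearest.items():
    print(f'{area}: {name}')
```

Overlapping spellings such as 'electronic engineering' and 'Engineering' are matched this way, but pairs that are related only in meaning, such as 'data mining' and 'Computer Science', are not reliably matched. That gap is what semantic similarity methods are meant to close.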

Source: https://stackoverflow.com/questions/38886584/python-k-means-clustering-on-document

Tags

python

cluster-analysis

k-means