Document Clustering Basics

Submitted by 丶灬走出姿态 on 2019-12-06 14:02:28

Question


I've been mulling over these concepts for some time, but my understanding is still very basic. Information retrieval seems to be a topic seldom covered in the wild...

My questions stem from the process of clustering documents. Let's say I start off with a collection of documents that contain only interesting words. What is the first step here? Do I parse the words from each document and build one giant 'bag-of-words' model? Do I then create a vector of word counts for each document? And how do I compare these documents using something like K-means clustering?


Answer 1:


Try TF-IDF (term frequency-inverse document frequency) weighting for starters.
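To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting plus cosine similarity for comparing two documents. It assumes whitespace tokenization and the plain textbook formula tf * log(N / df); the function names `tfidf_vectors` and `cosine_similarity` and the toy corpus are made up for illustration, and libraries such as scikit-learn use a smoothed variant of the formula, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, already reduced to "interesting" words.
docs = [
    "cat sat mat cat",
    "cat chased mouse",
    "stock market fell",
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # share "cat", so greater than zero
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms
```

A word that occurs in every document gets weight zero here, which is the point of the idf factor: it stops very common words from dominating the comparison.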
If you read Python, look at the scikit-learn example "Clustering text documents using MiniBatchKmeans", which is described as
"an example showing how the scikit-learn can be used to cluster documents by topics using a bag-of-words approach".
The file feature_extraction/text.py in the scikit-learn source also contains some very nice classes.
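
As a rough sketch of that pipeline (not the scikit-learn example itself), the following vectorizes a few documents with `TfidfVectorizer` and clusters them with `MiniBatchKMeans`. The toy corpus, the choice of `n_clusters=2`, and the fixed `random_state` are assumptions made for illustration; on a real collection the number of clusters has to be chosen for the data.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "a cat chased the mouse",
    "kittens and cats like to sleep",
    "the stock market fell sharply",
    "investors sold stock as the market dropped",
    "market analysts expect stock prices to recover",
]

# Bag-of-words with TF-IDF weights: one sparse row vector per document.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# n_clusters=2 is an assumption for this toy data.
km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
labels = km.fit_predict(X)

for label, doc in zip(labels, docs):
    print(label, doc)
```

By default `TfidfVectorizer` scales each row to unit length, so the Euclidean distances that K-means minimizes rank documents the same way cosine similarity does.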



Source: https://stackoverflow.com/questions/8057442/document-clustering-basics
