scikit-learn: how to know which documents are in each cluster?

没有蜡笔的小新 2020-12-28 10:58

I am new to both python and scikit-learn so please bear with me.

I took the source code for the k-means clustering algorithm from k means clustering.

I then modif…

2 Answers
  • 2020-12-28 11:37

    dataset.filenames is the key :)

    This is how I did it.

    The load_files declaration is:

    def load_files(container_path, description=None, categories=None,
                   load_content=True, shuffle=True, charset=None,
                   charset_error='strict', random_state=0)

    (Newer scikit-learn versions replace the charset/charset_error parameters with encoding/decode_error.)
    

    So do:

    dataset_files = load_files("path_to_directory_containing_category_folders")
    

    Then, once I had the fitted result, I put the filenames into clusters, which is a dictionary:

    from collections import defaultdict

    clusters = defaultdict(list)
    for k, label in enumerate(km.labels_):
        clusters[label].append(dataset_files.filenames[k])

    And then I print it :)

    for clust in clusters:
        print("\n************************\n")
        for filename in clusters[clust]:
            print(filename)
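    Putting the steps above together, here is a minimal runnable sketch. The document texts and filenames are hypothetical stand-ins for what load_files would give you from a real directory:

    ```python
    from collections import defaultdict

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical documents and filenames standing in for
    # dataset_files.data and dataset_files.filenames.
    docs = [
        "the cat sat on the mat",
        "cats and dogs are common pets",
        "stock markets fell sharply today",
        "investors worried about falling markets",
    ]
    filenames = ["doc0.txt", "doc1.txt", "doc2.txt", "doc3.txt"]

    # Vectorize the texts and cluster them into two groups.
    features = TfidfVectorizer().fit_transform(docs)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

    # Group the filenames by cluster label, as in the answer above.
    clusters = defaultdict(list)
    for k, label in enumerate(km.labels_):
        clusters[label].append(filenames[k])

    for clust in clusters:
        print(clust, clusters[clust])
    ```

    Each filename lands in exactly one cluster, so iterating over clusters recovers the full document-to-cluster mapping.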
  • 2020-12-28 11:46

    Forget about the Bunch object. It's just an implementation detail to load the toy datasets that are bundled with scikit-learn.

    In real life, with your real data, you just have to call directly:

    km = KMeans(n_clusters).fit(my_document_features)
    

    then collect cluster assignments from:

    km.labels_
    

    my_document_features is a 2D data structure: either a numpy array or a scipy.sparse matrix with shape (n_documents, n_features).

    km.labels_ is a 1D numpy array with shape (n_documents,). Hence the first element in labels_ is the index of the cluster of the document described in the first row of the my_document_features feature matrix.

    Typically you would build my_document_features with a TfidfVectorizer object:

    my_document_features = TfidfVectorizer().fit_transform(my_text_documents)
    

    and my_text_documents would be either a list of Python unicode objects, if you read the documents directly (e.g. from a database, rows of a single CSV file, or whatever you want), or alternatively:

    vec = TfidfVectorizer(input='filename')
    my_document_features = vec.fit_transform(my_text_files)
    

    where my_text_files is a Python list of the paths of your document files on your hard drive (assuming they are encoded using the UTF-8 encoding).

    The length of the my_text_files or my_text_documents lists should be n_documents hence the mapping with km.labels_ is direct.
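    The filename-based variant can be sketched end to end like this; the temporary files written here are hypothetical stand-ins for your documents on disk:

    ```python
    import os
    import tempfile

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical: write a few small UTF-8 text files to a temp directory.
    tmpdir = tempfile.mkdtemp()
    texts = [
        "apples and oranges",
        "oranges and bananas",
        "cars and trucks",
        "trucks and buses",
    ]
    my_text_files = []
    for i, text in enumerate(texts):
        path = os.path.join(tmpdir, "doc%d.txt" % i)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        my_text_files.append(path)

    # input='filename' makes the vectorizer read each path itself.
    vec = TfidfVectorizer(input="filename", encoding="utf-8")
    my_document_features = vec.fit_transform(my_text_files)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

    # labels_ lines up with my_text_files by position.
    for path, label in zip(my_text_files, km.labels_):
        print(path, "->", label)
    ```

    Because fit_transform preserves the order of my_text_files, km.labels_[i] is the cluster of the i-th file in the list.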

    As scikit-learn is not just for clustering or categorizing documents, we use the name "sample" instead of "document". This is why you will see n_samples used instead of n_documents to document the expected shapes of the arguments and attributes of all the estimators in the library.
