plot a document tfidf 2D graph

失恋的感觉 2020-12-07 13:27

I would like to plot a 2D graph with the terms on the x-axis and the TF-IDF score (or document id) on the y-axis for my list of sentences. I used scikit-learn's fit_transform() to get th

2 answers
  • 2020-12-07 14:07

    Just assign the cluster labels to a variable and use it to set the colors. For example: km = KMeans().fit(X), then clusters = km.labels_.tolist(), then pass c=clusters to plt.scatter().
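A minimal sketch of that idea, assuming a small made-up list of sentences and two clusters (the sentences, `n_clusters=2`, `n_init=10`, `random_state=0`, and the helper name `plot_clusters` are illustrative choices, not part of the original answer):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

# Hypothetical sample sentences, used only for illustration.
sentences = [
    "the rocket launched into orbit",
    "the satellite reached orbit around the moon",
    "astronauts repaired the space station",
    "the church held a service on sunday",
    "the priest spoke about faith and belief",
    "religion and belief were debated at length",
]

# TF-IDF matrix; toarray() yields a dense ndarray that PCA accepts.
X = TfidfVectorizer().fit_transform(sentences).toarray()

# Cluster in the full TF-IDF space and keep one label per sentence.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_.tolist()

# Project to 2D purely for plotting.
data2D = PCA(n_components=2).fit_transform(X)


def plot_clusters(points, labels):
    """Scatter the 2D points, colored by their cluster label."""
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels)
    return ax
```

Calling `plot_clusters(data2D, clusters)` followed by `plt.show()` draws the figure; each point takes its color from the cluster it was assigned to.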

  • 2020-12-07 14:08

    When you use Bag of Words, each of your sentences is represented in a high-dimensional space whose number of dimensions equals the vocabulary size. To show this in 2D you need to reduce the dimensionality, for example using PCA with two components:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    import matplotlib.pyplot as plt
    
    newsgroups_train = fetch_20newsgroups(subset='train', 
                                          categories=['alt.atheism', 'sci.space'])
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])        
    # toarray() returns a plain ndarray; recent scikit-learn rejects the np.matrix from todense()
    X = pipeline.fit_transform(newsgroups_train.data).toarray()
    
    pca = PCA(n_components=2).fit(X)
    data2D = pca.transform(X)
    plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)
    plt.show()              #not required if using ipython notebook
    

    [Image: scatter plot of data2D]

    Now you can, for example, calculate the cluster centers and plot them on top of this data:

    from sklearn.cluster import KMeans
    
    kmeans = KMeans(n_clusters=2).fit(X)
    centers2D = pca.transform(kmeans.cluster_centers_)
    
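    # plt.hold() was removed in Matplotlib 3.0; redraw the points so the centers overlay them
    plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)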
    plt.scatter(centers2D[:,0], centers2D[:,1], 
                marker='x', s=200, linewidths=3, c='r')
    plt.show()              #not required if using ipython notebook
    

