How to run tsne on word2vec created from gensim?

遇见更好的自我 2020-12-30 18:05

I want to visualize a word2vec model created with the gensim library. I tried sklearn, but it seems I need to install a developer version to get it. I tried installing the developer ve

2 answers
  • 2020-12-30 18:30

    Use the code below, but in place of the toy X, stack all your word embeddings vertically with numpy.vstack into a matrix X and then call fit_transform on it.

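    import numpy as np
    from sklearn.manifold import TSNE

    # Toy data: 4 samples with 3 features each
    X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
    # Recent scikit-learn releases require perplexity < number of samples
    model = TSNE(n_components=2, perplexity=3, random_state=0)
    np.set_printoptions(suppress=True)
    print(model.fit_transform(X))  # array of shape (4, 2)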
    

    The output of fit_transform has shape (vocab_size, 2), so you can visualize it. To build X from a trained model:

    # gensim < 4.0; on gensim >= 4.0 use word2vec_model.wv.index_to_key instead of wv.vocab
    vocab = sorted(word2vec_model.wv.vocab)
    emb_tuple = tuple(word2vec_model.wv[v] for v in vocab)
    X = np.vstack(emb_tuple)
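
The stacking step can be tried without a trained model. The sketch below is only an illustration: the `embeddings` dictionary is a hypothetical stand-in for the word vectors, not part of gensim. It shows that fixing the word order once keeps row i of X aligned with `vocab[i]`, which matters later when you want to label the plotted points.

```python
import numpy as np

# Hypothetical stand-in for a trained model: word -> 3-dimensional vector
embeddings = {
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.2, 0.1, 0.4]),
    "car": np.array([0.9, 0.8, 0.1]),
}

# Fix the word order once so that row i of X always belongs to vocab[i]
vocab = sorted(embeddings)
X = np.vstack([embeddings[word] for word in vocab])

print(vocab)    # ['car', 'cat', 'dog']
print(X.shape)  # (3, 3)
```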
    
  • 2020-12-30 18:38

    You don't need a developer version of scikit-learn - just install scikit-learn the usual way via pip or conda.

    To access the word vectors created by word2vec, index the model's keyed vectors (model.wv) with the vocabulary dictionary:

    X = model.wv[model.wv.vocab]
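
The vocabulary attribute depends on the gensim version: `wv.vocab` exists before 4.0, while 4.0 and later expose `wv.index_to_key` instead. A minimal sketch of a helper that tolerates both is shown here. The name `vocabulary_words` and the `SimpleNamespace` stand-ins are hypothetical and only mimic the two attribute layouts so the snippet runs without gensim.

```python
from types import SimpleNamespace


def vocabulary_words(wv):
    """Return the vocabulary of a KeyedVectors-like object as a list of words."""
    if hasattr(wv, "index_to_key"):  # gensim >= 4.0
        return list(wv.index_to_key)
    return list(wv.vocab)            # gensim < 4.0 (dict keyed by word)


# Hypothetical stand-ins that mimic the two attribute layouts
new_style = SimpleNamespace(index_to_key=["the", "memory"])
old_style = SimpleNamespace(vocab={"the": None, "memory": None})

print(vocabulary_words(new_style))  # ['the', 'memory']
print(vocabulary_words(old_style))  # ['the', 'memory']
```

With a real model this would be used as `words = vocabulary_words(model.wv)` followed by `X = model.wv[words]`.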
    

    The following is a simple but complete code example that loads some newsgroup data, applies very basic data preparation (cleaning and splitting into sentences), trains a word2vec model, reduces the dimensions with t-SNE, and visualizes the output.

    from gensim.models.word2vec import Word2Vec
    from sklearn.manifold import TSNE
    from sklearn.datasets import fetch_20newsgroups
    import re
    import matplotlib.pyplot as plt
    
    # Download example data (may take a while)
    train = fetch_20newsgroups()
    
    def clean(text):
        """Remove posting header, split by sentences and words, keep only letters"""
        lines = re.split(r'[?!.:]\s', re.sub(r'^.*Lines: \d+', '', re.sub('\n', ' ', text)))
        return [re.sub('[^a-zA-Z]', ' ', line).lower().split() for line in lines]
    
    sentences = [line for text in train.data for line in clean(text)]
    
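    # gensim < 4.0 API; on gensim >= 4.0 pass vector_size=100 instead of size=100
    model = Word2Vec(sentences, workers=4, size=100, min_count=50, window=10, sample=1e-3)

    print(model.wv.most_similar('memory'))

    # One row per vocabulary word; on gensim >= 4.0 use model.wv[model.wv.index_to_key]
    X = model.wv[model.wv.vocab]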
    
    tsne = TSNE(n_components=2)
    X_tsne = tsne.fit_transform(X)
    
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
    plt.show()
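
The scatter plot above shows unlabeled points. To see which point is which word, each row of the t-SNE output has to be paired with the word at the same position in the vocabulary used to build X. The following is a small sketch of that pairing. The helper `label_positions` and the sample data are hypothetical, and the `limit` argument is just one way to keep the plot readable.

```python
import numpy as np


def label_positions(words, coords, limit=50):
    """Pair the first `limit` words with their 2-D coordinates."""
    coords = np.asarray(coords)
    return [
        (word, float(x), float(y))
        for word, (x, y) in zip(words[:limit], coords[:limit])
    ]


# Hypothetical stand-in for the vocabulary list and the t-SNE output
words = ["memory", "disk", "drive"]
coords = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])

for word, x, y in label_positions(words, coords, limit=2):
    print(word, x, y)
```

With matplotlib, each tuple can then be drawn with `plt.annotate(word, (x, y))` before calling `plt.show()`.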
    