Doc2vec: How to get document vectors

南笙 2020-12-12 14:25

How to get document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction / help me with som

4 Answers
  •  刺人心 (OP)
    2020-12-12 14:47

    To train a Doc2Vec model, each document in your data set needs a list of words (the same token-list format Word2Vec uses) and a list of tags (document IDs). Documents can also carry additional information (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).

    # Import libraries
    
    from gensim.models import doc2vec
    from collections import namedtuple
    
    # Load data
    
    doc1 = ["This is a sentence", "This is another sentence"]
    
    # Transform data (you can add more data preprocessing steps) 
    
    docs = []
    analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
    for i, text in enumerate(doc1):
        words = text.lower().split()
        tags = [i]
        docs.append(analyzedDocument(words, tags))
    
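    # Train model (set min_count = 1 if you want the model to work with the provided example data set).
    # In gensim 4.0 and later, use `vector_size` (formerly `size`) and `model.dv` (formerly `model.docvecs`).
    
    model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 300, min_count = 1, workers = 4)
    
    # Get the vectors (looked up by the integer tags assigned above)
    
    model.dv[0]
    model.dv[1]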
    

    UPDATE (how to train in epochs): This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.
