Doc2vec : How to get document vectors

Anonymous (unverified), submitted 2019-12-03 01:23:02

Question:

How do I get the document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.

I am using the gensim Python library.

doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)

I get AttributeError: 'list' object has no attribute 'words' whenever I run this.

Answer 1:

Gensim was updated: LabeledSentence no longer takes labels. It now takes tags; see the documentation for LabeledSentence at https://radimrehurek.com/gensim/models/doc2vec.html

However, @bee2502 was right with

docvec = model.docvecs[99]  

This returns the vector of the 100th document in the trained model. The index can be an integer or a string.



Answer 2:

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (document ids). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).

# Import libraries
from gensim.models import doc2vec
from collections import namedtuple

# Load data
doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps)
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1, if you want the model to work with the provided example data set)
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors
model.docvecs[0]
model.docvecs[1]

UPDATE (how to train in epochs): The Doc2Vec function has alpha and min_alpha parameters, but they only make the learning rate decay from alpha to min_alpha within a single epoch. To train for several epochs, set the learning rate manually, like this:

from gensim.models import doc2vec
import random

alpha_val = 0.025        # Initial learning rate
min_alpha_val = 1e-4     # Minimum for linear learning rate decay
passes = 15              # Number of passes of one document during training

alpha_delta = (alpha_val - min_alpha_val) / (passes - 1)

# Model initialization
model = doc2vec.Doc2Vec(size = 100, window = 300, min_count = 1, workers = 4)

model.build_vocab(docs) # Building vocabulary

for epoch in range(passes):

    # Shuffling gets better results
    random.shuffle(docs)

    # Train
    model.alpha, model.min_alpha = alpha_val, alpha_val
    model.train(docs)

    # Logs
    print('Completed pass %i at alpha %f' % (epoch + 1, alpha_val))

    # Next run alpha
    alpha_val -= alpha_delta
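As a rough illustration of the schedule the loop above walks through, the following sketch computes the per-pass learning rates without training anything. It uses only the Python standard library; the helper name alpha_schedule is hypothetical and is not part of gensim.

```python
def alpha_schedule(alpha=0.025, min_alpha=1e-4, passes=15):
    """Return the learning rate for each pass, decaying linearly.

    Hypothetical helper for illustration only; not a gensim function.
    """
    if passes < 2:
        return [alpha]
    delta = (alpha - min_alpha) / (passes - 1)
    return [alpha - i * delta for i in range(passes)]


schedule = alpha_schedule()
for epoch, value in enumerate(schedule, start=1):
    print('pass %i would train at alpha %f' % (epoch, value))
```

The first pass starts at the initial rate and the last pass lands on the minimum, which is why the step is divided by passes - 1 rather than passes.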


Answer 3:

doc1=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)

I got AttributeError: 'list' object has no attribute 'words' because the input documents passed to Doc2Vec() were not in the correct LabeledSentence format. I hope the example below helps you understand the format.

documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])  
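The error message suggests that each training document is accessed through a words attribute, which a plain list does not have. Here is a minimal sketch of the difference, using only the standard library. The Tagged namedtuple is a hypothetical stand-in for illustration, not gensim's own class.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged document: a record with words and tags.
Tagged = namedtuple('Tagged', 'words tags')

plain = "This is a sentence".split()           # a bare list of words, as in the question
tagged = Tagged(words=plain, tags=['SENT_0'])  # the same words wrapped with a tag

print(hasattr(plain, 'words'))   # a list has no .words attribute
print(hasattr(tagged, 'words'))  # the wrapped record does
print(tagged.words, tagged.tags)
```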

More details are here: http://rare-technologies.com/doc2vec-tutorial/ However, I solved the problem by reading the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace; tags are constructed automatically from the document line number.

sentences=doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences,size = 100, window = 300, min_count = 10, workers=4)
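To make the file format concrete, here is a minimal sketch of a reader that yields one words/tags record per line. It mimics the described format using only the standard library and assumes line numbers start at zero. The names read_tagged_lines and LineDoc are hypothetical, and this is not gensim's implementation of TaggedLineDocument.

```python
import io
from collections import namedtuple

# Hypothetical record type for illustration; not a gensim class.
LineDoc = namedtuple('LineDoc', 'words tags')


def read_tagged_lines(handle):
    """Yield one LineDoc per line: whitespace-split words, line number as tag."""
    for line_no, line in enumerate(handle):
        yield LineDoc(words=line.split(), tags=[line_no])


# An in-memory stand-in for a text file with one document per line.
sample = io.StringIO("this is a sentence\nthis is another sentence\n")
line_docs = list(read_tagged_lines(sample))
for line_doc in line_docs:
    print(line_doc.tags, line_doc.words)
```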

To get a document vector, use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

docvec = model.docvecs[99]  

where 99 is the id of the document whose vector we want. If the labels are integers (the default if you load using TaggedLineDocument()), use the integer id directly, as I did. If the labels are strings, use "SENT_99". This is similar to Word2vec.


