Doc2vec: How to get document vectors

南笙 2020-12-12 14:25

How to get document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction / help me with som

4 answers
  •  抹茶落季
    2020-12-12 14:55

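    from gensim.models import doc2vec

    doc = ["This is a sentence", "This is another sentence"]
    # Plain token lists, not tagged documents: passing these to Doc2Vec raises the error described below.
    documents = [d.strip().split(" ") for d in doc]
    model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)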
    

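    With the code above I got AttributeError: 'list' object has no attribute 'words', because the documents passed to Doc2Vec() were plain lists rather than objects in the LabeledSentence format. Later gensim versions call this class TaggedDocument, with a tags field in place of labels. I hope the example below helps you understand the format.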

    documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1']) 
    

    See http://rare-technologies.com/doc2vec-tutorial/ for more details. However, I solved the problem by reading the input data from a file with TaggedLineDocument().
    File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace; tags are constructed automatically from the document line number.

    sentences=doc2vec.TaggedLineDocument(file_path)
    model = doc2vec.Doc2Vec(sentences,size = 100, window = 300, min_count = 10, workers=4)
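To make the file format concrete, here is a minimal pure-Python sketch of the "one line = one tagged document" rule described above. It does not use gensim: TaggedDoc and iter_tagged_lines are hypothetical stand-ins written for illustration, and the zero-based numbering is an assumption about how line numbers are counted.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in with the two fields Doc2Vec reads: words and tags.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def iter_tagged_lines(path):
    """Yield one TaggedDoc per line, tagged with its line number (assumed zero-based)."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            # Words are assumed to be preprocessed already, so a whitespace split is enough.
            yield TaggedDoc(words=line.split(), tags=[line_no])


# Write a two-document corpus file: one document per line.
corpus_dir = tempfile.mkdtemp()
corpus_path = os.path.join(corpus_dir, "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as handle:
    handle.write("this is a sentence\n")
    handle.write("this is another sentence\n")

tagged = list(iter_tagged_lines(corpus_path))
for item in tagged:
    print(item.tags, item.words)
```

Each object carries both the tokens and an integer tag, which is why a document can later be looked up by its line number.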
    

    To get a document vector, use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

    docvec = model.docvecs[99] 
    

    where 99 is the id of the document whose vector we want. If the labels are integers (the default when you load with TaggedLineDocument()), use the integer id directly, as I did. If the labels are strings, use the string instead, e.g. "SENT_99". This is similar to looking up a word in Word2vec.
