Python Gensim: how to calculate document similarity using the LDA model?

Backend · 3 answers · 926 views
Asked by 遇见更好的自我 · 2020-12-12 21:22

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials, I still can't figure out how to do it.

3 Answers
  • 2020-12-12 21:32

    Don't know if this'll help, but I managed to get successful results on document matching and similarity when using an actual document as the query.

    from gensim import corpora, models, similarities

    dictionary = corpora.Dictionary.load('dictionary.dict')
    corpus = corpora.MmCorpus("corpus.mm")
    lda = models.LdaModel.load("model.lda")  # result of running online LDA (training)

    # Build a similarity index over the LDA representation of the whole corpus
    index = similarities.MatrixSimilarity(lda[corpus])
    index.save("simIndex.index")

    # Use an actual document from disk as the query
    docname = "docs/the_doc.txt"
    with open(docname, 'r') as f:
        doc = f.read()
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lda = lda[vec_bow]

    # Similarity between the query and every document in the indexed corpus
    sims = index[vec_lda]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    print(sims)
    

    `sims` is a list of (document_index, similarity) pairs, so the similarity score between your query document and each document in the corpus is the second element of each pair.
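
    The structure is easier to see with made-up numbers; the values below are illustrative, not real output from the index above:

    ```python
    # Hypothetical result of the query: (document_index, similarity) pairs,
    # already sorted by descending similarity score.
    sims = [(2, 0.93), (0, 0.71), (1, 0.12)]

    # The corpus index lives at position 0 of each pair, the score at position 1.
    best_doc, best_score = sims[0]
    print(best_doc, best_score)  # 2 0.93
    ```
    
    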

  • 2020-12-12 21:33

    The provided answers are good, but they aren't very beginner-friendly, so I'll start from training the LDA model and then calculate cosine similarity.

    Training model part:

    docs = ["latent Dirichlet allocation (LDA) is a generative statistical model", 
            "each document is a mixture of a small number of topics",
            "each document may be viewed as a mixture of various topics"]
    
    # Convert document to tokens
    docs = [doc.split() for doc in docs]
    
    # A mapping from token to id in each document
    from gensim.corpora import Dictionary
    dictionary = Dictionary(docs)
    
    # Representing the corpus as a bag of words
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    
    # Training the model
    from gensim.models import LdaModel
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
    

    For extracting the probability assigned to each topic for a document, there are generally two ways. I show both here:

    # Preprocess the test documents the same way as the training documents
    test_doc = ["LDA is an example of a topic model",
                "topic modelling refers to the task of identifying topics"]
    test_doc = [doc.split() for doc in test_doc]
    test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]
    
    # Method 1
    from gensim.matutils import cossim
    doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
    doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
    print(cossim(doc1, doc2))
    
    # Method 2
    doc1 = model[test_corpus[0]]
    doc2 = model[test_corpus[1]]
    print(cossim(doc1, doc2))
    

    output:

    #Method 1
    0.8279631530869963
    
    #Method 2
    0.828066885140262
    

    As you can see, both methods are essentially the same; the difference is that the probabilities returned by the second method sometimes don't add up to one, as discussed here. For a large corpus, the probability vector can be obtained by passing the whole corpus:

    # Method 1
    probability_vector = model.get_document_topics(test_corpus, minimum_probability=0)
    # Method 2
    probability_vector = model[test_corpus]
    

    NOTE: The sum of the probabilities assigned to the topics of a document may come out slightly above or below 1. That is due to floating-point rounding errors.
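
    If that small deviation from 1 matters downstream, the vector can be rescaled; a minimal sketch in plain Python (the helper name is mine, not part of Gensim):

    ```python
    def renormalize(topic_vec):
        """Rescale a sparse (topic_id, probability) vector so it sums to 1."""
        total = sum(p for _, p in topic_vec)
        return [(t, p / total) for t, p in topic_vec] if total > 0 else topic_vec

    vec = [(0, 0.30), (1, 0.72)]            # sums to 1.02 due to rounding
    fixed = renormalize(vec)
    print(sum(p for _, p in fixed))         # ~1.0, up to float precision
    ```
    
    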

  • 2020-12-12 21:45

    Depends what similarity metric you want to use.

    Cosine similarity is universally useful & built-in:

    sim = gensim.matutils.cossim(vec_lda1, vec_lda2)
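
    `cossim` operates on the sparse `(id, weight)` vectors Gensim produces. For intuition, here is the same computation in plain Python — a sketch, not Gensim's actual implementation:

    ```python
    import math

    def sparse_cossim(v1, v2):
        """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
        d1, d2 = dict(v1), dict(v2)
        dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
        n1 = math.sqrt(sum(w * w for w in d1.values()))
        n2 = math.sqrt(sum(w * w for w in d2.values()))
        if n1 == 0 or n2 == 0:
            return 0.0
        return dot / (n1 * n2)

    print(sparse_cossim([(0, 0.7), (3, 0.3)], [(0, 0.7), (3, 0.3)]))  # ~1.0
    print(sparse_cossim([(0, 1.0)], [(1, 1.0)]))                      # 0.0
    ```
    
    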
    

    Hellinger distance is useful for similarity between probability distributions (such as LDA topics):

    import gensim
    import numpy as np

    # Convert sparse LDA vectors into dense topic-probability distributions
    dense1 = gensim.matutils.sparse2full(lda_vec1, lda.num_topics)
    dense2 = gensim.matutils.sparse2full(lda_vec2, lda.num_topics)
    sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
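
    The same formula on plain Python lists, to make it concrete (Gensim also ships a built-in `gensim.matutils.hellinger` that does this for you):

    ```python
    import math

    def hellinger(p, q):
        """Hellinger distance between two discrete probability distributions."""
        return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                                   for a, b in zip(p, q)))

    print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0 -- identical distributions
    print(hellinger([1.0, 0.0], [0.0, 1.0]))  # 1.0 -- disjoint distributions
    ```
    
    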
    