Is there a way to infer topic distributions for an unseen document from a pre-trained gensim LDA model using matrix multiplication?

Submitted by 半世苍凉 on 2021-01-29 16:38:43

Question


Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to integrate my LDA model into a web application, and if there were a way to use matrix multiplication to get a similar result, I could use the model in JavaScript.

For example, I tried the following:

import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')


# Hypothetical helper (its definition was not shown in the original post);
# one plausible implementation using the imports above:
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def StemmAndLemmatize(token):
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))


def Preprocesser(text):

    smallestWordSize = 3
    processedList = []

    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))

    return processedList

lda_model = models.LdaModel.load(r'LDAModel\GoldModel')  #Load pretrained LDA model
dictionary = Dictionary.load(r"ModelTrain\ManDict")      #Load the dictionary the model was trained on

#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()    #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc)                #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)       #Create bow using dictionary
dictSize = len(termTopicMatrix[0])          #Number of terms in the dictionary
fullDict = np.zeros(dictSize)               #Initialize dense count vector of dictionary length
First = [first[0] for first in bowDoc]      #Indices of terms in the bag of words
Second = [second[1] for second in bowDoc]   #Frequencies of those terms
fullDict[First] = Second                    #Scatter the word frequencies into the dense vector


print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])

Output:
Matrix Multiplication: 
 [0.0283254  0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
 0.01558603 0.0370233  0.04648389 0.02887623 0.00776652 0.02147539
 0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
 0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
 0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
 0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax: 
 [(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]

In the pretrained model there are 35 topics and 1155 words.

In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.

For example, lda_model[bowDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?


Answer 1:


You can review the full source code of LdaModel's get_document_topics() method in your installation, or online at:

https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283

(It also makes use of the inference() method in the same file.)

It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and get the steps to match up.

It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
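For reference, here is a minimal single-document Python sketch of that inference loop, adapted from the linked source; infer_topics is a hypothetical helper name, not a gensim API. Note that it works from lda_model.expElogbeta (the exponentiated expected log of the topic-term weights) and lda_model.alpha, not the normalized matrix returned by get_topics(); that substitution is one reason a raw dot product against get_topics() cannot reproduce the results.

import numpy as np
from scipy.special import psi  # digamma function

def dirichlet_expectation(gamma):
    # E[log theta] for theta ~ Dirichlet(gamma), as in gensim's source
    return psi(gamma) - psi(np.sum(gamma))

def infer_topics(bow, exp_elog_beta, alpha, iterations=50, gamma_threshold=0.001):
    # bow: list of (term_id, count) pairs from dictionary.doc2bow()
    # exp_elog_beta: lda_model.expElogbeta, shape (num_topics, num_terms)
    # alpha: lda_model.alpha, shape (num_topics,)
    ids = [term_id for term_id, _ in bow]
    cts = np.array([count for _, count in bow], dtype=float)
    num_topics = exp_elog_beta.shape[0]

    # Randomly initialize the variational parameter gamma, then run a
    # fixed-point iteration until the mean absolute change is small
    gamma = np.random.gamma(100., 1. / 100., num_topics)
    exp_elog_theta = np.exp(dirichlet_expectation(gamma))
    beta_d = exp_elog_beta[:, ids]
    phinorm = np.dot(exp_elog_theta, beta_d) + 1e-100

    for _ in range(iterations):
        last_gamma = gamma
        gamma = alpha + exp_elog_theta * np.dot(cts / phinorm, beta_d.T)
        exp_elog_theta = np.exp(dirichlet_expectation(gamma))
        phinorm = np.dot(exp_elog_theta, beta_d) + 1e-100
        if np.mean(np.abs(gamma - last_gamma)) < gamma_threshold:
            break

    return gamma / gamma.sum()  # normalize to a topic distribution

# e.g. infer_topics(bowDoc, lda_model.expElogbeta, lda_model.alpha)

Also note that get_document_topics() drops topics whose probability falls below minimum_probability (0.01 by default), which is why the "Conventional Syntax" output lists only five of the 35 topics. Porting the loop above to JavaScript mainly requires a digamma implementation and the two exported arrays.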



Source: https://stackoverflow.com/questions/62200198/is-there-a-way-to-infer-topic-distributions-on-unseen-document-from-gensim-lda-p
