Document topical distribution in Gensim LDA

点点圈 提交于 2019-11-27 17:30:33

问题


I've derived a LDA topic model using a toy corpus as follows:

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

id2word = {}
for word in dictionary.token2id:    
    id2word[dictionary.token2id[word]] = word

I found that when I use a small number of topics to derive the model, Gensim yields a full report of topical distribution over all potential topics for a test document. E.g.:

test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

However when I use a large number of topics, the report is no longer complete:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

It seems to me that topics with a probability less than some threshold (I observed 0.01 to be more specific) are omitted form the output.

I'm wondering if this behaviour is due to some aesthetic considerations? And how can I get the distribution of the probability mass residual over all other topics?

Thank you for your kind answer!


回答1:


Read the source and it turns out that topics with probabilities smaller than a threshold are ignored. This threshold is with a default value of 0.01.




回答2:


I realise this is an old question but in case someone stumbles upon it, here is a solution (the issue has actually been fixed in the current development branch with a minimum_probability parameter to LdaModel but maybe you're running an older version of gensim).

define a new function (this is just copied from the source)

def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)]

the above function does not filter the output topics based on the probability but will output all of them. If you don't need the (topic_id, value) tuples but just values, just return the topic_dist instead of the list comprehension (it'll be much faster as well).



来源:https://stackoverflow.com/questions/17310933/document-topical-distribution-in-gensim-lda

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!