Mallet topic modeling - topic keys output parameter

人盡茶涼 submitted on 2020-01-02 08:58:26

Question


In MALLET topic modelling, the --output-topic-keys [FILENAME] option prints a parameter beside each topic. The tutorial on the MALLET site calls it the "Dirichlet parameter" of the topic.

What does this parameter represent? Is it β in the LDA model? If not, what is it, what does it mean, and what is it used for?

I noticed that when I don't use the hyperparameter optimization option while generating the topic model, this parameter differs between version 2.0.7 and version 2.0.8. I want to know why this difference occurs.

Here is the version 2.0.7 output

and 2.0.8

I know that the output differs from run to run, but I am only concerned with this parameter.


Answer 1:


The topic model inference algorithm used in Mallet repeatedly samples a new topic assignment for each word while holding the assignments of all other words fixed. Two factors control this process: (1) how often the current word type appears in each topic and (2) how many times each topic appears in the current document. The smoothing parameters ensure that these values are never zero for any topic: beta for the first factor, alpha for the second.

You can think of the alpha parameter displayed here as the number of "imaginary" words added to each topic in every document. In the first case (version 2.0.7), topic 0 has 2.5 imaginary words of weight in every document. The default value for this parameter was initially 50 / numTopics. Larger values encourage more uniform topic distributions within documents; smaller values encourage sparsity. The general experience was that 50 was too large and that 5 is a better default, so the default was changed to 5 / numTopics in 2.0.8.
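To make the arithmetic concrete, here is a rough sketch of how the two defaults play out. It assumes 20 topics, which is inferred from 2.5 = 50 / numTopics rather than stated in the question, and the helper names and the example document are hypothetical.

```python
def per_topic_alpha(alpha_sum, num_topics):
    """Symmetric per-topic alpha when a total weight is split evenly."""
    return alpha_sum / num_topics


def smoothed_doc_topic_proportions(doc_topic_counts, alpha):
    """Topic proportions for one document after adding alpha imaginary words per topic."""
    total = sum(doc_topic_counts) + alpha * len(doc_topic_counts)
    return [(count + alpha) / total for count in doc_topic_counts]


num_topics = 20  # assumption: inferred from the 2.5 value, not given in the thread

alpha_old = per_topic_alpha(50, num_topics)  # 2.5, the older default
alpha_new = per_topic_alpha(5, num_topics)   # 0.25, the newer default

# Hypothetical document: 10 real tokens, all assigned to topic 0.
counts = [10] + [0] * (num_topics - 1)

old_props = smoothed_doc_topic_proportions(counts, alpha_old)
new_props = smoothed_doc_topic_proportions(counts, alpha_new)

# With alpha = 2.5 the 50 imaginary words outweigh the 10 real ones, so topic 0
# gets only about 0.21 of the document; with alpha = 0.25 it gets about 0.68.
print(round(old_props[0], 2), round(new_props[0], 2))
```

The smaller alpha lets the observed counts dominate, which is the sparsity effect described above.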

By default the alpha weight is the same for all topics. With hyperparameter optimization on, these values can vary. Usually you will find that a topic with a large value contains "near stopwords" that are frequent in most documents and don't carry much content. Topics with very small values often correspond to unusual, distinctive documents. Topics in the middle are often the most interesting.
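One way to act on this is to sort the topics by their alpha value after training. The sketch below assumes each line of the topic-keys file holds three tab-separated fields (topic number, alpha, space-separated top words); the function names are hypothetical and the sample lines are invented for illustration, not real MALLET output.

```python
def parse_topic_keys(text):
    """Parse topic-keys text into (topic_id, alpha, words) tuples.

    Assumes tab-separated fields: topic number, alpha, top words.
    """
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, alpha, words = line.split("\t", 2)
        topics.append((int(topic_id), float(alpha), words.split()))
    return topics


def rank_by_alpha(topics):
    """Sort topics from largest alpha to smallest."""
    return sorted(topics, key=lambda topic: topic[1], reverse=True)


# Invented sample, shaped like output from a run with optimization on.
sample = (
    "0\t0.05\tvolcano eruption lava ash\n"
    "1\t1.80\talso one would made\n"
    "2\t0.40\tmusic band album song\n"
)

for topic_id, alpha, words in rank_by_alpha(parse_topic_keys(sample)):
    print(topic_id, alpha, " ".join(words))
```

In this invented sample the topic with the largest alpha is the one made of generic words, matching the "near stopwords" pattern, while the smallest alpha belongs to a narrow, distinctive topic.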




Answer 2:


If I understand it correctly, the parameter is alpha, not beta.

You can get an asymmetric alpha by using the flag

--optimize-interval INTEGER

to reestimate the hyperparameters every INTEGER iterations.



Source: https://stackoverflow.com/questions/45162186/mallet-topic-modeling-topic-keys-output-parameter
