Output of lda.collapsed.gibbs.sampler command from R lda package

南楼画角 提交于 2021-01-27 23:45:09

问题


I don't understand this part of output from lda.collapsed.gibbs.sampler command. What I don't understand is why the numbers of the same word in different topics are different? For example, why for the word "test" there is 4 of them in second topics when topic 8 get 37 of them. Shouldn't number of same word in different topic be the same integer or 0?

Or Do I misunderstood something and these numbers don't stand for number of word in the topic?

$topics
      tests-loc fail  test testmultisendcookieget
 [1,]         0    0     0                      0
 [2,]         0    0     4                      0
 [3,]         0    0     0                      0
 [4,]         0    1     0                      0
 [5,]         0    0     0                      0
 [6,]         0    0     0                      0
 [7,]         0    0     0                      0
 [8,]         0    0    37                      0
 [9,]         0    0     0                      0
[10,]         0    0     0                      0
[11,]         0    0     0                      0
[12,]         0    2     0                      0
[13,]         0    0     0                      0
[14,]         0    0     0                      0
[15,]         0    0     0                      0
[16,]         0    0     0                      0
[17,]         0    0     0                      0
[18,]         0    0     0                      0
[19,]         0    0     0                      0
[20,]         0    0     0                      0
[21,]         0    0     0                      0
[22,]         0  361  1000                      0
[23,]         0    0     0                      0
[24,]         0    0     0                      0
[25,]         0    0     0                      0
[26,]         0    0     0                      0
[27,]         0    0     0                      0
[28,]         0 1904 12617                      0
[29,]         0    0     0                      0
[30,]         0    0     0                      0
[31,]         0    0     0                      0
[32,]         0 1255  3158                      0
[33,]         0    0     0                      0
[34,]         0    0     0                      0
[35,]         0    0     0                      0
[36,]         1    0     0                      1
[37,]         0    1     0                      0
[38,]         0    0     0                      0
[39,]         0    0     0                      0
[40,]         0    0     0                      0
[41,]         0    0     0                      0
[42,]         0    0     0                      0
[43,]         0    0     0                      0
[44,]         0    0     0                      0
[45,]         0    2     0                      0
[46,]         0    0     0                      0
[47,]         0    0     0                      0
[48,]         0    0     4                      0
[49,]         0    0     0                      0
[50,]         0    1     0                      0

Here is the code that I run.

library(lda)
data=read.documents(filename = "data.ldac")
vocab=read.vocab(filename = "words.csv")

K=100
num.iterations=100
alpha=1
eta=1


result = lda.collapsed.gibbs.sampler(data, K,vocab, num.iterations, alpha,eta, initial = NULL, burnin = NULL, compute.log.likelihood = FALSE,trace = 0L, freeze.topics = FALSE)

options(max.print=100000000) 
result

PS. Sorry for the long post and my bad english.


回答1:


The topic distributions in LDA are just that: multinomial distributions. These correspond to the rows of the matrix you have above. The probability of seeing a word in any given topic is not constrained to be a fixed value (or zero) for any of the topics. That is, the word 'test' can have a 3% chance of occurring in one topic, a 1% chance of occurring in another.

n.b. If you want to convert the matrix to probabilities just row normalize and add the smoothing constant from your prior. The function here just returns the raw number of assignments in the last Gibbs sampling sweep.



来源:https://stackoverflow.com/questions/21169266/output-of-lda-collapsed-gibbs-sampler-command-from-r-lda-package

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!