How to understand the output of Topic Model class in Mallet?

Question

I'm trying out the example code from the topic modeling developer's guide, and I would really like to understand what its output means.

First, while it runs, it prints:

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5 battle union confederate tennessee american states 
1   0,5 hawes sunderland echo war paper commonwealth 
2   0,5 test including cricket australian hill career 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen confederates buell 
5   0,5 years yard national thylacine wilderness parks 
6   0,5 gunnhild norway life extinct gilbert thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings south ring dust 2 uranus 
9   0,5 tasmanian back time sullivan london century 

<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669

0   0,5 battle union confederate tennessee united numerous 
1   0,5 hawes sunderland echo paper commonwealth early 
2   0,5 test cricket south australian hill england 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen war time 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 including gunnhild norway life time thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus survived 
9   0,5 back london modern sullivan gilbert needham 

<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388

0   0,5 battle union confederate tennessee war united 
1   0,5 sunderland echo paper edward england world 
2   0,5 test cricket south australian hill record 
3   0,5 average equipartition theorem energy system kinetic 
4   0,5 hawes kentucky army gen grant confederates 
5   0,5 years yard national thylacine wilderness tasmanian 
6   0,5 gunnhild norway including king life devil 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 london sullivan gilbert thespis back mother 

<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186

0   0,5 battle union confederate grant tennessee numerous 
1   0,5 sunderland echo survived paper edward england 
2   0,5 test cricket south australian hill park 
3   0,5 average equipartition theorem energy system law 
4   0,5 hawes kentucky army gen time confederates 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 gunnhild including norway life king time 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 back london sullivan gilbert thespis 3 

<200> LL/token: -8,54771

Total time: 6 seconds

Question 1: What does "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask" in the first line mean? I only understand the "10 topics" part.

Question 2: What does LL/token in "<10> LL/token: -9,24097 <20> LL/token: -9,1026 <30> LL/token: -8,95386 <40> LL/token: -8,75353" mean? It looks like a metric for the Gibbs sampling. But shouldn't it be monotonically increasing?

And after that, the following is printed:

elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4) 
1   0.008   sunderland (6) years (6) echo (5) survived (3) paper (3) 
2   0.040   test (6) cricket (5) hill (4) park (3) career (3) 
3   0.008   average (6) equipartition (6) system (5) theorem (5) law (4) 
4   0.073   hawes (7) kentucky (6) army (5) gen (4) war (4) 
5   0.008   yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4) 
6   0.202   gunnhild (5) norway (4) life (4) including (3) king (3) 
7   0.202   zinta (4) role (3) hindi (3) actress (3) film (3) 
8   0.040   rings (10) ring (3) dust (3) 2 (3) uranus (3) 
9   0.411   london (4) sullivan (3) gilbert (3) thespis (3) back (3) 
0   0.55

The first line in this part is probably the token-topic assignment, right?

Question 3: For the first topic,

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)

0.008 is said to be the "topic distribution". Is it the distribution of this topic over the whole corpus? If so, there seems to be a conflict: the tokens of topic 0 shown above appear in the corpus 8+7+6+4+4+... times, while those of topic 7 appear only 4+3+3+3+3+... times. Topic 7 should therefore have a lower distribution value than topic 0, yet it has a higher one. This is what I can't understand. Furthermore, what is that "0 0.55" at the end?

Thank you very much for reading this long post. Hope you can answer it and hope this could be helpful for others interested in Mallet.

best

Answer 1:

I don't think I know enough to give a very complete answer, but here's a shot at some of it. For Q1, you can inspect the code to see how those values are calculated. For Q2, LL/token is the model's log-likelihood divided by the total number of tokens; it is a measure of how likely the data are given the model. Increasing values mean the model is improving. These values are also available in the R packages for topic modeling. On the first line of the second part of the output (the token-topic assignment), yes, I think that's right. Q3 is a good question and it's not immediately clear to me; perhaps the (x) values are some kind of index, as token frequency seems unlikely. Presumably most of these are diagnostics of some kind.
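To make the Q2 point concrete: the LL/token values are negative, so "increasing" means moving toward zero, and because Gibbs sampling is stochastic the value can dip between reports even while the overall trend improves. The following is a minimal Python sketch (not part of Mallet) that parses the first few log lines from the question, which were printed with a decimal comma, and lists the iterations where the value dropped. The names parse_ll and drops are made up for this illustration.

```python
import re

# First nine LL/token reports, copied from the question (decimal commas).
LOG = """\
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353
<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669
"""


def parse_ll(log):
    """Return a list of (iteration, ll_per_token) pairs from the log text."""
    pattern = re.compile(r"<(\d+)>\s*LL/token:\s*(-?\d+[.,]?\d*)")
    # Accept both "-9,24097" and "-9.24097" by normalising the separator.
    return [(int(it), float(val.replace(",", ".")))
            for it, val in pattern.findall(log)]


series = parse_ll(LOG)

# Iterations at which LL/token got worse compared with the previous report.
drops = [cur_it
         for (_, prev), (cur_it, cur) in zip(series, series[1:])
         if cur < prev]

print("first:", series[0], "last:", series[-1])
print("iterations with a drop:", drops)
```

The series improves overall, but not at every report, which is what one would expect from a sampler rather than a deterministic optimiser.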

A more useful set of diagnostics can be obtained with bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml which will produce a large number of measures of topic quality. They're definitely worth checking out.

For the full story about all of this I'd suggest writing an email to David Mimno at Princeton who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel and then posting answers back here for those of us curious about the inner workings of MALLET...
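On Q1, one reading that reproduces the numbers in that log line: the "topic mask" is the smallest all-ones bit pattern that can hold every topic id, and "topic bits" is the number of bits in that mask, so that a topic id and a count can be packed into a single integer. This is an assumption about what the ParallelTopicModel code does and should be verified against the source. A minimal Python sketch of that arithmetic (the function names are illustrative, not Mallet's):

```python
def topic_mask_and_bits(num_topics):
    """Return (mask, bits) under the assumed scheme, for num_topics >= 1."""
    if num_topics & (num_topics - 1) == 0:
        # Exact power of two: ids 0..num_topics-1 fit exactly.
        mask = num_topics - 1
    else:
        # Otherwise round up to the next power of two.
        highest_one_bit = 1 << (num_topics.bit_length() - 1)
        mask = highest_one_bit * 2 - 1
    bits = bin(mask).count("1")
    return mask, bits


def pack(count, topic, bits):
    """Store the count in the high bits and the topic id in the low bits."""
    return (count << bits) + topic


def unpack(packed, mask, bits):
    """Recover (count, topic) from a packed integer."""
    return packed >> bits, packed & mask


mask, bits = topic_mask_and_bits(10)
print(f"10 topics, {bits} topic bits, {mask:b} topic mask")

packed = pack(8, 0, bits)  # e.g. a word seen 8 times in topic 0
print(packed, unpack(packed, mask, bits))
```

With 10 topics this gives 4 bits and the mask 1111, matching the first line of the output in the question.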

Answer 2:

What I understand is this:

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)

0 is the topic number.
0.008 is the weight of that topic.
battle (8) union (7) [...] are the top keywords of that topic. The numbers are the occurrences of each word in the topic.
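A quick check on those weights: the ten values printed in the question add up to 1 (to three decimals), so they behave like a probability distribution over the ten topics. A minimal Python sketch using only the numbers from the question's output:

```python
# Topic weights copied from the second block of output in the question.
weights = {
    0: 0.008, 1: 0.008, 2: 0.040, 3: 0.008, 4: 0.073,
    5: 0.008, 6: 0.202, 7: 0.202, 8: 0.040, 9: 0.411,
}

total = sum(weights.values())
heaviest = max(weights, key=weights.get)

print(f"sum of weights: {total:.3f}")
print(f"heaviest topic: {heaviest} ({weights[heaviest]})")
```

This only shows that the weights are normalised; it does not by itself say what they are a distribution over.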

As a result, you also get a .csv file. I think it contains the most important data of the process. Each row holds values like the following:

0   0   285 10   page make items thing work put dec browsers recipes expressions

That is:

Tree level
Topic ID
Total words
Total documents
Top-10 words
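If that reading of the columns is right, a row can be split into named fields as in the sketch below. The field names are taken from the list above and are an interpretation, not a documented Mallet format; the sample row is the one quoted in this answer, and parse_row is a made-up helper.

```python
from typing import NamedTuple, List


class TopicRow(NamedTuple):
    tree_level: int
    topic_id: int
    total_words: int
    total_documents: int
    top_words: List[str]


def parse_row(line):
    """Split one whitespace-separated row into the five assumed fields."""
    level, topic, words, docs, rest = line.split(None, 4)
    return TopicRow(int(level), int(topic), int(words), int(docs), rest.split())


row = parse_row(
    "0   0   285 10   page make items thing work put dec browsers recipes expressions"
)
print(row.topic_id, row.total_words, row.total_documents, row.top_words[:3])
```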

A little bit late, but I hope it helps someone.

Answer 3:

For question 3, I believe the 0.008 (the "topic distribution") relates to the prior \alpha over topic distributions for documents. Mallet optimises this prior, essentially allowing some topics to carry more "weight". Mallet seems to be estimating that topic 0 accounts for a small proportion of your corpus.

The token counts represent only the words with highest counts. The remaining counts for topic 0 could, for example, be 0, and the remaining counts for topic 9 could be 3. Thus topic 9 can account for many more words in your corpus than topic 0, even though the counts for the top words are lower.
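A small worked example of that argument, as a Python sketch. The top-five counts are the ones printed in the question; the tail counts (40 further words per topic) are hypothetical and chosen only to mirror the "0 versus 3" illustration above.

```python
# Top-five word counts copied from the question's output.
top_counts = {
    0: [8, 7, 6, 4, 4],  # battle, union, confederate, grant, tennessee
    9: [4, 3, 3, 3, 3],  # london, sullivan, gilbert, thespis, back
}

# HYPOTHETICAL counts for the words that are not printed.
tail_counts = {
    0: [0] * 40,  # assumed: topic 0 has nothing beyond its top words
    9: [3] * 40,  # assumed: topic 9 has a long tail of words with count 3
}

totals = {
    topic: sum(top_counts[topic]) + sum(tail_counts[topic])
    for topic in top_counts
}

for topic, total in totals.items():
    print(f"topic {topic}: top-5 sum = {sum(top_counts[topic])}, total = {total}")
```

Under these assumed tails, topic 9 covers far more tokens than topic 0 even though every one of its top words has a lower count.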

I'd have to check the code for the "0 0.55" at the end, but that's probably the optimised \beta value (which I'm pretty sure is not optimised asymmetrically).

Source: https://stackoverflow.com/questions/8447393/how-to-understand-the-output-of-topic-model-class-in-mallet

Tags

machine-learning

topic-modeling

mallet