Using gensim, I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA models?
When printing the
After some messing around, it seems that print_topics(numoftopics) for the ldamodel has a bug. My workaround is to use print_topic(topicid):
>>> print(lda.print_topics())
None
>>> for i in range(lda.num_topics):  # range(0, num_topics-1) would skip the last topic
...     print(lda.print_topic(i))
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...
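If you need the words and weights as data rather than as a display string, the formatted output of print_topic can be split apart. This is only a sketch: parse_topic_string is a hypothetical helper, not part of gensim, and it assumes the weight*word + weight*word layout shown above. Newer gensim releases wrap each word in double quotes, which the helper strips.

```python
def parse_topic_string(topic_string):
    """Split a 'weight*word + weight*word' string into (word, weight) pairs.

    Hypothetical helper, not part of gensim.
    """
    pairs = []
    for term in topic_string.split(" + "):
        weight, _, word = term.partition("*")
        # Some gensim versions quote the word, e.g. 0.083*"response".
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs


sample = "0.083*response + 0.083*interface + 0.083*time"
parsed = parse_topic_string(sample)
print(parsed)
```

In practice you would pass lda.print_topic(i) instead of the sample string.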
You can use:
for i in lda_model.show_topics():
    print(i[0], i[1])
You can also export the top words from each topic to a CSV file. topn controls how many words under each topic to export.
import pandas as pd

top_words_per_topic = []
for t in range(lda_model.num_topics):
    # show_topic returns (word, probability) pairs; prepend the topic id
    top_words_per_topic.extend([(t,) + x for x in lda_model.show_topic(t, topn=5)])

pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv")
The CSV file has the following format:
Topic Word P
0 w1 0.004437
0 w2 0.003553
0 w3 0.002953
0 w4 0.002866
0 w5 0.008813
1 w6 0.003393
1 w7 0.003289
1 w8 0.003197
...
The code below works fine, but I want to know the topic name instead of Topic: 0 and Topic: 1. How do I know which topic a word belongs to?
for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))
Topic: 0
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']
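LDA itself does not name topics. Each topic is just an integer id with a word distribution, so any name has to be assigned by you after reading the top words. Here is a minimal sketch of both parts of the question, assuming the topics have already been collected into a dict of id to word list. The sample words are taken from the output above; topic_labels and its label strings are made up for illustration.

```python
from collections import defaultdict

# Top words per topic id, e.g. built from
# lda_model.show_topics(formatted=False); shortened sample data.
topic_words = {
    0: ["associate", "incident", "time", "task", "pain"],
    1: ["work", "associate", "cage", "aid", "shift"],
}

# Hand-assigned names (hypothetical): LDA only gives integer ids.
topic_labels = {0: "incident reports", 1: "first aid on shift"}

# Inverted index: word -> ids of every topic whose top words include it.
word_to_topics = defaultdict(list)
for topic_id, words in topic_words.items():
    for word in words:
        word_to_topics[word].append(topic_id)

for topic_id, words in topic_words.items():
    print("{}: {}".format(topic_labels[topic_id], "|".join(words)))

print(word_to_topics["associate"])
```

The index maps each word to a list because the same word can show up among the top words of more than one topic.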
I think it is always more helpful to see the topics as a list of words. The following code snippet helps achieve that goal. I assume you already have an LDA model called lda_model.
for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))
In the above code, I have decided to show the first 30 words belonging to each topic. For simplicity, I have shown only the first two topics I get.
Topic: 0
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']
I don't really like the way the above topics look, so I usually modify my code as shown:
for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))
... and the output (first 2 topics shown) will look like this:
Topic: 0
Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
Topic: 1
Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head
I think the syntax of show_topics has changed over time:
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
For num_topics number of topics, return num_words most significant words (10 words per topic, by default).
The topics are returned as a list – a list of strings if formatted is True, or a list of (probability, word) 2-tuples if False.
If log is True, also output this result to log.
Unlike LSA, there is no natural ordering between the topics in LDA. The returned num_topics <= self.num_topics subset of all topics is therefore arbitrary and may change between two LDA training runs.
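Because topic ids carry no meaning across runs, comparing two trained models by id is unreliable. One rough way to line topics up is by the overlap of their top words. The sketch below is not a gensim feature: match_topics, jaccard, and the sample word lists are hypothetical, and picking the best Jaccard match per topic is just one simple choice (it does not force a one-to-one pairing).

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_topics(run_a, run_b):
    """For each topic id in run_a, find the most similar topic id in run_b.

    run_a and run_b map topic id -> list of top words.
    Returns {id_in_a: (best_id_in_b, similarity)}.
    """
    matches = {}
    for id_a, words_a in run_a.items():
        best = max(run_b, key=lambda id_b: jaccard(words_a, run_b[id_b]))
        matches[id_a] = (best, jaccard(words_a, run_b[best]))
    return matches


# Hypothetical top words from two training runs; the ids are swapped.
run_a = {0: ["work", "shift", "aid"], 1: ["pain", "task", "incident"]}
run_b = {0: ["incident", "pain", "time"], 1: ["aid", "work", "cage"]}

print(match_topics(run_a, run_b))
```

With real models, each dict could be filled from show_topics(formatted=False) in the same way as the word lists printed earlier.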