Inefficiency of topic modelling for text clustering

梦想与她 submitted on 2019-12-02 14:30:54

Question


I tried to cluster text using LDA, but it isn't giving me distinct clusters. Below is my code:

#Import libraries
from gensim import corpora, models
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS
from itertools import chain

#stop words
stoplist = list(STOPWORDS)
new = ['education','certification','certificate','certified']
stoplist.extend(new)
stoplist.sort()

#read data
#raw string so the backslash in the Windows path is not treated as an escape
dat = pd.read_csv(r'D:\data_800k.csv', encoding='latin').Certi.tolist()
#remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist] for document in dat]
#dictionary
dictionary = corpora.Dictionary(texts)
#corpus
corpus = [dictionary.doc2bow(text) for text in texts]
#train model
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=25, workers=4,minimum_probability=0)
#print topics
lda.print_topics(num_topics=25, num_words=7)
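#get per-document topic distributions (materialized once, since lda[corpus] is lazy and would rerun inference on every pass)
lda_corpus = list(lda[corpus])
#calculate cutoff score: flatten all topic probabilities of all documents
scores = [score for doc_topics in lda_corpus for topic_id, score in doc_topics]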


#threshold
threshold = sum(scores)/len(scores)
threshold
#output: 0.039999999971137644

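#cluster1: documents whose probability for topic 0 exceeds the threshold
cluster1 = [doc for doc_topics, doc in zip(lda_corpus, dat) if doc_topics[0][1] > threshold]

#cluster2: documents whose probability for topic 1 exceeds the threshold
cluster2 = [doc for doc_topics, doc in zip(lda_corpus, dat) if doc_topics[1][1] > threshold]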

The problem is that the clusters overlap: documents in cluster1 also tend to appear in cluster2, and so on.

I also tried raising the threshold manually to 0.5, but I get the same issue.


Answer 1:


That is just realistic.

Neither documents nor words are usually uniquely assignable to a single cluster.

If you manually label some data, you will also quickly find documents that cannot be clearly labeled as one or the other. So it's good if the algorithm doesn't pretend there is a good unique assignment.
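A side note on the threshold in the question: each document's topic probabilities sum to roughly 1, so averaging all of them over 25 topics gives about 1/25 = 0.04 regardless of the data, and most documents will clear that bar for several topics at once. If disjoint groups are needed anyway, one common option (not something LDA itself prescribes) is to assign each document to its single most probable topic. The following is a minimal sketch under that assumption; `hard_assign`, `min_prob`, and the toy numbers are hypothetical names and data introduced only for illustration, and the input is assumed to have the shape gensim yields when iterating `lda[corpus]`, with at least one pair per document.

```python
from collections import defaultdict

def hard_assign(doc_topics, docs, min_prob=0.0):
    """Put each document in the single topic with the highest probability.

    doc_topics: one list of (topic_id, probability) pairs per document
                (assumed non-empty for every document).
    min_prob:   hypothetical cutoff; documents whose best topic scores
                below it are collected under the key None.
    """
    clusters = defaultdict(list)
    for topics, doc in zip(doc_topics, docs):
        # Pick the (topic_id, probability) pair with the largest probability.
        best_id, best_prob = max(topics, key=lambda pair: pair[1])
        key = best_id if best_prob >= min_prob else None
        clusters[key].append(doc)
    return dict(clusters)

# Toy stand-in for lda[corpus] with 3 topics (made-up numbers).
toy_topics = [
    [(0, 0.70), (1, 0.20), (2, 0.10)],
    [(0, 0.45), (1, 0.50), (2, 0.05)],
    [(0, 0.34), (1, 0.33), (2, 0.33)],
]
toy_docs = ["doc a", "doc b", "doc c"]

clusters = hard_assign(toy_topics, toy_docs, min_prob=0.4)
for topic_id, members in clusters.items():
    print(topic_id, members)
```

Every document lands in exactly one group, so the groups cannot overlap. The price is the one described above: a document like "doc c", which is spread almost evenly across topics, either gets forced into a topic or ends up in the unassigned group, depending on the cutoff.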



Source: https://stackoverflow.com/questions/49380258/inefficiency-of-topic-modelling-for-text-clustering
