LDA

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

Submitted by 痴心易碎 on 2021-02-06 02:30:02

Question: I am using the Gensim HDP module on a set of documents.

```python
>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17
```

Why is the number of topics independent of corpus length?

Answer 1: @user3907335 is exactly correct here: HDP will calculate as many topics as the …
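The truncated answer's point is worth spelling out: gensim's HdpModel truncates the (in principle infinite) topic set at a fixed maximum, 150 topics by default, so it always reports that many topics regardless of corpus size; the corpus only affects the weights. A minimal pure-Python sketch (the weights are hypothetical) of filtering down to the topics that actually carry mass:

```python
def significant_topics(topic_weights, threshold=1e-3):
    """Keep only the indices of topics whose overall weight exceeds a threshold."""
    return [i for i, w in enumerate(topic_weights) if w > threshold]

# Hypothetical: 150 truncated topics, but only 3 carry real probability mass.
weights = [0.5, 0.3, 0.19] + [1e-6] * 147

print(len(weights))                      # 150 topics reported either way
print(len(significant_topics(weights)))  # 3 topics actually in use
```

In practice one would apply the same filter to the per-topic weights the trained HDP model reports, and treat the survivors as the effective topic count.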

LDA topic modeling - Training and testing

Submitted by 烈酒焚心 on 2021-02-05 12:51:10

Question: I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents. References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the "topics" expressed by the documents in that collection. Thus, by using the LDA algorithm and the Gibbs sampler (or variational Bayes), I can input a set of documents and get the topics as output. Each topic is a set of terms with assigned …
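Since the question is about what LDA's output looks like, here is a toy pure-Python illustration (all numbers hypothetical) of the structure that Gibbs sampling or variational Bayes ultimately estimates: each topic is a probability distribution over terms, and each document is a mixture of topics.

```python
# Toy stand-in for LDA output: topic -> P(word | topic),
# document -> P(topic | document).
topics = {
    0: {"gene": 0.04, "dna": 0.02, "genetic": 0.01},
    1: {"brain": 0.04, "neuron": 0.02, "nerve": 0.01},
}
doc_topic = {"doc1": {0: 0.9, 1: 0.1}}

def p_word(doc, word):
    """P(word | doc) marginalizes over the document's topic mixture."""
    return sum(theta * topics[k].get(word, 0.0)
               for k, theta in doc_topic[doc].items())

print(round(p_word("doc1", "gene"), 4))  # 0.9 * 0.04 = 0.036
```

Training ("uncovering the topics") means estimating the two tables above from nothing but the documents; applying the model to a new document means estimating only its `doc_topic` row.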

CalledProcessError: Returned non-zero exit status 1

Submitted by 北城余情 on 2021-02-04 08:27:28

Question: When I try to run:

```python
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

def make_bigrams(texts):
    return [bigram_mod1[doc] for doc in texts]

# Remove Stop Words
data_words_nostops1 = remove_stopwords(data_words1)

# Form Bigrams
data_words_bigrams1 = make_bigrams(data_words_nostops1)

# Create Dictionary
id2word1 = corpora.Dictionary(data_words_bigrams1)

# Create Corpus
texts1 = data_words_bigrams1

# Term Document …
```
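The exception name in the title is a hint: CalledProcessError comes from Python's subprocess module, and in gensim pipelines it is typically raised further down, when an external binary the wrapper shells out to (e.g. Mallet) exits with a non-zero status, not by the preprocessing shown above. A minimal reproduction of the exception itself (assumes a POSIX `sh`):

```python
import subprocess

# subprocess.run(..., check=True) raises CalledProcessError when the
# child process exits with a non-zero status -- the error in the title.
try:
    subprocess.run(["sh", "-c", "exit 1"], check=True)
except subprocess.CalledProcessError as err:
    rc = err.returncode

print(rc)  # 1
```

When this surfaces through a wrapper, the useful debugging step is to run the failing external command by hand and read its own error output, since the Python traceback only reports the exit status.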

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Submitted by 半世苍凉 on 2021-01-29 16:38:43

Question: Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the `LDA_Model[unseenDoc]` syntax? I am trying to use my LDA model in a web application, and if there were a way to get a similar result via matrix multiplication, I could use the model in JavaScript. For example, I tried the following:

```python
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import …
```
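Exact LDA inference on an unseen document is iterative (a variational E-step or sampling), so a single matrix multiplication can only approximate it. Still, a crude one-shot estimate, scoring each topic by how much probability mass its term distribution puts on the document's words, is just a matrix-vector product and ports easily to JavaScript. A pure-Python sketch with a hypothetical 2-topic, 3-term model (not gensim's actual inference):

```python
# phi[k][w] = P(word w | topic k), a hypothetical trained topic-term matrix.
phi = [
    [0.7, 0.2, 0.1],   # topic 0
    [0.1, 0.2, 0.7],   # topic 1
]
counts = [3, 1, 1]     # unseen document's bag-of-words count vector

# One-shot approximation: theta_k proportional to counts . phi[k].
scores = [sum(c * p for c, p in zip(counts, row)) for row in phi]
total = sum(scores)
theta = [s / total for s in scores]

print([round(t, 3) for t in theta])  # [0.667, 0.333]
```

For the real thing, the topic-term matrix can be exported from the trained model and the iterative update reimplemented client-side; the one-shot product above is only a rough starting point.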

How to tune the parameters for gensim `LdaMulticore` in Python

Submitted by 夙愿已清 on 2021-01-29 08:24:41

Question: I was running gensim's LdaMulticore package for topic modelling in Python. I tried to understand the meaning of the parameters of LdaMulticore and found a website that explains their usage. As a non-expert, I have some difficulty understanding these intuitively. I also consulted other materials, but this page seems to give relatively full explanations of every parameter. The page says:

chunksize — Number of documents to be used in …
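As a quick reference alongside that page, the parameters that matter most in practice can be collected in a plain dict. The names below match LdaMulticore's signature; the values are illustrative choices, not recommendations:

```python
# Key LdaMulticore knobs (illustrative values, tune per corpus):
lda_params = {
    "num_topics": 20,      # K: number of latent topics to learn
    "workers": 3,          # worker processes for training
    "chunksize": 2000,     # documents per training chunk (memory vs. speed)
    "passes": 10,          # full sweeps over the whole corpus
    "iterations": 100,     # max inner-loop iterations per document
    "eval_every": None,    # perplexity estimation interval; None disables it
    "random_state": 42,    # seed, for reproducible topics
}

print(sorted(lda_params))
```

These would be passed as `LdaMulticore(corpus, id2word=dictionary, **lda_params)`. Roughly: `chunksize` and `workers` trade memory for throughput, while `passes` (over the corpus) and `iterations` (per document) control how thoroughly the model is fit.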

Building LDAvis plots using phrase tokens instead of single word tokens

Submitted by 孤街浪徒 on 2021-01-29 07:47:36

Question: My question is very simple. How can one build LDAvis's frequentist topic-modeling plots with phrase tokens instead of single-word tokens using the text2vec package in R? Currently, the word tokenizer `tokens = word_tokenizer(tokens)` works great, but is there a phrase or n-gram tokenizer to enable building LDAvis topic models and corresponding plots with phrases instead of words? If not, how might such code be constructed? Is this even methodologically sound or advisable?

Source: …
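Whether text2vec exposes a ready-made collocation model is worth checking in its documentation, but the underlying idea is language-agnostic: count adjacent word pairs, promote frequent ones to single phrase tokens, and feed the merged tokens to the topic model and LDAvis as usual. A pure-Python sketch of that merging step (threshold and data are hypothetical):

```python
from collections import Counter

def merge_bigrams(docs, min_count=2):
    """Join adjacent word pairs seen at least min_count times into one token."""
    pairs = Counter(p for d in docs for p in zip(d, d[1:]))
    keep = {p for p, c in pairs.items() if c >= min_count}
    out = []
    for d in docs:
        merged, i = [], 0
        while i < len(d):
            if i + 1 < len(d) and (d[i], d[i + 1]) in keep:
                merged.append(d[i] + "_" + d[i + 1])
                i += 2
            else:
                merged.append(d[i])
                i += 1
        out.append(merged)
    return out

docs = [["topic", "model", "rocks"], ["topic", "model", "plots"]]
print(merge_bigrams(docs))  # [['topic_model', 'rocks'], ['topic_model', 'plots']]
```

Production systems usually score pairs by a collocation statistic (PMI, log-likelihood ratio) rather than a raw count, but the downstream pipeline is identical: phrase tokens simply become vocabulary entries.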

Output of lda.collapsed.gibbs.sampler command from R lda package

Submitted by 南楼画角 on 2021-01-27 23:45:09

Question: I don't understand this part of the output of the lda.collapsed.gibbs.sampler command. What I don't understand is why the counts of the same word differ across topics. For example, why are there 4 of the word "test" in the second topic when topic 8 gets 37 of them? Shouldn't the count of the same word in different topics be the same integer or 0? Or have I misunderstood something, and these numbers don't stand for the number of occurrences of the word in the topic?

```
$topics
    tests-loc  fail  test …
```
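The resolution is that collapsed Gibbs sampling assigns each individual occurrence of a word to exactly one topic, and the `$topics` matrix counts those assignments, so the same word type can legitimately have different counts under different topics. A pure-Python sketch with hypothetical assignments matching the numbers in the question:

```python
from collections import Counter

# One (word, assigned_topic) pair per token OCCURRENCE in the corpus.
# 41 occurrences of "test": the sampler put 37 in topic 8, 4 in topic 2.
assignments = [("test", 8)] * 37 + [("test", 2)] * 4 + [("fail", 2)] * 5

counts = Counter(assignments)
print(counts[("test", 8)], counts[("test", 2)])  # 37 4
```

So the rows of `$topics` sum, per word, to that word's total frequency in the corpus; they are assignment counts, not independent per-topic frequencies.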

IndexError when trying to update gensim's LdaModel

Submitted by ⅰ亾dé卋堺 on 2020-12-26 11:04:20

Question: I am facing the following error when trying to update my gensim LdaModel:

```
IndexError: index 6614 is out of bounds for axis 1 with size 6614
```

I checked why other people were having this issue in this thread, but I am using the same dictionary from beginning to end, which was their error. As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I am building the dictionary iteratively, thanks to this piece of code:

```python
fr_documents_lda = open("documents …
```
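This error usually means the vocabulary grew after the model was created: the model sizes its internal term-topic matrix to the dictionary at construction time, so any token id at or above that size (here 6614, in a matrix whose axis has exactly 6614 columns, ids 0–6613) is out of bounds on update. A minimal reproduction with a plain list standing in for one matrix row:

```python
num_terms = 6614               # vocabulary size when the model was built
topic_row = [0.0] * num_terms  # stand-in for one row of the term-topic matrix

new_token_id = 6614            # id introduced by extending the dictionary later
try:
    topic_row[new_token_id] += 1.0
except IndexError as err:
    msg = type(err).__name__

print(msg)  # IndexError
```

The usual fix is to finish building the dictionary over all chunks before the first model construction, so the vocabulary is frozen by the time training starts; extending the dictionary between updates invalidates the existing matrix shape.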

Transforming a gensim.interfaces.TransformedCorpus to a readable result

Submitted by 喜夏-厌秋 on 2020-12-15 06:31:55

Question: I am using the Mallet LDA via gensim's wrapper. Now I want to get the topic distribution of several unseen documents, store it in a nested list, and then print it out. This is my code:

```python
other_texts = [
    ['wlan', 'usb', 'router'],
    ['auto', 'auto', 'auto'],
    ['human', 'system', 'computer']
]
corpus1 = [id2word.doc2bow(text) for text in other_texts]
to_pro = []
for t in corpus1:
    unseen_doc = corpus1
    vector = lda[unseen_doc]  # get topic probability distribution for a document
    to_pro …
```
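Once the wrapper has produced per-document (topic_id, probability) pairs, turning them into a readable nested list is plain Python. A sketch with hypothetical values standing in for the TransformedCorpus output (note, too, that the loop above indexes the whole `corpus1` instead of the loop variable `t`, which yields one result for the entire corpus rather than one per document):

```python
# Hypothetical per-document output: lists of (topic_id, probability) pairs.
vectors = [
    [(0, 0.85), (1, 0.15)],   # document 1's topic distribution
    [(1, 0.97), (2, 0.03)],   # document 2's topic distribution
]

# Render each pair as a readable string, keeping the nested structure.
to_pro = [[f"topic {k}: {p:.0%}" for k, p in vec] for vec in vectors]
for row in to_pro:
    print(row)
```

The same comprehension applies directly to `[lda[doc] for doc in corpus1]`; iterating the transformed corpus materializes it into exactly this nested-list shape.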