topic-modeling

Gensim LDA for text classification

99封情书 submitted on 2021-02-10 09:54:10
Question: I am posting my question here because there are already some answers on how to use scikit methods with gensim, like scikit vectorizers with gensim or this, but I haven't seen the whole pipeline used for text classification. I will try to explain my situation a little. I want to use gensim's LDA implementation as a step toward text classification. I have one dataset which consists of three parts (train (25K), test (25K), and unlabeled data (50K)). What I am trying to do is

Structural Topic Modeling in R: group the topics deductively and estimate effect

核能气质少年 submitted on 2021-02-08 09:16:22
Question: The stm package in R allows the user to estimate the relationship between metadata and topics. I have a model M with 40 topics, and I want to explore how they change with time. In stm, it should be something like this (adapted from Molly Roberts et al., stm: R Package for Structural Topic Models): prep = estimateEffect(1:40 ~ s(day), M, meta = out$meta, uncertainty = "Global") This command will return 40 pairs of relationships, each referring to one topic. However, upon reading the topics I

python IndexError using gensim for LDA Topic Modeling

ε祈祈猫儿з submitted on 2021-02-07 09:28:34
Question: Another thread has a similar question to mine but leaves out reproducible code. The goal of the script in question is to create a process that is as memory efficient as possible. So I tried to write a class corpus() to take advantage of gensim's capabilities. However, I am running into an IndexError that I'm not sure how to resolve when creating lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)) . The documents that I am
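
A minimal sketch of the memory-efficient idea described in this question, assuming one document per line in a text file. The class name StreamingCorpus, the whitespace tokenizer, and the token2id mapping are hypothetical stand-ins for the asker's corpus() class and a gensim dictionary; the point is only that __iter__ yields one bag-of-words at a time instead of holding every document in memory.

```python
import os
import tempfile
from collections import Counter


class StreamingCorpus:
    """Yield one bag-of-words per line of a text file, never the whole file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id  # hypothetical vocabulary: token -> integer id

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(line.lower().split())
                # Keep only known tokens, as sorted (token_id, count) pairs.
                yield sorted(
                    (self.token2id[tok], n)
                    for tok, n in counts.items()
                    if tok in self.token2id
                )


# Demo on a throwaway two-line file (made-up content).
token2id = {"topic": 0, "model": 1, "corpus": 2}
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("topic model topic\ncorpus model unknown\n")

bows = list(StreamingCorpus(tmp.name, token2id))
os.remove(tmp.name)
print(bows)
```

Because the object re-opens the file on every pass, it can be iterated several times, which is what a multi-pass trainer needs; every id it yields is guaranteed to be one the vocabulary knows about.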

python IndexError using gensim for LDA Topic Modeling

与世无争的帅哥 submitted on 2021-02-07 09:26:59
Question: Another thread has a similar question to mine but leaves out reproducible code. The goal of the script in question is to create a process that is as memory efficient as possible. So I tried to write a class corpus() to take advantage of gensim's capabilities. However, I am running into an IndexError that I'm not sure how to resolve when creating lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)) . The documents that I am

How to use DBpedia properties to build a topic hierarchy?

佐手、 submitted on 2021-02-07 08:19:59
Question: I am trying to build a topic hierarchy by following the two DBpedia properties mentioned below: the skos:broader property and the dcterms:subject property. My intention is, given a word, to identify its topic. For example, given the term 'support vector machine', I want to identify topics for it such as classification algorithm, machine learning, etc. However, I am sometimes a bit confused about how to build a topic hierarchy, as I am getting more than 5 URIs for subject and many URIs for broader
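
One way to keep the number of URIs manageable is to climb the hierarchy breadth-first and stop at a fixed depth. The sketch below assumes the dcterms:subject and skos:broader links have already been fetched into plain dictionaries; the entries shown are hypothetical edges for illustration only, not real DBpedia query results, and topics_for is a made-up helper name.

```python
from collections import deque

# Hypothetical edges for illustration only, not real DBpedia query results.
subjects = {"support vector machine": ["classification algorithms"]}
broader = {
    "classification algorithms": ["machine learning"],
    "machine learning": ["artificial intelligence"],
}


def topics_for(term, max_depth):
    """Collect categories reachable from a term, with their distance from it."""
    found = {}
    queue = deque((cat, 1) for cat in subjects.get(term, []))
    while queue:
        category, depth = queue.popleft()
        if category in found or depth > max_depth:
            continue  # already visited, or too far up the hierarchy
        found[category] = depth
        for parent in broader.get(category, []):
            queue.append((parent, depth + 1))
    return found


print(topics_for("support vector machine", max_depth=2))
```

The recorded depth gives a simple ranking signal: categories closer to the term are more specific, and the depth cap keeps very general ancestors out of the result.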

LDA topic modeling - Training and testing

烈酒焚心 submitted on 2021-02-05 12:51:10
Question: I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents. References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. Thus, by using the LDA algorithm and the Gibbs sampler (or Variational Bayes), I can input a set of documents and get the topics as output. Each topic is a set of terms with assigned

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

半世苍凉 submitted on 2021-01-29 16:38:43
Question: Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model in a web application, and if there were a way to use matrix multiplication to get a similar result, then I could use the model in JavaScript. For example, I tried the following: import numpy as np import gensim from gensim.corpora import Dictionary from gensim import models import nltk from nltk.stem import
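
As a rough sketch of the matrix-multiplication idea in this question: multiplying a document's word-count vector by the topic-word matrix and normalizing gives one score per topic in a single pass. This is not what LDA_Model[unseenDoc] computes, since gensim's inference is iterative, so the result should be treated as a crude approximation at best. The matrix values below are made up for illustration; in practice they would come from the trained model's topic-word matrix, and approximate_topic_scores is a hypothetical helper name.

```python
def approximate_topic_scores(word_counts, topic_word):
    """One matrix-vector product followed by normalization.

    word_counts: list of length V with the document's term counts.
    topic_word:  K rows of length V, each row a topic's word probabilities.
    """
    scores = [
        sum(p * c for p, c in zip(row, word_counts))
        for row in topic_word
    ]
    total = sum(scores)
    if total == 0:
        # No known words in the document: fall back to a uniform guess.
        return [1.0 / len(topic_word)] * len(topic_word)
    return [s / total for s in scores]


# Made-up 2-topic, 3-word matrix (each row sums to 1).
topic_word = [
    [0.5, 0.25, 0.25],
    [0.0, 0.5, 0.5],
]
doc_counts = [2, 0, 0]  # the document only contains word 0
print(approximate_topic_scores(doc_counts, topic_word))
```

The arithmetic uses nothing beyond sums and products, so it ports directly to JavaScript once the matrix has been exported, for example as JSON.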

Mallet DMR negative probability for feature-based topic-distribution?

删除回忆录丶 submitted on 2021-01-28 05:09:37
Question: I've created a DMR topic model (via the Java API) which calculates the topic distribution based on the publication year of the documents. The resulting distribution is a bit confusing, because there are a lot of negative probabilities. Sometimes all probabilities for a whole topic are negative values. See: Q1: Why are there negative values? The lowest possible probability in a topic distribution for a given feature should be 0.0 ... I guess? Additionally, I built an LDA model where the
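
A possible reading, stated as an assumption because the output itself is not shown here: in the DMR formulation the per-feature values are regression weights, and a document's Dirichlet prior for each topic is the exponential of the weighted feature sum, so negative weights are legitimate and only the exponentiated, normalized values behave like proportions. The sketch below illustrates that mapping with made-up weights; dmr_topic_proportions is a hypothetical helper, not part of the Mallet API.

```python
import math


def dmr_topic_proportions(lambdas, features):
    """Map per-topic feature weights to expected topic proportions.

    lambdas:  one weight vector per topic (values may be negative).
    features: the document's feature vector, same length as each weight vector.
    """
    # alpha_k = exp(lambda_k . x) is always positive, whatever the weights are.
    alphas = [
        math.exp(sum(w * x for w, x in zip(weights, features)))
        for weights in lambdas
    ]
    total = sum(alphas)
    # Expected proportions under a Dirichlet prior with these alphas.
    return [a / total for a in alphas]


# Made-up weights for 2 topics over (intercept, year indicator).
lambdas = [[-1.0, -0.5], [-1.0, 0.5]]
proportions = dmr_topic_proportions(lambdas, [1.0, 1.0])
print(proportions)
```

Under this reading, a topic whose weights are all negative simply has a small prior for those features, not a negative probability.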

Output of lda.collapsed.gibbs.sampler command from R lda package

南楼画角 submitted on 2021-01-27 23:45:09
Question: I don't understand this part of the output from the lda.collapsed.gibbs.sampler command. What I don't understand is why the counts of the same word differ across topics. For example, why are there 4 occurrences of the word "test" in the second topic when topic 8 gets 37 of them? Shouldn't the count of a word be the same integer in every topic, or 0? Or did I misunderstand something, and these numbers don't stand for the number of words in the topic? $topics tests-loc fail test
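
One way to read the $topics matrix: as I understand the R lda package documentation, it is a K x V matrix whose entries count how many times each word (column) was assigned to each topic (row), so the same word can legitimately have different counts in different topics. A minimal sketch of that bookkeeping, written in Python rather than R and using made-up token assignments:

```python
from collections import Counter

# Made-up sampler state: one (word, topic) pair per token in the corpus.
assignments = [
    ("test", 8), ("test", 8), ("test", 2),
    ("fail", 2), ("test", 8), ("fail", 8),
]


def topic_word_counts(assignments, num_topics, vocab):
    """Build a K x V matrix of how often each word was assigned to each topic."""
    counts = Counter(assignments)
    return [
        [counts[(word, topic)] for word in vocab]
        for topic in range(1, num_topics + 1)  # topics numbered from 1, as in R
    ]


matrix = topic_word_counts(assignments, num_topics=8, vocab=["fail", "test"])
print(matrix[1])  # row for topic 2
print(matrix[7])  # row for topic 8
```

Summing a column over all topics gives the total number of occurrences of that word in the corpus, which is the only quantity that is fixed across sampler runs.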

IndexError when trying to update gensim's LdaModel

ⅰ亾dé卋堺 submitted on 2020-12-26 11:04:20
Question: I am facing the following error when trying to update my gensim LdaModel: IndexError: index 6614 is out of bounds for axis 1 with size 6614 I checked why other people were having this issue on this thread, but, unlike in their case, I am using the same dictionary from beginning to end. As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I am building the dictionary iteratively, in this way, with this piece of code: fr_documents_lda = open("documents
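
The message says that an index equal to the size of axis 1 was requested, which is what happens when a document contains a token id that the model's vocabulary does not cover. A hedged sketch of a sanity check to run on each chunk before updating: the function name and the plain-list bag-of-words format are assumptions for illustration, and num_terms stands for the vocabulary size the model was created with (6614 in the error above).

```python
def find_out_of_range_ids(chunk, num_terms):
    """Return (doc_index, token_id) pairs whose id the model cannot index."""
    return [
        (doc_index, token_id)
        for doc_index, bow in enumerate(chunk)
        for token_id, _count in bow
        if not 0 <= token_id < num_terms
    ]


# Made-up chunk of two bag-of-words documents.
chunk = [
    [(0, 2), (6613, 1)],  # fine: the largest valid id is num_terms - 1
    [(5, 1), (6614, 3)],  # 6614 is one past the end
]
problems = find_out_of_range_ids(chunk, num_terms=6614)
print(problems)
```

If the check reports anything, the dictionary grew after the model was created, which can happen when it is built iteratively while the data is loaded chunk by chunk.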