topic-modeling | 易学教程

Gensim LDA for text classification

Question: I am posting my question here because there are already some answers on how to use scikit methods with gensim, such as scikit vectorizers with gensim, but I haven't seen a whole pipeline for text classification. I will try to explain my situation. I want to use gensim's LDA implementation as a step toward text classification. I have one dataset that consists of three parts: train (25K), test (25K), and unlabeled data (50K). What I am trying to do is

Structural Topic Modeling in R: group the topics deductively and estimate effect

Question: The stm package in R allows the user to estimate the relationship between metadata and topics. I have a model M with 40 topics, and I want to explore how they change over time. In stm, it should be something like this (adapted from Molly Roberts et al., stm: R Package for Structural Topic Models): prep = estimateEffect(1:40 ~ s(day), M, meta = out$meta, uncertainty = "Global") This command will return 40 pairs of relationships, each corresponding to one topic. However, upon reading the topics I

python IndexError using gensim for LDA Topic Modeling

Question: Another thread has a similar question to mine but leaves out reproducible code. The goal of the script in question is to create a process that is as memory efficient as possible, so I tried to write the class corpus() to take advantage of gensim's capabilities. However, I am running into an IndexError that I'm not sure how to resolve when creating lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)) . The documents that I am
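The memory-efficient pattern the question describes, a corpus object that yields one document at a time instead of loading everything, can be sketched without gensim as follows. This is only an illustration: `StreamedCorpus`, `open_stream`, and `token2id` are hypothetical names, not the asker's `corpus()` class, and the sketch does not reproduce or diagnose the IndexError.

```python
import io
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words document per line without keeping all lines in memory."""

    def __init__(self, open_stream, token2id):
        # open_stream: callable returning a fresh iterable of text lines (hypothetical)
        # token2id: fixed mapping from token to integer id (hypothetical)
        self.open_stream = open_stream
        self.token2id = token2id

    def __iter__(self):
        for line in self.open_stream():
            # Count only tokens that exist in the fixed vocabulary
            counts = Counter(
                self.token2id[tok]
                for tok in line.lower().split()
                if tok in self.token2id
            )
            yield sorted(counts.items())


token2id = {"topic": 0, "model": 1, "corpus": 2}
text = "topic model topic\ncorpus unknown model\n"
corpus = StreamedCorpus(lambda: io.StringIO(text), token2id)
docs = list(corpus)
```

Because `open_stream` returns a fresh stream on every call, the corpus can be iterated more than once, which multi-pass training generally needs.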

How to use DBpedia properties to build a topic hierarchy?

Question: I am trying to build a topic hierarchy by following two DBpedia properties: skos:broader and dcterms:subject. My intention is, given a word, to identify its topic. For example, given the term 'support vector machine', I want to identify topics such as classification algorithm, machine learning, etc. However, I am sometimes confused about how to build the topic hierarchy, as I am getting more than 5 URIs for subject and many URIs for broader

LDA topic modeling - Training and testing

Question: I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents. References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. Thus, by using the LDA algorithm and the Gibbs sampler (or Variational Bayes), I can input a set of documents and get the topics as output. Each topic is a set of terms with assigned

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Question: Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to deploy my LDA model in a web application, and if there were a way to use matrix multiplication to get a similar result, then I could use the model in JavaScript. For example, I tried the following: import numpy as np import gensim from gensim.corpora import Dictionary from gensim import models import nltk from nltk.stem import
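Gensim's lookup on an unseen document runs iterative variational inference, so a single matrix multiplication will not reproduce it exactly. One alternative that uses only multiplications and normalizations, and so ports easily to JavaScript, is an EM-style "fold-in" against a fixed topic-word matrix. The sketch below is that swapped-in approximation, not gensim's algorithm: it ignores the Dirichlet prior, and `fold_in_topics` and the toy `topic_word` matrix are hypothetical.

```python
def fold_in_topics(doc_counts, topic_word, n_iter=50):
    """Estimate topic proportions for one unseen document by EM fold-in.

    doc_counts: dict mapping word id -> count in the document.
    topic_word: list of K rows, each a list of V word probabilities
                (a stand-in for a trained topic-word matrix).
    """
    k = len(topic_word)
    theta = [1.0 / k] * k  # start from a uniform topic mixture
    if sum(doc_counts.values()) == 0:
        return theta
    for _ in range(n_iter):
        new_theta = [0.0] * k
        for word_id, count in doc_counts.items():
            # Responsibility of each topic for this word under the current mixture
            weights = [theta[t] * topic_word[t][word_id] for t in range(k)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # word has zero probability under every topic
            for t in range(k):
                new_theta[t] += count * weights[t] / norm
        total = sum(new_theta)
        if total == 0.0:
            break
        theta = [v / total for v in new_theta]
    return theta


# Toy model: 2 topics over a 3-word vocabulary (made-up numbers)
topic_word = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
theta = fold_in_topics({0: 3, 1: 1}, topic_word)
```

The result will generally differ from what the trained model returns, so it is worth comparing the two on a few held-out documents before relying on it.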

Mallet DMR negative probability for feature-based topic-distribution?

Question: I've created a DMR topic model (via the Java API) which calculates the topic distribution based on the publication year of the documents. The resulting distribution is a bit confusing, because there are a lot of negative probabilities. Sometimes all probabilities for a whole topic are negative values. Q1: Why are there negative values? The lowest possible probability in a topic distribution for a given feature should be at least 0.0 ... I guess? Additionally, I built an LDA model where the
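One possible reading, assuming the reported numbers are DMR's per-feature weights rather than probabilities (the truncated question does not confirm this): in the DMR formulation of Mimno and McCallum, the Dirichlet prior for topic k is alpha_k = exp(x · lambda_k), so the weights lambda can be negative while the resulting priors stay positive. The sketch below uses made-up weights and hypothetical helper names to show the conversion.

```python
import math


def dmr_topic_prior(feature_values, topic_weights):
    """Compute alpha_k = exp(sum_f x_f * lambda_kf) for each topic k."""
    return [
        math.exp(sum(x * w for x, w in zip(feature_values, weights)))
        for weights in topic_weights
    ]


def normalize(values):
    """Scale positive values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


# Made-up log-linear weights for 2 topics and 2 features (intercept, year indicator)
topic_weights = [[-1.5, 0.2], [-0.3, -0.8]]
features = [1.0, 1.0]

alpha = dmr_topic_prior(features, topic_weights)  # always positive
expected_share = normalize(alpha)                 # prior mean topic proportions
```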

Output of lda.collapsed.gibbs.sampler command from R lda package

Question: I don't understand this part of the output from the lda.collapsed.gibbs.sampler command. What I don't understand is why the counts for the same word differ across topics. For example, why does the word "test" have a count of 4 in the second topic while topic 8 has 37? Shouldn't the count of the same word in different topics be either the same integer or 0? Or have I misunderstood something, and these numbers don't represent the number of times the word occurs in the topic? $topics tests-loc fail test
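As the lda package documentation describes it, `$topics` is a topics-by-vocabulary matrix in which each entry is the number of times a word was assigned to a topic. Every occurrence of a word gets its own topic assignment, so the same word can have different counts in different topics. A toy sketch with made-up tokens and assignments (`topic_word_counts` is a hypothetical helper, not part of the package):

```python
from collections import Counter


def topic_word_counts(tokens, assignments, num_topics):
    """Count how often each word was assigned to each topic."""
    counts = [Counter() for _ in range(num_topics)]
    for word, topic in zip(tokens, assignments):
        counts[topic][word] += 1
    return counts


# Each token occurrence carries its own sampled topic (made-up values)
tokens = ["test", "fail", "test", "test", "fail"]
assignments = [0, 0, 1, 1, 1]

counts = topic_word_counts(tokens, assignments, num_topics=2)
```

Here "test" occurs three times in total, once under topic 0 and twice under topic 1, so summing a word's column across topics recovers its corpus frequency.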

IndexError when trying to update gensim's LdaModel

Question: I am facing the following error when trying to update my gensim LdaModel: IndexError: index 6614 is out of bounds for axis 1 with size 6614 I checked why other people were having this issue in another thread, but their error came from changing the dictionary, whereas I am using the same dictionary from beginning to end. As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I am building the dictionary iteratively, thanks to this piece of code: fr_documents_lda = open("documents
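The message says a token id of 6614 was used against an axis of size 6614, where the valid ids are 0 to 6613. One possible cause, assumed here and not confirmed by the truncated question, is that the iteratively built dictionary gained new ids after the model was created. A small stdlib check like the one below (the helper name `out_of_range_ids` and the sample chunk are hypothetical) can show whether a chunk contains ids the model has no column for.

```python
def out_of_range_ids(bow_chunk, num_terms):
    """Return the sorted token ids in a bag-of-words chunk that are >= num_terms."""
    bad = set()
    for doc in bow_chunk:
        for token_id, _count in doc:
            if token_id >= num_terms:
                bad.add(token_id)
    return sorted(bad)


# Made-up chunk: each document is a list of (token_id, count) pairs
chunk = [[(0, 2), (6613, 1)], [(6614, 1), (12, 3)]]

# num_terms stands for the vocabulary size the model was created with
bad_ids = out_of_range_ids(chunk, num_terms=6614)
```

If the list is non-empty, the chunk was encoded with a larger vocabulary than the model was trained with, and the dictionary would need to be finalized before the model is created.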