lda

How to plot classification borders on a Linear Discriminant Analysis plot in R

Submitted by 有些话、适合烂在心里 on 2021-02-16 08:54:47
Question: I have used a linear discriminant analysis (LDA) to investigate how well a set of variables discriminates between 3 groups. I then used the plot.lda() function to plot my data on the two linear discriminants (LD1 on the x-axis and LD2 on the y-axis). I would now like to add the classification borders from the LDA to the plot, but I cannot see an argument in the function that allows this. The partimat() function allows visualisation of the LD classification borders, but variables are used as the x…
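This entry concerns R's plot.lda() and partimat(); since the code elsewhere on this page is Python, the sketch below is only a rough Python/scikit-learn analogue of the same idea (all names and data are illustrative, not taken from the question): project the data onto LD1/LD2, refit a classifier in that 2-D space, and shade its predictions over a grid to reveal the class borders.

# Sketch: class borders in the LD1/LD2 plane (Python analogue of the R question above).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                    # any 3-class dataset works
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)                                 # coordinates on LD1/LD2

# Refit an LDA in the 2-D discriminant space so predictions can be made on a grid.
lda_2d = LinearDiscriminantAnalysis().fit(Z, y)
xx, yy = np.meshgrid(np.linspace(Z[:, 0].min() - 1, Z[:, 0].max() + 1, 300),
                     np.linspace(Z[:, 1].min() - 1, Z[:, 1].max() + 1, 300))
regions = lda_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, regions, alpha=0.2)             # shaded regions mark the class borders
plt.scatter(Z[:, 0], Z[:, 1], c=y, edgecolor="k")
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.show()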

Gensim LDA for text classification

Submitted by 99封情书 on 2021-02-10 09:54:10
Question: I post my question here because there are already some answers on how to use scikit methods with gensim, like scikit vectorizers with gensim or this, but I haven't seen the whole pipeline used for text classification. Let me explain my situation a little: I want to use gensim's LDA implementation in order to proceed to text classification. I have one dataset which consists of three parts (train (25K), test (25K) and unlabeled data (50K)). What I am trying to do is…
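A minimal sketch of one such pipeline (not necessarily the asker's final setup): train a gensim LDA model on the training documents, use each document's topic distribution as a feature vector, and fit a scikit-learn classifier on top. The tiny token lists and labels below are illustrative placeholders.

# Sketch: gensim LDA topic distributions as features for a scikit-learn classifier.
from gensim import corpora, models, matutils
from sklearn.linear_model import LogisticRegression

train_texts = [["scuba", "diving", "water"], ["carbon", "dioxide", "plants"],
               ["water", "vapor", "diving"], ["green", "plants", "carbon"]]
train_labels = [0, 1, 0, 1]
test_texts = [["scuba", "water"], ["green", "carbon"]]

num_topics = 2
dictionary = corpora.Dictionary(train_texts)
train_bow = [dictionary.doc2bow(doc) for doc in train_texts]
lda = models.LdaModel(train_bow, id2word=dictionary, num_topics=num_topics)

def topic_features(texts):
    bow = [dictionary.doc2bow(doc) for doc in texts]
    # corpus2dense gives (num_topics, num_docs); transpose to docs-as-rows.
    return matutils.corpus2dense(lda[bow], num_terms=num_topics).T

clf = LogisticRegression().fit(topic_features(train_texts), train_labels)
print(clf.predict(topic_features(test_texts)))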

Is there a Python library or tool that analyzes two bodies of text for similarities in order to provide recommendations?

Submitted by 别来无恙 on 2021-02-08 06:38:36
Question: First, apologies for being long-winded. I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback, but my primary question is: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc.) in two bodies of text? If not, how would you…
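For reference, one common baseline for this kind of comparison (a sketch, not the asker's method) is to turn both texts into TF-IDF vectors and take their cosine similarity; the two sample strings below are illustrative.

# Sketch: TF-IDF vectors + cosine similarity as a simple text-similarity baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text_a = "Scuba diving requires compressed air and careful planning."
text_b = "Planning a dive means checking your air supply and equipment."

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([text_a, text_b])    # shape (2, vocabulary_size)
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]   # 0 = no shared wording, 1 = identical
print(f"similarity: {score:.3f}")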

Efficient transformation of gensim TransformedCorpus data to array

Submitted by 懵懂的女人 on 2021-02-07 10:13:30
Question: Is there a more direct or efficient method for getting the topic-probability data from a gensim.interfaces.TransformedCorpus object into a numpy array (or, alternatively, a pandas dataframe) than the by-row method below?

from gensim import models
import numpy as np

num_topics = 5
model = models.LdaMulticore(corpus, num_topics=num_topics, minimum_probability=0.0)
all_topics = model.get_document_topics(corpus)
num_docs = len(all_topics)
lda_scores = np.empty([num_docs, num_topics])
for i in …
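One more direct route (a sketch, reusing model, corpus and num_topics from the snippet above, and not necessarily the answer the asker accepted) is gensim's matutils.corpus2dense helper, which converts any streamed corpus of (id, weight) pairs into a dense matrix in a single call.

# Sketch: TransformedCorpus of topic probabilities -> numpy array in one call.
from gensim import matutils

all_topics = model.get_document_topics(corpus, minimum_probability=0.0)
# corpus2dense returns shape (num_topics, num_docs); transpose for docs-as-rows.
lda_scores = matutils.corpus2dense(all_topics, num_terms=num_topics).T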

python IndexError using gensim for LDA Topic Modeling

Submitted by ε祈祈猫儿з on 2021-02-07 09:28:34
Question: Another thread has a similar question to mine but leaves out reproducible code. The goal of the script in question is to create a process that is as memory efficient as possible, so I tried to write the corpus() class to take advantage of gensim's streaming capabilities. However, I am running into an IndexError that I'm not sure how to resolve when creating lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)). The documents that I am…
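The traceback and the asker's corpus() class are cut off above, but the usual shape of a memory-efficient gensim corpus is a class whose __iter__ streams one doc2bow vector at a time, built against the same Dictionary that is later handed to LdaModel; a mismatch between the two is one common cause of IndexError inside LdaModel. A minimal sketch with illustrative file and variable names (not the asker's actual code):

# Sketch: a streaming corpus that never holds all documents in memory.
# "documents.txt" (one whitespace-tokenised document per line) is an assumed input file.
from gensim import corpora, models

class StreamingCorpus:
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.split())

dictionary = corpora.Dictionary(line.split() for line in open("documents.txt"))
corpus = StreamingCorpus("documents.txt", dictionary)

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# id2word must be the same dictionary used to build the bag-of-words vectors;
# otherwise LdaModel can index past its vocabulary and raise IndexError.
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)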

How to abstract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python gensim?

Submitted by 荒凉一梦 on 2021-02-06 09:26:09
Question: LDA original output (unigrams):
topic1: scuba, water, vapor, diving
topic2: dioxide, plants, green, carbon

Required output (bigram topics):
topic1: scuba diving, water vapor
topic2: green plants, carbon dioxide

Any idea?

Answer 1: Given I have a dict called docs, containing lists of words from documents, I can turn it into an array of words + bigrams (or also trigrams etc.) using nltk.util.ngrams or your own function, like this:

from nltk.util import ngrams
for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w…
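The answer's code is cut off mid-expression above; a hedged completion of the same nltk.util.ngrams idea (appending underscore-joined bigrams to each document's token list before building the gensim dictionary and corpus) might look like this, with the two small documents below as illustrative data:

# Sketch: append underscore-joined bigrams to each document, then model as usual.
from nltk.util import ngrams
from gensim import corpora, models

docs = {
    "d1": ["scuba", "diving", "water", "vapor"],
    "d2": ["green", "plants", "carbon", "dioxide"],
}

for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]

dictionary = corpora.Dictionary(docs.values())
corpus = [dictionary.doc2bow(tokens) for tokens in docs.values()]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics(num_topics=2, num_words=4))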

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

Submitted by 我的梦境 on 2021-02-06 02:35:47
Question: I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length?

Answer 1: @user3907335 is exactly correct here: HDP will calculate as many topics as the…
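The answer is cut off above, but the visible point is that gensim's HdpModel reports topics up to a fixed truncation level (150 in the output shown) rather than a number inferred from the corpus. One heuristic, hedged way to gauge how many of those topics actually matter is to aggregate per-document topic weights and count the topics above a small threshold; the sketch below reuses hdp and corpusA from the snippet above, and the 0.001 cutoff is an arbitrary illustrative choice.

# Sketch (heuristic): count HDP topics that receive non-trivial probability mass.
import numpy as np
from gensim import matutils

num_printed = 150  # the number of topics reported above
doc_topics = matutils.corpus2dense(hdp[corpusA], num_terms=num_printed).T  # docs x topics
topic_mass = doc_topics.sum(axis=0)                      # total weight per topic
active = int(np.sum(topic_mass / topic_mass.sum() > 0.001))
print(f"{active} of {num_printed} topics carry meaningful weight")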