lda

How to plot classification borders on a Linear Discriminant Analysis plot in R

Submitted by 有些话、适合烂在心里 on 2021-02-16 08:54:47
Question: I have used a linear discriminant analysis (LDA) to investigate how well a set of variables discriminates between 3 groups. I then used the plot.lda() function to plot my data on the two linear discriminants (LD1 on the x-axis and LD2 on the y-axis). I would now like to add the classification borders from the LDA to the plot, but I cannot see an argument in the function that allows this. The partimat() function allows visualisation of the LD classification borders, but variables are used as the x…
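This entry concerns R's plot.lda() and partimat(); since the code elsewhere on this page is Python, the sketch below is only a rough Python/scikit-learn analogue of the same idea (all names and data are illustrative, not taken from the question): project the data onto LD1/LD2, refit a classifier in that 2-D space, and shade its predictions over a grid to reveal the class borders.

# Sketch: class borders in the LD1/LD2 plane (Python analogue of the R question above).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                    # any 3-class dataset works
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)                                 # coordinates on LD1/LD2

# Refit an LDA in the 2-D discriminant space so predictions can be made on a grid.
lda_2d = LinearDiscriminantAnalysis().fit(Z, y)
xx, yy = np.meshgrid(np.linspace(Z[:, 0].min() - 1, Z[:, 0].max() + 1, 300),
                     np.linspace(Z[:, 1].min() - 1, Z[:, 1].max() + 1, 300))
regions = lda_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, regions, alpha=0.2)             # shaded regions mark the class borders
plt.scatter(Z[:, 0], Z[:, 1], c=y, edgecolor="k")
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.show()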

Gensim LDA for text classification

Submitted by 99封情书 on 2021-02-10 09:54:10
Question: I post my question here because there are already some answers on how to use scikit methods with gensim, like scikit vectorizers with gensim or this, but I haven't seen the whole pipeline used for text classification. Let me explain my situation a little: I want to use gensim's LDA implementation in order to proceed to text classification. I have one dataset which consists of three parts (train (25K), test (25K) and unlabeled data (50K)). What I am trying to do is…
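A minimal sketch of one such pipeline (not necessarily the asker's final setup): train a gensim LDA model on the training documents, use each document's topic distribution as a feature vector, and fit a scikit-learn classifier on top. The tiny token lists and labels below are illustrative placeholders.

# Sketch: gensim LDA topic distributions as features for a scikit-learn classifier.
from gensim import corpora, models, matutils
from sklearn.linear_model import LogisticRegression

train_texts = [["scuba", "diving", "water"], ["carbon", "dioxide", "plants"],
               ["water", "vapor", "diving"], ["green", "plants", "carbon"]]
train_labels = [0, 1, 0, 1]
test_texts = [["scuba", "water"], ["green", "carbon"]]

num_topics = 2
dictionary = corpora.Dictionary(train_texts)
train_bow = [dictionary.doc2bow(doc) for doc in train_texts]
lda = models.LdaModel(train_bow, id2word=dictionary, num_topics=num_topics)

def topic_features(texts):
    bow = [dictionary.doc2bow(doc) for doc in texts]
    # corpus2dense gives (num_topics, num_docs); transpose to docs-as-rows.
    return matutils.corpus2dense(lda[bow], num_terms=num_topics).T

clf = LogisticRegression().fit(topic_features(train_texts), train_labels)
print(clf.predict(topic_features(test_texts)))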

Is there a Python library or tool that analyzes two bodies of text for similarities in order to provide recommendations?

Submitted by 别来无恙 on 2021-02-08 06:38:36
Question: First, apologies for being long-winded. I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback, but my primary question is: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc.) in two bodies of text? If not, how would you…
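For reference, one common baseline for this kind of comparison (a sketch, not the asker's method) is to turn both texts into TF-IDF vectors and take their cosine similarity; the two sample strings below are illustrative.

# Sketch: TF-IDF vectors + cosine similarity as a simple text-similarity baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text_a = "Scuba diving requires compressed air and careful planning."
text_b = "Planning a dive means checking your air supply and equipment."

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([text_a, text_b])    # shape (2, vocabulary_size)
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]   # 0 = no shared wording, 1 = identical
print(f"similarity: {score:.3f}")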

Efficient transformation of gensim TransformedCorpus data to array

Submitted by 懵懂的女人 on 2021-02-07 10:13:30
Question: Is there a more direct or efficient method for getting the topic-probability data from a gensim.interfaces.TransformedCorpus object into a numpy array (or, alternatively, a pandas dataframe) than the by-row method below?

from gensim import models
import numpy as np

num_topics = 5
model = models.LdaMulticore(corpus, num_topics=num_topics, minimum_probability=0.0)
all_topics = model.get_document_topics(corpus)
num_docs = len(all_topics)
lda_scores = np.empty([num_docs, num_topics])
for i in …
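One more direct route (a sketch, reusing model, corpus and num_topics from the snippet above, and not necessarily the answer the asker accepted) is gensim's matutils.corpus2dense helper, which converts any streamed corpus of (id, weight) pairs into a dense matrix in a single call.

# Sketch: TransformedCorpus of topic probabilities -> numpy array in one call.
from gensim import matutils

all_topics = model.get_document_topics(corpus, minimum_probability=0.0)
# corpus2dense returns shape (num_topics, num_docs); transpose for docs-as-rows.
lda_scores = matutils.corpus2dense(all_topics, num_terms=num_topics).T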

python IndexError using gensim for LDA Topic Modeling

Submitted by ε祈祈猫儿з on 2021-02-07 09:28:34
Question: Another thread has a similar question to mine but leaves out reproducible code. The goal of the script in question is to create a process that is as memory efficient as possible, so I tried to write the corpus() class to take advantage of gensim's streaming capabilities. However, I am running into an IndexError that I'm not sure how to resolve when creating lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)). The documents that I am…
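The traceback and the asker's corpus() class are cut off above, but the usual shape of a memory-efficient gensim corpus is a class whose __iter__ streams one doc2bow vector at a time, built against the same Dictionary that is later handed to LdaModel; a mismatch between the two is one common cause of IndexError inside LdaModel. A minimal sketch with illustrative file and variable names (not the asker's actual code):

# Sketch: a streaming corpus that never holds all documents in memory.
# "documents.txt" (one whitespace-tokenised document per line) is an assumed input file.
from gensim import corpora, models

class StreamingCorpus:
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.split())

dictionary = corpora.Dictionary(line.split() for line in open("documents.txt"))
corpus = StreamingCorpus("documents.txt", dictionary)

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# id2word must be the same dictionary used to build the bag-of-words vectors;
# otherwise LdaModel can index past its vocabulary and raise IndexError.
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)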

How to abstract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python gensim?

Submitted by 荒凉一梦 on 2021-02-06 09:26:09
Question: LDA original output (unigrams):
topic1: scuba, water, vapor, diving
topic2: dioxide, plants, green, carbon

Required output (bigram topics):
topic1: scuba diving, water vapor
topic2: green plants, carbon dioxide

Any idea?

Answer 1: Given I have a dict called docs, containing lists of words from documents, I can turn it into an array of words + bigrams (or also trigrams etc.) using nltk.util.ngrams or your own function, like this:

from nltk.util import ngrams
for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w…
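The answer's code is cut off mid-expression above; a hedged completion of the same nltk.util.ngrams idea (appending underscore-joined bigrams to each document's token list before building the gensim dictionary and corpus) might look like this, with the two small documents below as illustrative data:

# Sketch: append underscore-joined bigrams to each document, then model as usual.
from nltk.util import ngrams
from gensim import corpora, models

docs = {
    "d1": ["scuba", "diving", "water", "vapor"],
    "d2": ["green", "plants", "carbon", "dioxide"],
}

for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]

dictionary = corpora.Dictionary(docs.values())
corpus = [dictionary.doc2bow(tokens) for tokens in docs.values()]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics(num_topics=2, num_words=4))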

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

Submitted by 我的梦境 on 2021-02-06 02:35:47
Question: I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length?

Answer 1: @user3907335 is exactly correct here: HDP will calculate as many topics as the…
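The answer is cut off above, but the visible point is that gensim's HdpModel reports topics up to a fixed truncation level (150 in the output shown) rather than a number inferred from the corpus. One heuristic, hedged way to gauge how many of those topics actually matter is to aggregate per-document topic weights and count the topics above a small threshold; the sketch below reuses hdp and corpusA from the snippet above, and the 0.001 cutoff is an arbitrary illustrative choice.

# Sketch (heuristic): count HDP topics that receive non-trivial probability mass.
import numpy as np
from gensim import matutils

num_printed = 150  # the number of topics reported above
doc_topics = matutils.corpus2dense(hdp[corpusA], num_terms=num_printed).T  # docs x topics
topic_mass = doc_topics.sum(axis=0)                      # total weight per topic
active = int(np.sum(topic_mass / topic_mass.sum() > 0.001))
print(f"{active} of {num_printed} topics carry meaningful weight")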