LDA

How do you initialize a gensim corpus variable with a csr_matrix?

Submitted by 时间秒杀一切 on 2019-11-29 09:40:26
Question: I have X as a csr_matrix that I obtained using scikit-learn's tf-idf vectorizer, and y, which is an array. My plan is to create features using LDA; however, I could not find how to initialize a gensim corpus variable with X as a csr_matrix. In other words, I don't want to download a corpus as shown in gensim's documentation, nor convert X to a dense matrix, since that would consume a lot of memory and could hang the computer. In short, my question is: how do you initialize a gensim…
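Not part of the original question, but the recipe people usually point to is gensim's matutils.Sparse2Corpus, which streams a SciPy CSR matrix without ever densifying it. A minimal sketch, assuming X is the document-by-term matrix returned by a TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import matutils

docs = ["human machine interface", "a survey of user opinion", "graph minors survey"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)              # csr_matrix of shape (n_docs, n_terms)

# documents_columns=False tells gensim the rows of X are documents, not the columns.
corpus = matutils.Sparse2Corpus(X, documents_columns=False)
print(list(corpus)[0])                          # [(term_id, tfidf_weight), ...] for the first document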

Use scikit-learn TfIdf with gensim LDA

Submitted by 亡梦爱人 on 2019-11-29 04:27:01
I've used various versions of TF-IDF in scikit-learn to model some text data: vectorizer = TfidfVectorizer(min_df=1, stop_words='english'). The resulting data X is in this format: <rows x columns sparse matrix of type '<type 'numpy.float64'>' with xyz stored elements in Compressed Sparse Row format>. I wanted to experiment with LDA as a way to reduce the dimensionality of my sparse matrix. Is there a simple way to feed the SciPy sparse matrix X into a gensim LDA model, i.e. lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)? I can ignore scikit-learn and go the way the gensim…
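A hedged sketch of the whole pipeline (my own example, not taken from the post): wrap X with Sparse2Corpus and build the id2word mapping from the vectorizer's vocabulary_ so LdaModel can label its topics:

from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import matutils
from gensim.models import LdaModel

docs = ["human machine interface for lab computer applications",
        "a survey of user opinion of computer system response time",
        "graph minors and a survey of trees"]

vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(docs)                      # csr_matrix, documents in rows

corpus = matutils.Sparse2Corpus(X, documents_columns=False)
# gensim wants an id -> token mapping; invert the vectorizer's token -> id vocabulary_.
id2word = {i: w for w, i in vectorizer.vocabulary_.items()}

lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
print(lda.print_topics(num_topics=2, num_words=5))

Strictly speaking LDA is defined over term counts, so many people feed a CountVectorizer matrix rather than tf-idf weights; gensim does not reject float weights, but counts are the textbook input.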

Topic Modeling: How do I use my fitted LDA model to predict new topics for a new dataset in R?

Submitted by 假如想象 on 2019-11-29 03:51:33
Question: I am using the 'lda' package in R for topic modeling. I want to predict new topics (collections of related words in a document) for a new dataset using a fitted Latent Dirichlet Allocation (LDA) model. In the process, I came across the predictive.distribution() function, but it takes document_sums as an input parameter, which is an output obtained only after fitting a new model. I need help understanding how to use an existing model on a new dataset to predict topics. Here is the example code…
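The question concerns the R 'lda' package; purely as a cross-language comparison (my addition, not the answer), the same "score unseen documents with an already-fitted model" step in gensim looks roughly like this, with get_document_topics playing the role of a predictive distribution over topics:

from gensim import corpora
from gensim.models import LdaModel

train = ["human machine interface", "a survey of user opinion",
         "graph minors and trees", "the intersection graph of paths in trees"]
texts = [doc.lower().split() for doc in train]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Score an unseen document against the already-fitted model.
new_doc = "a survey of graph minors"
bow = dictionary.doc2bow(new_doc.lower().split())
print(lda.get_document_topics(bow, minimum_probability=0.0))   # [(topic_id, probability), ...]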

Document topical distribution in Gensim LDA

Submitted by 不问归期 on 2019-11-29 03:19:44
I've derived an LDA topic model using a toy corpus as follows: documents = ['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey'] texts = [[word for word in document.lower().split()] for…
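The excerpt is cut off, but continuing from the documents list above, the per-document topic mixture is obtained by pushing each bag-of-words vector through the fitted model; a minimal sketch (parameter values are my own):

from gensim import corpora
from gensim.models import LdaModel

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)

for i, bow in enumerate(corpus):
    # minimum_probability=0 reports every topic, even those with near-zero weight.
    print(i, lda.get_document_topics(bow, minimum_probability=0.0))

lda[bow] returns the same distribution but omits topics below the model's minimum_probability cutoff.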

How does the removeSparseTerms in R work?

Submitted by 瘦欲@ on 2019-11-28 17:18:47
I am using the removeSparseTerms method in R, and it requires a threshold value as input. I have also read that the higher the value, the more terms are retained in the returned matrix. How does this method work, and what is the logic behind it? I understand the concept of sparseness, but does this threshold indicate how many documents a term should be present in, or some other ratio? In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the threshold of relative document frequency for a term, above which the term will be removed. Relative document…
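As a rough cross-language illustration only (my addition, not part of the answer): scikit-learn's CountVectorizer exposes the same idea through min_df, a relative document-frequency floor below which a term is dropped; removeSparseTerms(dtm, 0.99) corresponds loosely to min_df=0.01:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "apple cherry", "apple banana cherry", "durian"]
vec = CountVectorizer(min_df=0.5)     # keep only terms present in at least 50% of the documents
dtm = vec.fit_transform(docs)
print(vec.get_feature_names_out())    # 'durian' (1 of 4 documents) is dropped; the rest survive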

Predicting LDA topics for new data

Submitted by 谁说胖子不能爱 on 2019-11-28 16:45:19
It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the questions asked, as indicated by the comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed those questions would not be getting any new answers. Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data. Ultimately I would like to extract a smaller set of topics from a very large bag of words and build a…
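Since the question is not tied to one library, here is a hedged scikit-learn sketch of the workflow described: fit LDA on a training bag of words, then transform new documents into the same topic space so the topic proportions serve as low-dimensional features (all names and data are my own):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["human machine interface", "user opinion of system response time",
              "graph minors and trees", "intersection graph of paths in trees"]
new_docs = ["response time of the interface", "random trees and graph minors"]

vec = CountVectorizer(stop_words='english')
X_train = vec.fit_transform(train_docs)            # fit the vocabulary on training data only
X_new = vec.transform(new_docs)                    # new docs mapped into the same vocabulary

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)
topic_features = lda.transform(X_new)              # (n_new_docs, n_topics) topic proportions
print(topic_features)

The transformed rows can then be fed to any downstream classifier in place of the raw term counts.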

Python Gensim: how to calculate document similarity using the LDA model?

Submitted by 前提是你 on 2019-11-28 16:30:26
Question: I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks! Answer 1: I don't know if this will help, but I managed to get good results on document matching and similarity when using the actual document as a query: dictionary = corpora.Dictionary.load('dictionary.dict') corpus = corpora…
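The answer is truncated; a common completion of this recipe (a sketch assuming the dictionary, corpus, and model were all saved at training time, with the file names beyond dictionary.dict being my own guesses) builds a similarity index over the documents' topic vectors:

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus('corpus.mm')                 # assumed file name
lda = models.LdaModel.load('model.lda')                # assumed file name

# Index every training document by its LDA topic vector.
index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)

query = "human computer interaction and user response time"
query_lda = lda[dictionary.doc2bow(query.lower().split())]
sims = sorted(enumerate(index[query_lda]), key=lambda item: -item[1])
print(sims[:5])                                        # (doc_id, cosine similarity) pairs, best match first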

LDA

Submitted by 痞子三分冷 on 2019-11-28 09:51:06
Introduction to LDA: Here LDA stands for Linear Discriminant Analysis, a supervised learning method. Because it was proposed by Fisher in 1936, it is also called Fisher's Linear Discriminant. LDA is typically used as a dimensionality-reduction technique in the data-preprocessing stage: its goal is to project the data into a lower-dimensional space so as to avoid the overfitting caused by the curse of dimensionality while still preserving good class separability.

Deriving LDA: After feature extraction we often need to reduce dimensionality. To explain the principle, first simplify the problem: suppose there are two classes of samples in a two-dimensional feature space. For a given data set, the goal is to project the samples onto a line, but there are countless possible projections, so which one should we choose?

Since the task ultimately serves classification, we want the projected samples to be separated as well as possible, and the simplest measure of separation between the classes is the distance between the projected class means. A natural choice is therefore to let the center of each class represent where that class lies in the space. Consider a two-class problem: the mean vectors of the two classes are m_k = (1/N_k) * sum_{n in C_k} x_n, for k = 1, 2. We then pick the projection that makes the distance between the projected centers as large as possible, i.e. we maximize m2' - m1' = w^T (m2 - m1), where mk' = w^T mk is the mean of the projected data from class C_k and w is the projection vector. However, this expression can be made arbitrarily large simply by increasing the magnitude of w. To solve this, we constrain w to unit length, i.e. sum_i w_i^2 = 1. Solving this constrained maximization with a Lagrange multiplier, we find that w is proportional to (m2 - m1).
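To make the criterion above concrete, here is a small NumPy sketch (my own illustration, not part of the original article): it projects two synthetic classes onto the unit vector along m2 - m1, i.e. exactly the solution just derived:

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))    # class 1 samples
X2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(100, 2))    # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                    # class mean vectors
w = (m2 - m1) / np.linalg.norm(m2 - m1)                      # unit-length projection vector

z1, z2 = X1 @ w, X2 @ w                                      # 1-D projected samples
print("projected class means:", z1.mean(), z2.mean())

The full Fisher criterion goes one step further and also accounts for the within-class scatter, giving w proportional to S_W^{-1}(m2 - m1); the excerpt breaks off before that step.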

How to print the LDA topic models from gensim? Python

Submitted by 半世苍凉 on 2019-11-28 04:01:17
Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA models? When printing lda.print_topics(10), the code gave the following error because print_topics() returns a NoneType: Traceback (most recent call last): File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module> for top in lda.print_topics(2): TypeError: 'NoneType' object is not iterable. The code: from gensim import corpora, models, similarities from gensim.models import hdpmodel, ldamodel from itertools import izip documents = ["Human machine…
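The excerpt stops before the fix. Judging from the traceback, that gensim version's print_topics returned None (writing its output to the logger instead); show_topics returns the formatted topic strings directly, and current gensim versions return them from print_topics as well. A hedged sketch using current keyword names:

from gensim import corpora
from gensim.models import LdaModel

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# show_topics returns the formatted topic strings instead of only logging them.
for topic in lda.show_topics(num_topics=2, num_words=5):
    print(topic)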
