lda

Simple Python implementation of collaborative topic modeling?

不羁岁月 submitted on 2019-12-02 17:06:29
I came across two papers that combine collaborative filtering (matrix factorization) and topic modelling (LDA) to recommend articles/posts to users based on the topic terms of the posts/articles those users are interested in. The papers (in PDF) are "Collaborative Topic Modeling for Recommending Scientific Articles" and "Collaborative Topic Modeling for Recommending GitHub Repositories". The combined algorithm is called collaborative topic regression. I was hoping to find some Python code that implements it, but to no avail. This might be a long shot, but can someone show a simple Python
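For what it's worth, a minimal NumPy sketch of the CTR idea might look like the following. It is not the papers' algorithm (the confidence weighting and the coupled LDA inference are omitted), the theta and R matrices are synthetic stand-ins for real LDA output and a real ratings matrix, and all sizes and regularization values are illustrative.

```python
# Sketch of collaborative-topic-regression-style factorization:
# item vectors are regularized toward LDA topic proportions, user and item
# vectors are fit by alternating ridge (least-squares) updates.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_topics = 50, 40, 8

# Stand-in for LDA output: one row of topic proportions (theta) per item.
theta = rng.dirichlet(np.ones(n_topics), size=n_items)

# Synthetic binary "user liked item" matrix.
R = (rng.random((n_users, n_items)) < 0.1).astype(float)

lambda_u, lambda_v = 0.1, 1.0         # regularization strengths (illustrative)
U = rng.normal(scale=0.1, size=(n_users, n_topics))
V = theta.copy()                      # items start at their topic proportions

for _ in range(20):
    # User update: ridge regression of each user's ratings on the item factors.
    A = V.T @ V + lambda_u * np.eye(n_topics)
    U = np.linalg.solve(A, V.T @ R.T).T
    # Item update: ridge regression pulled toward theta -- the CTR ingredient.
    B = U.T @ U + lambda_v * np.eye(n_topics)
    V = np.linalg.solve(B, U.T @ R + lambda_v * theta.T).T

scores = U @ V.T                      # predicted affinities; rank these per user
print(scores.shape)                   # (50, 40)
```

The part that distinguishes this from plain matrix factorization is the lambda_v * theta term pulling each item's latent vector toward its topic proportions, which is what lets items with few ratings fall back on their content.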

Inefficiency of topic modelling for text clustering

青春壹個敷衍的年華 submitted on 2019-12-02 12:33:14
I tried doing text clustering using LDA, but it isn't giving me distinct clusters. Below is my code:

#Import libraries
from gensim import corpora, models
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS
from itertools import chain

#stop words
stoplist = list(STOPWORDS)
new = ['education', 'certification', 'certificate', 'certified']
stoplist.extend(new)
stoplist.sort()

#read data
dat = pd.read_csv('D:\data_800k.csv', encoding='latin').Certi.tolist()

#remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in dat]

#dictionary
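A hedged sketch of how a gensim pipeline typically continues from here (dictionary, bag-of-words corpus, LdaModel, then taking each document's dominant topic as its "cluster") follows; the num_topics, passes, and filter_extremes settings are illustrative, and the tiny texts list stands in for the real data.

```python
# Sketch of the remaining gensim steps, assuming `texts` is the tokenized,
# stopword-filtered list of documents built above.
from gensim import corpora, models

texts = [["certified", "java", "developer"],        # tiny stand-in data
         ["python", "machine", "learning"],
         ["java", "programming", "course"]]

dictionary = corpora.Dictionary(texts)
# dictionary.filter_extremes(no_below=5, no_above=0.5)  # usually helps on large data
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=5, random_state=1)

# Treat the highest-probability topic of each document as its cluster label.
for i, bow in enumerate(corpus):
    topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
    best_topic = max(topic_dist, key=lambda pair: pair[1])[0]
    print(i, best_topic)
```

If the clusters still look indistinct, the usual levers are heavier vocabulary pruning (filter_extremes), more passes, and tuning num_topics.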

Memory error in python using numpy array

∥☆過路亽.° submitted on 2019-12-02 11:11:15
I am getting the following error for this code:

model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
model.fit(X)
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print("\n")
n = 15
doc_topic = model.doc_topic_
for i in range(15):
    print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax()))
    topic_csharp = np.zeros(shape=[1, n])
    np.copyto(topic_csharp, doc_topic[0][i])
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'
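Two things stand out: doc_topic[0][i] is a single probability, so calling .argmax() on it always returns 0 and was probably meant to be doc_topic[i].argmax(); and a MemoryError with this package most often comes from a dense document-term matrix. A hedged sketch with the indexing fixed and a sparse CountVectorizer matrix (the docs and sizes below are made up, and as far as I can tell the lda package accepts scipy sparse input) is:

```python
# doc_topic[i] is the topic distribution of document i, so argmax over that
# row gives the top topic. The document-term matrix stays sparse to keep
# memory usage down.
import numpy as np
import lda
from sklearn.feature_extraction.text import CountVectorizer

docs = ["c# developer certification", "java topic modelling", "python lda demo"]
vec = CountVectorizer()
X = vec.fit_transform(docs)                 # scipy.sparse CSR matrix of counts
vocab = np.array(vec.get_feature_names_out())

model = lda.LDA(n_topics=2, n_iter=50, random_state=1)
model.fit(X)

doc_topic = model.doc_topic_                # shape (n_docs, n_topics)
for i, title in enumerate(docs):
    print("{} (top topic: {})".format(title, doc_topic[i].argmax()))

n_top_words = 5
for k, topic_dist in enumerate(model.topic_word_):
    top = vocab[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    print("*Topic {}\n- {}".format(k, " ".join(top)))
```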

How to get topic vector of new documents and compare with pre-defined topic model in Mallet?

泄露秘密 submitted on 2019-12-01 21:48:20
I'm trying to compare a single document's topic distribution (using LDA) with other files and their topic distributions within a previously created topic model, using MALLET. I know this can be done through MALLET commands in the terminal, but I'm having trouble finding a way to implement it in Java. To give a gist of my program's functionality: the topic model was already created from a large corpus of texts. I want to use it to compare topic distributions with a tweet that contains a certain hashtag, and then pull out the file most similar to the tweet
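Only the comparison step is easy to show without committing to a particular MALLET API call. Assuming the tweet's topic vector and the corpus documents' topic vectors have already been inferred on the Java side, ranking by a distance between distributions looks like the sketch below; the vectors and file names are made-up placeholders, and Jensen-Shannon distance is one reasonable choice among several (cosine similarity is another).

```python
# Rank stored documents by how close their topic distributions are to a new
# document's distribution. The numbers are placeholders for inferred vectors.
import numpy as np
from scipy.spatial.distance import jensenshannon

new_doc_topics = np.array([0.70, 0.10, 0.15, 0.05])      # inferred for the tweet
corpus_topics = {
    "file_a.txt": np.array([0.65, 0.15, 0.10, 0.10]),
    "file_b.txt": np.array([0.05, 0.80, 0.10, 0.05]),
    "file_c.txt": np.array([0.25, 0.25, 0.25, 0.25]),
}

# Smaller Jensen-Shannon distance = more similar topic mixture.
ranked = sorted(corpus_topics.items(),
                key=lambda kv: jensenshannon(new_doc_topics, kv[1]))
for name, vec in ranked:
    print(name, round(float(jensenshannon(new_doc_topics, vec)), 4))
print("most similar:", ranked[0][0])
```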

How to reproduce exact results with LDA function in R's topicmodels package

*爱你&永不变心* submitted on 2019-12-01 09:29:35
I've been unable to create reproducible results from topicmodels' LDA function. To take an example from their documentation:

library(topicmodels)
set.seed(0)
lda1 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)
set.seed(0)
lda2 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)
identical(lda1, lda2)
# [1] FALSE

How can I get identical results from two separate calls to LDA? As an aside (in case the package authors are on here), I find the control=list(seed=0) snippet unfortunate and unnecessary. Behind the scenes, there's a line for if (missing(seed)) seed <- as.integer(Sys
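Not an answer to the topicmodels question itself, but for comparison, the analogous reproducibility check in Python's gensim is below: with a fixed random_state, two single-process LdaModel runs on the same corpus should produce the same topic-word matrix. The toy corpus here is illustrative.

```python
# Reproducibility check for LDA in gensim (Python analogue, not topicmodels/R).
import numpy as np
from gensim import corpora, models

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda1 = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
lda2 = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

print(np.allclose(lda1.get_topics(), lda2.get_topics()))  # expected: True
```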

In R tm package, build corpus FROM Document-Term-Matrix

断了今生、忘了曾经 submitted on 2019-12-01 06:35:30
It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to go the other way and build a corpus from a document-term matrix. Let M be the number of documents in a document set and V the number of terms in its vocabulary; then a document-term matrix is an M*V matrix. I also have a vocabulary vector of length V, which holds the words represented by the column indices of the document-term matrix. From the dtm and vocabulary vector, I'd like to construct a "corpus" object, because I'd like to stem my document set. I built my dtm and vocab
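The underlying reconstruction isn't tm-specific: each row of the DTM plus the vocabulary yields a bag-of-words pseudo-document (word order is lost, since a DTM only stores counts), which can then be stemmed and re-tokenized. A minimal Python sketch with made-up data:

```python
# Rebuild bag-of-words pseudo-documents from a document-term matrix and its
# vocabulary. Only counts survive in a DTM, so word order cannot be recovered.
import numpy as np

vocab = np.array(["apple", "banana", "cherry"])          # length V, made up
dtm = np.array([[2, 0, 1],                               # M x V counts, made up
                [0, 3, 1]])

pseudo_docs = [" ".join(np.repeat(vocab, row)) for row in dtm]
print(pseudo_docs)   # ['apple apple cherry', 'banana banana banana cherry']
```

The same repeat-each-term-by-its-count idea carries over directly to R, after which the pseudo-documents can be wrapped back into a corpus for stemming.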
