lda

Simple Python implementation of collaborative topic modeling?

不羁岁月 submitted on 2019-12-02 17:06:29
I came across two papers that combine collaborative filtering (matrix factorization) and topic modelling (LDA) to recommend articles/posts to users based on the topic terms of the posts/articles those users are interested in. The papers (in PDF) are "Collaborative Topic Modeling for Recommending Scientific Articles" and "Collaborative Topic Modeling for Recommending GitHub Repositories". The combined algorithm is called collaborative topic regression. I was hoping to find some Python code that implements it, but to no avail. This might be a long shot, but can someone show a simple Python
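For what it's worth, a minimal NumPy sketch of the CTR idea might look like the following. It is not the papers' algorithm (the confidence weighting and the coupled LDA inference are omitted), the theta and R matrices are synthetic stand-ins for real LDA output and a real ratings matrix, and all sizes and regularization values are illustrative.

```python
# Sketch of collaborative-topic-regression-style factorization:
# item vectors are regularized toward LDA topic proportions, user and item
# vectors are fit by alternating ridge (least-squares) updates.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_topics = 50, 40, 8

# Stand-in for LDA output: one row of topic proportions (theta) per item.
theta = rng.dirichlet(np.ones(n_topics), size=n_items)

# Synthetic binary "user liked item" matrix.
R = (rng.random((n_users, n_items)) < 0.1).astype(float)

lambda_u, lambda_v = 0.1, 1.0         # regularization strengths (illustrative)
U = rng.normal(scale=0.1, size=(n_users, n_topics))
V = theta.copy()                      # items start at their topic proportions

for _ in range(20):
    # User update: ridge regression of each user's ratings on the item factors.
    A = V.T @ V + lambda_u * np.eye(n_topics)
    U = np.linalg.solve(A, V.T @ R.T).T
    # Item update: ridge regression pulled toward theta -- the CTR ingredient.
    B = U.T @ U + lambda_v * np.eye(n_topics)
    V = np.linalg.solve(B, U.T @ R + lambda_v * theta.T).T

scores = U @ V.T                      # predicted affinities; rank these per user
print(scores.shape)                   # (50, 40)
```

The part that distinguishes this from plain matrix factorization is the lambda_v * theta term pulling each item's latent vector toward its topic proportions, which is what lets items with few ratings fall back on their content.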

Inefficiency of topic modelling for text clustering

青春壹個敷衍的年華 submitted on 2019-12-02 12:33:14
I tried doing text clustering using LDA, but it isn't giving me distinct clusters. Below is my code:

#Import libraries
from gensim import corpora, models
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS
from itertools import chain

#stop words
stoplist = list(STOPWORDS)
new = ['education', 'certification', 'certificate', 'certified']
stoplist.extend(new)
stoplist.sort()

#read data
dat = pd.read_csv('D:\data_800k.csv', encoding='latin').Certi.tolist()

#remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in dat]

#dictionary
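A hedged sketch of how a gensim pipeline typically continues from here (dictionary, bag-of-words corpus, LdaModel, then taking each document's dominant topic as its "cluster") follows; the num_topics, passes, and filter_extremes settings are illustrative, and the tiny texts list stands in for the real data.

```python
# Sketch of the remaining gensim steps, assuming `texts` is the tokenized,
# stopword-filtered list of documents built above.
from gensim import corpora, models

texts = [["certified", "java", "developer"],        # tiny stand-in data
         ["python", "machine", "learning"],
         ["java", "programming", "course"]]

dictionary = corpora.Dictionary(texts)
# dictionary.filter_extremes(no_below=5, no_above=0.5)  # usually helps on large data
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=5, random_state=1)

# Treat the highest-probability topic of each document as its cluster label.
for i, bow in enumerate(corpus):
    topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
    best_topic = max(topic_dist, key=lambda pair: pair[1])[0]
    print(i, best_topic)
```

If the clusters still look indistinct, the usual levers are heavier vocabulary pruning (filter_extremes), more passes, and tuning num_topics.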

Memory error in python using numpy array

∥☆過路亽.° submitted on 2019-12-02 11:11:15
I am getting the following error for this code:

model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
model.fit(X)
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print("\n")
n = 15
doc_topic = model.doc_topic_
for i in range(15):
    print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax()))
    topic_csharp = np.zeros(shape=[1, n])
    np.copyto(topic_csharp, doc_topic[0][i])
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'
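Two things stand out: doc_topic[0][i] is a single probability, so calling .argmax() on it always returns 0 and was probably meant to be doc_topic[i].argmax(); and a MemoryError with this package most often comes from a dense document-term matrix. A hedged sketch with the indexing fixed and a sparse CountVectorizer matrix (the docs and sizes below are made up, and as far as I can tell the lda package accepts scipy sparse input) is:

```python
# doc_topic[i] is the topic distribution of document i, so argmax over that
# row gives the top topic. The document-term matrix stays sparse to keep
# memory usage down.
import numpy as np
import lda
from sklearn.feature_extraction.text import CountVectorizer

docs = ["c# developer certification", "java topic modelling", "python lda demo"]
vec = CountVectorizer()
X = vec.fit_transform(docs)                 # scipy.sparse CSR matrix of counts
vocab = np.array(vec.get_feature_names_out())

model = lda.LDA(n_topics=2, n_iter=50, random_state=1)
model.fit(X)

doc_topic = model.doc_topic_                # shape (n_docs, n_topics)
for i, title in enumerate(docs):
    print("{} (top topic: {})".format(title, doc_topic[i].argmax()))

n_top_words = 5
for k, topic_dist in enumerate(model.topic_word_):
    top = vocab[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    print("*Topic {}\n- {}".format(k, " ".join(top)))
```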

How to get topic vector of new documents and compare with pre-defined topic model in Mallet?

泄露秘密 submitted on 2019-12-01 21:48:20
I'm trying to compare a single document's topic distribution (using LDA) with other files and their topic distributions within a previously created topic model, using MALLET. I know this can be done through MALLET commands in the terminal, but I'm having trouble finding a way to implement it in Java. To give a gist of my program's functionality: the topic model was already created from a large corpus of texts. I want to use it to compare topic distributions with a tweet that contains a certain hashtag, and then pull out the file most similar to the tweet
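Only the comparison step is easy to show without committing to a particular MALLET API call. Assuming the tweet's topic vector and the corpus documents' topic vectors have already been inferred on the Java side, ranking by a distance between distributions looks like the sketch below; the vectors and file names are made-up placeholders, and Jensen-Shannon distance is one reasonable choice among several (cosine similarity is another).

```python
# Rank stored documents by how close their topic distributions are to a new
# document's distribution. The numbers are placeholders for inferred vectors.
import numpy as np
from scipy.spatial.distance import jensenshannon

new_doc_topics = np.array([0.70, 0.10, 0.15, 0.05])      # inferred for the tweet
corpus_topics = {
    "file_a.txt": np.array([0.65, 0.15, 0.10, 0.10]),
    "file_b.txt": np.array([0.05, 0.80, 0.10, 0.05]),
    "file_c.txt": np.array([0.25, 0.25, 0.25, 0.25]),
}

# Smaller Jensen-Shannon distance = more similar topic mixture.
ranked = sorted(corpus_topics.items(),
                key=lambda kv: jensenshannon(new_doc_topics, kv[1]))
for name, vec in ranked:
    print(name, round(float(jensenshannon(new_doc_topics, vec)), 4))
print("most similar:", ranked[0][0])
```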

How to reproduce exact results with LDA function in R's topicmodels package

*爱你&永不变心* submitted on 2019-12-01 09:29:35
I've been unable to create reproducible results from topicmodels' LDA function. To take an example from their documentation:

library(topicmodels)
set.seed(0)
lda1 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)
set.seed(0)
lda2 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)
identical(lda1, lda2)
# [1] FALSE

How can I get identical results from two separate calls to LDA? As an aside (in case the package authors are on here), I find the control=list(seed=0) snippet unfortunate and unnecessary. Behind the scenes, there's a line for if (missing(seed)) seed <- as.integer(Sys
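Not an answer to the topicmodels question itself, but for comparison, the analogous reproducibility check in Python's gensim is below: with a fixed random_state, two single-process LdaModel runs on the same corpus should produce the same topic-word matrix. The toy corpus here is illustrative.

```python
# Reproducibility check for LDA in gensim (Python analogue, not topicmodels/R).
import numpy as np
from gensim import corpora, models

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda1 = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
lda2 = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

print(np.allclose(lda1.get_topics(), lda2.get_topics()))  # expected: True
```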

In R tm package, build corpus FROM Document-Term-Matrix

断了今生、忘了曾经 submitted on 2019-12-01 06:35:30
It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to go the other way and build a corpus from a document-term matrix. Let M be the number of documents in a document set and V the number of terms in its vocabulary; then a document-term matrix is an M*V matrix. I also have a vocabulary vector of length V, which holds the words represented by the column indices of the document-term matrix. From the dtm and vocabulary vector, I'd like to construct a "corpus" object, because I'd like to stem my document set. I built my dtm and vocab
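The underlying reconstruction isn't tm-specific: each row of the DTM plus the vocabulary yields a bag-of-words pseudo-document (word order is lost, since a DTM only stores counts), which can then be stemmed and re-tokenized. A minimal Python sketch with made-up data:

```python
# Rebuild bag-of-words pseudo-documents from a document-term matrix and its
# vocabulary. Only counts survive in a DTM, so word order cannot be recovered.
import numpy as np

vocab = np.array(["apple", "banana", "cherry"])          # length V, made up
dtm = np.array([[2, 0, 1],                               # M x V counts, made up
                [0, 3, 1]])

pseudo_docs = [" ".join(np.repeat(vocab, row)) for row in dtm]
print(pseudo_docs)   # ['apple apple cherry', 'banana banana banana cherry']
```

The same repeat-each-term-by-its-count idea carries over directly to R, after which the pseudo-documents can be wrapped back into a corpus for stemming.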
