gensim

How do you initialize a gensim corpus variable with a csr_matrix?

大城市里の小女人 submitted on 2019-11-30 07:27:42
I have X as a csr_matrix that I obtained using scikit-learn's TF-IDF vectorizer, and y, which is an array. My plan is to create features using LDA; however, I could not find how to initialize a gensim corpus variable with X as a csr_matrix. In other words, I don't want to download a corpus as shown in gensim's documentation, nor convert X to a dense matrix, since that would consume a lot of memory and the computer could hang. In short, my questions are the following: How do you initialize a gensim corpus given that I have a csr_matrix (sparse) representing the whole corpus? How do you use LDA to extract
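One way to approach the first question, sketched under assumptions: gensim treats a corpus as any iterable that yields one document at a time as a list of (term_id, weight) pairs, and it ships gensim.matutils.Sparse2Corpus, which wraps a SciPy sparse matrix in that form without densifying it. Because scikit-learn puts documents in rows, the call would be Sparse2Corpus(X, documents_columns=False). The pure-Python sketch below only illustrates the kind of streaming conversion such a wrapper performs on the three CSR arrays; the function name csr_rows_to_bow and the toy data are hypothetical.

```python
def csr_rows_to_bow(indptr, indices, data):
    """Yield each CSR row as a list of (column_id, value) pairs.

    indptr, indices and data are the three arrays of a CSR matrix
    (for a SciPy csr_matrix X: X.indptr, X.indices, X.data).
    Assumption: rows are documents and columns are terms.
    """
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        # Only the stored (non-zero) entries of the row are visited,
        # so nothing is ever converted to a dense matrix.
        yield [(int(indices[k]), float(data[k])) for k in range(start, end)]


# Hypothetical 3-document, 4-term matrix:
# [[0.5, 0.0, 0.5, 0.0],
#  [0.0, 0.0, 0.0, 1.0],
#  [0.0, 0.7, 0.0, 0.3]]
indptr = [0, 2, 3, 5]
indices = [0, 2, 3, 1, 3]
data = [0.5, 0.5, 1.0, 0.7, 0.3]

corpus = list(csr_rows_to_bow(indptr, indices, data))
print(corpus)
```

With gensim installed, the object returned by Sparse2Corpus(X, documents_columns=False) can be passed as the corpus argument of an LDA model; an id2word mapping built from the vectorizer's vocabulary would probably be needed to get readable topics.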

Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?

若如初见. submitted on 2019-11-30 07:03:26
I am using the pre-trained Google News dataset to get word vectors with the gensim library in Python:
    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
After loading the model, I convert the words of the training review sentences into vectors:
    # reading all sentences from the training file
    with open('restaurantSentences', 'r') as infile:
        x_train = infile.readlines()
    # cleaning sentences
    x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
    train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])
During word2Vec process i

Using word2vec to classify words in categories

五迷三道 submitted on 2019-11-30 05:15:21
BACKGROUND
I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).
    ['john','jay','dan','nathan','bob'] -> 'Names'
    ['yellow', 'red','green'] -> 'Colors'
    ['tokyo','bejing','washington','mumbai'] -> 'Places'
My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if the new input is "purple", then I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.
APPROACH
I did some research and came across Word2vec. This
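One possible reading of the objective above, as a minimal sketch rather than the asker's method: average the word vectors of each category into a centroid and assign a new word to the category whose centroid has the highest cosine similarity. The 3-dimensional vectors below are made up for illustration; with a real model they would be looked up from trained word vectors, and a word missing from the vocabulary would need separate handling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy embeddings standing in for real word2vec vectors.
toy_vectors = {
    "john": [0.9, 0.1, 0.0], "bob": [0.8, 0.2, 0.1],
    "red": [0.1, 0.9, 0.1], "green": [0.0, 0.8, 0.2],
    "tokyo": [0.1, 0.1, 0.9], "mumbai": [0.2, 0.0, 0.8],
    "purple": [0.1, 0.85, 0.15], "calgary": [0.15, 0.05, 0.9],
}
categories = {
    "Names": ["john", "bob"],
    "Colors": ["red", "green"],
    "Places": ["tokyo", "mumbai"],
}
centroids = {
    label: centroid([toy_vectors[word] for word in words])
    for label, words in categories.items()
}


def predict(word):
    """Return the label whose centroid is most similar to the word's vector."""
    vector = toy_vectors[word]
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


print(predict("purple"), predict("calgary"))
```

A nearest-centroid rule is only one option; the same vectors could instead be fed to any standard classifier as features.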

Ensure the gensim generate the same Word2Vec model for different runs on the same data

爷,独闯天下 submitted on 2019-11-30 04:46:35
Question: According to "LDA model generates different topics everytime i train on the same corpus", setting np.random.seed(0) makes the LDA model always initialize and train in exactly the same way. Is it the same for the Word2Vec models from gensim? By setting the random seed to a constant, would different runs on the same dataset produce the same model? Strangely, it is already giving me the same vector across different instances.
    >>> from nltk.corpus import brown
    >>> from gensim.models import

How to remove a word completely from a Word2Vec model in gensim?

北慕城南 submitted on 2019-11-30 01:50:19
Question: Given a model, e.g.
    from gensim.models.word2vec import Word2Vec
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and

How to extract phrases from corpus using gensim

假如想象 submitted on 2019-11-29 23:58:50
To preprocess the corpus I was planning to extract common phrases from it. For this I tried the Phrases model in gensim, but the code below does not give me the desired output.
My code:
    from gensim.models import Phrases
    documents = ["the mayor of new york was there", "machine learning can be useful sometimes"]
    sentence_stream = [doc.split(" ") for doc in documents]
    bigram = Phrases(sentence_stream)
    sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
    print(bigram[sent])
Output:
    [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
But it should come as:
    [u'the
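A likely reason for the output above, offered as a hedged sketch: Phrases defaults to min_count=5 and threshold=10.0, and its default scorer follows the formula (count(a, b) - min_count) * N / (count(a) * count(b)). In the two toy sentences every bigram occurs once, so the score is negative and nothing can pass the threshold. The function bigram_scores below is a hypothetical pure-Python re-implementation for illustration only; it uses the number of distinct unigrams as N, which is a simplification of what gensim counts.

```python
from collections import Counter


def bigram_scores(sentences, min_count):
    """Score every adjacent word pair with the Mikolov-style formula.

    Assumption: N is taken to be the number of distinct unigrams.
    """
    unigrams = Counter(word for sentence in sentences for word in sentence)
    bigrams = Counter(
        pair for sentence in sentences for pair in zip(sentence, sentence[1:])
    )
    vocab_size = len(unigrams)
    return {
        (a, b): (count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), count in bigrams.items()
    }


documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes"]
sentences = [doc.split(" ") for doc in documents]

# With the default min_count of 5, a pair seen once gets a negative score.
default_scores = bigram_scores(sentences, min_count=5)
print(default_scores[("new", "york")])

# Hypothetical extra sentence so that "new york" is seen twice,
# scored with a lower min_count.
more_sentences = sentences + ["new york mayor was present".split(" ")]
relaxed_scores = bigram_scores(more_sentences, min_count=1)
print(relaxed_scores[("new", "york")])
```

In gensim itself, the corresponding experiment would be to pass lower values such as Phrases(sentence_stream, min_count=1, threshold=2) and to train on a corpus in which the phrase occurs more than once.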

Python Gensim: how to calculate document similarity using the LDA model?

自闭症网瘾萝莉.ら submitted on 2019-11-29 20:27:48
I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!
I don't know if this will help, but I got successful results on document matching and similarity when using the actual document as a query.
    dictionary = corpora.Dictionary.load('dictionary.dict')
    corpus = corpora.MmCorpus("corpus.mm")
    lda = models.LdaModel.load("model.lda")  # result from running online lda (training)
    index
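As a complementary sketch for comparing two documents directly: a gensim LDA model maps a bag-of-words document to a sparse list of (topic_id, probability) pairs, and two such distributions can be compared with the Hellinger distance (gensim also provides gensim.matutils.hellinger and gensim.matutils.cossim for this purpose). The pure-Python version below is an illustration only, and the topic distributions are made up rather than taken from a trained model.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two sparse topic distributions.

    p and q are lists of (topic_id, probability) pairs; topics missing
    from a list are treated as having probability 0.
    Returns a value between 0 (identical) and 1 (no shared topics).
    """
    p, q = dict(p), dict(q)
    topics = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2
        for t in topics
    )
    return math.sqrt(total) / math.sqrt(2)


# Hypothetical topic distributions for three documents.
doc_a = [(0, 0.7), (2, 0.3)]
doc_b = [(0, 0.6), (2, 0.4)]
doc_c = [(1, 1.0)]

print(hellinger(doc_a, doc_b), hellinger(doc_a, doc_c))
```

A similarity score, if one is preferred over a distance, can be taken as 1 minus the Hellinger distance.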

gensim LdaMulticore not multiprocessing?

江枫思渺然 submitted on 2019-11-29 10:45:37
When I run gensim's LdaMulticore model on a machine with 12 cores, using:
    lda = LdaMulticore(corpus, num_topics=64, workers=10)
I get a logging message that says:
    using serial LDA version on this node
A few lines later, I see another logging message that says:
    training LDA model using 10 processes
When I run top, I see that 11 python processes have been spawned, but 9 are sleeping, i.e. only one worker is active. The machine has 24 cores and is not overwhelmed by any means. Why isn't LdaMulticore running in parallel mode?
First, make sure you have installed a fast BLAS library, because most of the

How to run tsne on word2vec created from gensim?

故事扮演 submitted on 2019-11-29 10:13:28
Question: I want to visualize a word2vec model created with the gensim library. I tried sklearn, but it seems I need to install a developer version to get it. I tried installing the developer version, but that does not work on my machine. Is it possible to modify this code to visualize a word2vec model? tsne_python
Answer 1: You don't need a developer version of scikit-learn; just install scikit-learn the usual way via pip or conda. To access the word vectors created by word2vec, simply use the word dictionary as

How do you initialize a gensim corpus variable with a csr_matrix?

时间秒杀一切 submitted on 2019-11-29 09:40:26
Question: I have X as a csr_matrix that I obtained using scikit-learn's TF-IDF vectorizer, and y, which is an array. My plan is to create features using LDA; however, I could not find how to initialize a gensim corpus variable with X as a csr_matrix. In other words, I don't want to download a corpus as shown in gensim's documentation, nor convert X to a dense matrix, since that would consume a lot of memory and the computer could hang. In short, my questions are the following: How do you initialize a gensim