gensim

Convert word2vec bin file to text

Submitted by 梦想与她 on 2019-11-28 03:08:12
From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4 GB) is a binary format that is not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c . Supposedly gensim can do this too, but all the tutorials I've found seem to be about converting from text, not the …

How to use Gensim doc2vec with pre-trained word vectors?

Submitted by 北战南征 on 2019-11-28 03:02:27
I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec? Or does doc2vec get its word vectors from the same sentences it uses for paragraph-vector training? Thanks. Answer (gojomo): Note that the "DBOW" (dm=0) training mode doesn't require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode). (Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, …

How to speed up Gensim Word2vec model load time?

Submitted by ﹥>﹥吖頭↗ on 2019-11-27 20:12:50
I'm building a chatbot, so I need to vectorize the user's input using Word2Vec. I'm using Google's pre-trained model with 3 million words (GoogleNews-vectors-negative300). So I load the model using Gensim: import gensim model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long. So what can I do to speed up the load time? I thought about putting each of the 3 million words and their corresponding vectors into a MongoDB database. That would …

Doc2vec: How to get document vectors

Submitted by 本小妞迷上赌 on 2019-11-27 19:49:37
Question: How do I get the document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial. I am using gensim. doc1=["This is a sentence","This is another sentence"] documents1=[doc.strip().split(" ") for doc in doc1 ] model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4) I get AttributeError: 'list' object has no attribute 'words' whenever I run this. Answer 1: If you …

Document topical distribution in Gensim LDA

Submitted by 点点圈 on 2019-11-27 17:30:33
Question: I've derived an LDA topic model using a toy corpus as follows: documents = ['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well …

How to create a word cloud from a corpus in Python?

Submitted by 本秂侑毒 on 2019-11-27 17:27:12
From Creating a subset of words from a corpus in R, the answerer can easily convert a term-document matrix into a word cloud. Is there a similar function in Python libraries that takes either a raw text file, an NLTK corpus, or a Gensim MmCorpus and turns it into a word cloud? The result will look somewhat like this: [image omitted] Here's a blog post that does just that: http://peekaboo-vision.blogspot.com/2012/11/a-wordcloud-in-python.html The whole code is here: https://github.com/amueller/word_cloud Answer (HeadAndTail): from wordcloud import WordCloud, STOPWORDS import matplotlib.pyplot as plt stopwords = set …

Chunkize warning while installing gensim

Submitted by 对着背影说爱祢 on 2019-11-27 13:16:35
Question: I have installed gensim (through pip) in Python. After the installation, I get the following warning: C:\Python27\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial warnings.warn("detected Windows; aliasing chunkize to chunkize_serial") How can I rectify this? I am unable to import word2vec from gensim.models because of this warning. I have the following configuration: Python 2.7, gensim-0.13.4.1, numpy-1.11.3, scipy-0.18.1, pattern …
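The warning is informational, not an error: on Windows gensim falls back from a multiprocessing `chunkize` to a serial one, and the import itself still succeeds. If it gets in the way, one option is to filter it before importing gensim; a sketch:

```python
import warnings

# The filter must be installed before "import gensim" runs, since the
# warning fires at import time. Matching on the exact message keeps other
# UserWarnings visible.
warnings.filterwarnings(
    action='ignore',
    message='detected Windows; aliasing chunkize to chunkize_serial',
    category=UserWarning)
```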

Update gensim word2vec model

Submitted by 只愿长相守 on 2019-11-27 13:07:29
I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set on which I trained the model), I need to update the model with that sentence so that querying it the next time gives some results. I am doing it like this: new_sentence = ['moscow', 'weather', 'cold'] model.train(new_sentence) and it prints these logs: 2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features 2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs …

LDA model generates different topics every time I train on the same corpus

Submitted by 假装没事ソ on 2019-11-27 12:54:09
I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time? And how do I stabilize the topic generation? I'm using this corpus ( http://pastebin.com/WptkKVF0 ) and this list of stopwords ( http://pastebin.com/LL7dqLcj ), and here's my code: from gensim import corpora, models, similarities from gensim.models import hdpmodel, ldamodel from itertools import izip from collections import …

What is the simplest way to get tfidf with pandas dataframe?

Submitted by ⅰ亾dé卋堺 on 2019-11-27 11:41:47
Question: I want to calculate tf-idf for the documents below. I'm using Python and pandas. import pandas as pd df = pd.DataFrame({'docId': [1,2,3], 'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']}) First, I thought I would need to get a word count for each row, so I wrote a simple function: def word_count(sent): word2cnt = dict() for word in sent.split(): if word in word2cnt: word2cnt[word] += 1 else: word2cnt[word] = 1 return word2cnt And then I …