gensim

ImportError: No module named py31compat

最后都变了 - Submitted on 2019-12-31 04:42:49
Question: I am trying to install gensim using sudo -H pip install --upgrade gensim, but it is giving me this error:

    File "setup.py", line 301, in <module>
      include_package_data=True,
    File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
      dist.run_commands()
    File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
      self.run_command(cmd)
    File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python2.7/dist-packages/setuptools/command
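
py31compat lives inside setuptools itself, so this traceback usually points at a stale or broken setuptools rather than at gensim. A hedged first step, assuming the Python 2.7 setup shown above, is to confirm which setuptools the failing interpreter actually sees, then upgrade it before retrying the gensim install:

    # Diagnostic sketch: check which setuptools the failing pip run uses.
    # If it is old, "pip install --upgrade setuptools" before reinstalling
    # gensim is a common fix for the missing py31compat module.
    import setuptools
    print(setuptools.__version__)
    print(setuptools.__file__)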

Failed to load a .bin.gz pre-trained word2vec model

本秂侑毒 - Submitted on 2019-12-25 09:31:07
Question: I'm trying to load the pre-trained word2vec vectors which I found here (https://github.com/mmihaltz/word2vec-GoogleNews-vectors). I used the following command:

    model = gensim.models.KeyedVectors.load_word2vec_format('word2vec.bin.gz', binary=False)

And it throws this error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/deeplearning/anaconda3/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 193, in load_word2vec_format
        header = utils.to_unicode
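
The GoogleNews dump is stored in word2vec's binary format, so loading it with binary=False makes gensim try to parse binary data as text and fail on the header. A minimal sketch of the likely fix, keeping the asker's file name:

    import gensim

    # binary=True matches the .bin.gz format of the GoogleNews vectors;
    # gensim decompresses the .gz file transparently.
    model = gensim.models.KeyedVectors.load_word2vec_format(
        'word2vec.bin.gz', binary=True)
    print(model['king'].shape)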

Pickle load: ImportError: No module named doc2vec_ext

北城余情 - Submitted on 2019-12-25 08:49:59
Question: This is the structure I'm dealing with:

    src/
      processing/
        station_level/
          train_paragraph_vectors.py
      doc2vec_ext.py
      word_embeddings_station_level.py

I have trained and stored a model in word_embeddings_station_level.py like this:

    from src.doc2vec_ext import WeightedDoc2Vec
    # ...
    model = WeightedDoc2Vec(
        # ...
    )
    train(model, vocab, station_sentences, num_epochs)

    # Saving the model -> pickles it
    model.save(open(model_file, "w"))

This is working fine so far. However, I want to load that model in
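
Pickle does not store class definitions, only the dotted path of the defining module ("src.doc2vec_ext.WeightedDoc2Vec" here), so unpickling from another script raises ImportError whenever that exact import does not resolve. A minimal sketch, assuming the loading script sits under src/processing/station_level/ and that WeightedDoc2Vec keeps gensim's save/load interface:

    import os
    import sys

    # Put the project root (the folder that contains src/) on sys.path so
    # that "import src.doc2vec_ext" resolves exactly as it did at save time.
    # The three levels of '..' are hypothetical; adjust to your layout.
    PROJECT_ROOT = os.path.abspath(
        os.path.join(os.path.dirname(__file__), '..', '..', '..'))
    sys.path.insert(0, PROJECT_ROOT)

    from src.doc2vec_ext import WeightedDoc2Vec  # must succeed before unpickling
    model = WeightedDoc2Vec.load(model_file)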

Gensim LDA Multicore Python script runs much too slow

南楼画角 - Submitted on 2019-12-24 20:59:57
Question: I'm running the following Python script on a large dataset (around 100,000 items). Currently the execution is unacceptably slow; it would probably take at least a month to finish (no exaggeration). Obviously I would like it to run faster. I've added a comment below to highlight where I think the bottleneck is. I have written my own database functions, which are imported. Any help is appreciated!

    # -*- coding: utf-8 -*-
    import database
    from gensim import corpora, models, similarities, matutils
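
The script itself is cut off here, but with LDA at this scale the usual culprits are per-item database round trips and repeated model passes inside a Python loop. A hedged sketch of the streaming shape LdaMulticore is designed for; database.fetch_texts() is a hypothetical stand-in for the asker's own database functions:

    from gensim import corpora, models
    import database  # the asker's own module

    # Pull everything once, tokenize, and build the bag-of-words corpus
    # up front instead of querying the database inside the training loop.
    docs = [text.split() for text in database.fetch_texts()]  # hypothetical API
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # workers is typically set to (physical cores - 1) for LdaMulticore.
    lda = models.LdaMulticore(corpus, id2word=dictionary,
                              num_topics=100, workers=3)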

Gensim doc2vec file stream training worse performance

大兔子大兔子 - Submitted on 2019-12-24 18:38:55
Question: Recently I switched to gensim 3.6, and the main reason was the optimized training process, which streams the training data directly from a file, thus avoiding the GIL performance penalties. This is how I used to train my doc2vec:

    training_iterations = 20
    d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025,
                  min_alpha=0.00025, dm=0)
    d2v.build_vocab(corpus)
    for epoch in range(training_iterations):
        d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
        d2v.alpha -= 0.0002
        d2v
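
Two things stand out. The manual loop that calls train() repeatedly while decaying alpha by hand is a long-discouraged pattern; passing epochs once lets gensim manage the learning-rate schedule itself. And the optimized file-streaming path is only taken when the data comes in via corpus_file. A minimal sketch of that path, assuming a plain-text file with one pre-tokenized document per line (the file name is hypothetical):

    from multiprocessing import cpu_count
    from gensim.models.doc2vec import Doc2Vec

    # corpus_file streams training data straight from disk, which is what
    # lets gensim 3.6+ bypass the GIL bottleneck with many workers.
    d2v = Doc2Vec(corpus_file='corpus.txt',  # hypothetical path
                  vector_size=200, dm=0, workers=cpu_count(),
                  alpha=0.025, min_alpha=0.00025, epochs=20)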

Python Gensim word2vec vocabulary key

陌路散爱 - Submitted on 2019-12-24 07:57:42
Question: I want to train word2vec with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.

    # -*- encoding:utf-8 -*-
    #!/usr/bin/env python
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    from gensim.models import Word2Vec
    import pprint

    with open('parsed_data.txt', 'r') as f:
        corpus = map(unicode, f.read().split('\n'))

    model = Word2Vec(size=128, window=5, min_count=5, workers=4)
    model.build_vocab(corpus, keep_raw_vocab=False)
    model.train(corpus)
    model.save('w2v')
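
The likely problem is not the encoding: each element of corpus is a whole line as one unicode string, and since gensim iterates over each "sentence", every single character gets treated as a word. Word2Vec expects each sentence to be a list of tokens. A minimal sketch of the fix (still Python 2, as in the question):

    import io
    from gensim.models import Word2Vec

    # io.open decodes to unicode directly, so the sys.setdefaultencoding
    # hack is unnecessary; splitting each line yields lists of tokens.
    with io.open('parsed_data.txt', 'r', encoding='utf-8') as f:
        corpus = [line.split() for line in f if line.strip()]

    # Passing the corpus to the constructor runs build_vocab and train
    # with a proper total_examples/epochs setup in one step.
    model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
    model.save('w2v')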

Why do we use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?

六月ゝ 毕业季﹏ - Submitted on 2019-12-24 00:48:54
Question: In word2vec, after training, we get two weight matrices: 1. the input-hidden weight matrix; 2. the hidden-output weight matrix. People then use the input-hidden weight matrix as the word vectors (each row corresponds to a word). Here is where my confusion starts: why do people use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix? And why don't we just add a softmax activation function to the hidden layer rather than the output layer, thus preventing
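
Both matrices have the same shape (vocabulary size x embedding size), and both live in a trained model. A small sketch showing where gensim keeps each one; the attribute names below are the gensim 3.x ones and moved in 4.0:

    from gensim.models import Word2Vec

    sentences = [['the', 'quick', 'brown', 'fox'],
                 ['jumps', 'over', 'the', 'lazy', 'dog']]
    model = Word2Vec(sentences, size=10, min_count=1, negative=5)

    # Input->hidden weights: the rows handed out as "word vectors".
    input_matrix = model.wv.vectors
    # Hidden->output weights (negative-sampling variant of the output layer).
    output_matrix = model.trainables.syn1neg

    print(input_matrix.shape, output_matrix.shape)  # both (vocab_size, 10)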

gensim word2vec - updating word embeddings with newly arriving data

冷暖自知 - Submitted on 2019-12-23 04:52:51
Question: I have trained on 26 million tweets with the skip-gram technique to create word embeddings, as follows:

    sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_26M.txt')
    model = gensim.models.word2vec.Word2Vec(sentences, window=2, sg=1, size=200, iter=20)
    model.save_word2vec_format('/.../savedModel/Tweets26M_All.model.bin', binary=True)

However, I am continuously collecting more tweets in my database. For example, when I have 2 million more tweets, I want to update my embeddings with also
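
Gensim supports incremental updates, but only from a model saved with its full training state; save_word2vec_format() keeps just the vectors, so nothing saved that way can be trained further. A hedged sketch, assuming the full model was instead saved with model.save() (paths are hypothetical):

    import gensim

    # Reload the full model (saved earlier with model.save(), not
    # save_word2vec_format, which discards the training state).
    model = gensim.models.Word2Vec.load('/.../savedModel/Tweets26M_All.model')

    new_sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_2M_new.txt')

    # update=True adds the new tweets' words to the existing vocabulary.
    model.build_vocab(new_sentences, update=True)
    model.train(new_sentences, total_examples=model.corpus_count,
                epochs=model.epochs)
    model.save('/.../savedModel/Tweets28M_All.model')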

Add new words to GoogleNews vectors with gensim

拜拜、爱过 - Submitted on 2019-12-22 22:27:43
Question: I want to get word embeddings for the words in a corpus. I decided to use the pretrained word vectors from GoogleNews via the gensim library. But my corpus contains some words that are not among the GoogleNews words. For these missing words, I want to use the arithmetic mean of the n words most similar to it among the GoogleNews words. First I load GoogleNews and check whether the word "to" is in it:

    # Load GoogleNews pretrained word2vec model
    model = word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative33.bin"
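
Two caveats are worth flagging. First, "to" is one of the stop words famously absent from the GoogleNews vectors, so this exact membership check fails. Second, most_similar() only works for words the model already contains, so the neighbors of a missing word have to come from somewhere else (edit distance, subwords, a domain lexicon). A minimal sketch of the averaging step under those assumptions, using the standard GoogleNews file name:

    import numpy as np
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)  # binary=True for the .bin file

    def mean_vector(candidates, kv):
        """Average the vectors of whichever candidate neighbors the model knows.

        `candidates` is a hypothetical, externally supplied list of neighbor
        words for the missing term."""
        known = [w for w in candidates if w in kv.vocab]
        if not known:
            return None
        return np.mean([kv[w] for w in known], axis=0)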