gensim

Doc2vec: model.docvecs is only of length 10

∥☆過路亽.° submitted on 2019-12-22 18:14:59
Question: I am trying doc2vec on 600,000 rows of sentences, and my code is below:

    model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores)
    model.build_vocab(res)
    model.train(res, total_examples=model.corpus_count, epochs=model.iter)

    # len(res) = 663406
    # number of unique words: 15581
    print(len(model.wv.vocab))
    # length of doc vectors is 10
    len(model.docvecs)
    # each of length 100
    len(model.docvecs[1])

How do I interpret this result? Why is the length of the vector only …
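
A likely culprit, sketched below as an assumption since the construction of res is not shown: gensim iterates over the tags field of each TaggedDocument, so a plain string tag such as "73215" is treated as the characters '7', '3', '2', '1', '5', and the whole corpus ends up with only the ten digit tags '0'–'9'. Wrapping each tag in a list avoids this. The sketch uses gensim 4.x names (vector_size/epochs, model.dv); older releases use size/iter and model.docvecs, and min_count is lowered only for the toy corpus:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Hypothetical tokenized corpus standing in for the 600,000+ sentences.
    sentences = [["first", "toy", "sentence", "tokens"],
                 ["second", "toy", "sentence", "tokens"]]

    # tags must be a list; tags=str(i) would make every character a separate tag.
    res = [TaggedDocument(words=words, tags=[str(i)]) for i, words in enumerate(sentences)]

    model = Doc2Vec(vector_size=100, min_count=1, window=4, epochs=50, workers=4)
    model.build_vocab(res)
    model.train(res, total_examples=model.corpus_count, epochs=model.epochs)

    print(len(model.dv))        # one vector per unique tag, i.e. len(res)
    print(len(model.dv["0"]))   # each document vector has 100 dimensions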

What are doc2vec training iterations?

泄露秘密 submitted on 2019-12-22 10:29:50
Question: I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code, which uses Gensim. As intended, I get a trained model and document vectors for the two documents. However, I would like to know the benefits of retraining the model over several epochs and how to do that in Gensim. Can it be done with the iter or alpha parameter, or does the model have to be trained in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs. Also, …
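
A minimal sketch of the usual approach, assuming gensim 4.x where the parameter is called epochs rather than iter; the two TaggedDocument objects are placeholders. A single train() call already makes the requested number of passes and decays the learning rate internally, so a manual for loop with hand-tuned alpha is normally unnecessary:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [TaggedDocument(words=["the", "first", "toy", "document"], tags=["doc0"]),
            TaggedDocument(words=["the", "second", "toy", "document"], tags=["doc1"])]

    model = Doc2Vec(vector_size=50, min_count=1, epochs=20, workers=4)
    model.build_vocab(docs)
    # 20 passes over the corpus in one call; alpha decays from its start value
    # to min_alpha automatically over those passes.
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

    print(model.dv["doc0"][:5])   # first few dimensions of one document vector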

gensim word2vec - array dimensions in updating with online word embedding

ぐ巨炮叔叔 submitted on 2019-12-22 05:24:06
Question: Updating the word vectors on the fly with Word2Vec from gensim 0.13.4.1 does not work. model.build_vocab(sentences, update=False) works fine; however, model.build_vocab(sentences, update=True) does not. I am using this website to try to emulate what they have done; hence I use the following script at some point:

    model = gensim.models.Word2Vec()
    sentences = gensim.models.word2vec.LineSentence("./text8/text8")
    model.build_vocab(sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, …
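
For reference, a minimal sketch of the online-update flow that works in newer gensim releases (4.x is assumed; the corpus file names are placeholders). The key constraint is that build_vocab(..., update=True) can only extend a model whose vocabulary has already been built and trained once:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    first = LineSentence("corpus_part1.txt")
    model = Word2Vec(vector_size=100, min_count=5, workers=4)
    model.build_vocab(first)                  # initial vocabulary (update=False)
    model.train(first, total_examples=model.corpus_count, epochs=model.epochs)

    extra = LineSentence("corpus_part2.txt")
    model.build_vocab(extra, update=True)     # grow the vocabulary and resize the weights
    model.train(extra, total_examples=model.corpus_count, epochs=model.epochs)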

How to obtain antonyms through word2vec?

懵懂的女人 submitted on 2019-12-22 04:00:15
Question: I am currently working on a word2vec model using gensim in Python, and I want to write a function that can help me find the antonyms and synonyms of a given word. For example: antonym("sad") = "happy", synonym("upset") = "enraged". Is there a way to do that in word2vec?

Answer 1: In word2vec you can find analogies in the following way:

    model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    model.most_similar(positive=['good', 'sad'], negative=['bad'])
    [(u…
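
A minimal sketch wrapping that analogy trick in a helper, assuming the gensim 4.x KeyedVectors loader and a locally available GoogleNews model; word2vec itself has no notion of antonymy, so this only works when a suitable opposite pair such as good/bad is supplied:

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def opposite_like(word, pos_anchor="good", neg_anchor="bad", topn=5):
        # "neg_anchor is to pos_anchor as word is to ?"
        return wv.most_similar(positive=[pos_anchor, word], negative=[neg_anchor], topn=topn)

    print(opposite_like("sad"))   # ideally words in the neighbourhood of "happy"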

Loss does not decrease during training (Word2Vec, Gensim)

*爱你&永不变心* submitted on 2019-12-22 00:26:27
Question: What can cause the loss from model.get_latest_training_loss() to increase on each epoch? Code used for training:

    class EpochSaver(CallbackAny2Vec):
        '''Callback to save the model after each epoch and show training parameters'''
        def __init__(self, savedir):
            self.savedir = savedir
            self.epoch = 0
            os.makedirs(self.savedir, exist_ok=True)

        def on_epoch_end(self, model):
            savepath = os.path.join(self.savedir, "model_neg{}_epoch.gz".format(self.epoch))
            model.save(savepath)
            print("Epoch saved: {}".format(self…
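
One detail worth checking, sketched below: in the gensim versions I have used, get_latest_training_loss() returns a running total for the whole train() call rather than a per-epoch figure, so it rises from epoch to epoch by design and a per-epoch loss has to be taken as a difference. The corpus here is a toy placeholder, and compute_loss=True must be set for any loss to be tracked at all:

    from gensim.models import Word2Vec
    from gensim.models.callbacks import CallbackAny2Vec

    class LossLogger(CallbackAny2Vec):
        def __init__(self):
            self.epoch = 0
            self.previous = 0.0

        def on_epoch_end(self, model):
            cumulative = model.get_latest_training_loss()
            print("epoch {}: loss delta {}".format(self.epoch, cumulative - self.previous))
            self.previous = cumulative
            self.epoch += 1

    sentences = [["a", "toy", "sentence"], ["another", "toy", "sentence"]] * 100
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=5,
                     compute_loss=True, callbacks=[LossLogger()])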

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

梦想与她 submitted on 2019-12-20 21:54:25
Question: I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them to text and tried importing them into TensorFlow's live demo of the embedding projector. One problem: it didn't work. It told me that the tensors were improperly formatted. So, being a beginner, I thought I would ask some people with more experience about possible solutions. Equivalent to my code: …
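
A minimal sketch of one way to produce files the projector accepts: a tab-separated vectors file plus a matching metadata file with one word per line (gensim 4.x attribute names are assumed, and the model and output paths are placeholders). gensim also ships a gensim.scripts.word2vec2tensor converter that produces essentially the same pair of files from the command line:

    from gensim.models import Word2Vec

    model = Word2Vec.load("my_word2vec.model")    # hypothetical saved model

    with open("vectors.tsv", "w", encoding="utf-8") as vecs, \
         open("metadata.tsv", "w", encoding="utf-8") as meta:
        for word in model.wv.index_to_key:
            vecs.write("\t".join(str(x) for x in model.wv[word]) + "\n")
            meta.write(word + "\n")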

How to load sentences into Python gensim?

萝らか妹 submitted on 2019-12-20 12:37:24
Question: I am trying to use the word2vec module from the gensim natural language processing library in Python. The docs say to initialize the model:

    from gensim.models import Word2Vec
    model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

What format does gensim expect for the input sentences? I have raw text such as:

    "the quick brown fox jumps over the lazy dogs"
    "Then a cop quizzed Mick Jagger's ex-wives briefly."

What additional processing do I need to do before feeding it into word2vec? UPDATE: Here is …
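
A minimal sketch of the expected format: an iterable of sentences, each sentence being a list of string tokens. gensim's simple_preprocess is one convenient tokenizer; vector_size is the gensim 4.x name for the older size parameter, and min_count is lowered only because the toy corpus is tiny:

    from gensim.utils import simple_preprocess
    from gensim.models import Word2Vec

    raw = [
        "the quick brown fox jumps over the lazy dogs",
        "Then a cop quizzed Mick Jagger's ex-wives briefly.",
    ]
    sentences = [simple_preprocess(line) for line in raw]   # lowercased token lists

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)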

Get most similar words, given the vector of the word (not the word itself)

帅比萌擦擦* submitted on 2019-12-20 09:55:12
Question: Using the gensim.models.Word2Vec library, you have the possibility to provide a model and a "word" for which you want to find the list of most similar words:

    model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
    model.most_similar(positive=[WORD], topn=N)

I wonder if there is a possibility to give the system the model and a "vector" as input, and ask the system to return the top similar words (whose vectors are very close to the given vector). Something similar to: …
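
A minimal sketch using similar_by_vector, which takes a raw vector instead of a word (gensim 4.x KeyedVectors is assumed; the model path and the choice of "king" are placeholders):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)

    vector = wv["king"]                          # any 1-D array of matching dimensionality
    print(wv.similar_by_vector(vector, topn=10))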

gensim - Doc2Vec: MemoryError when training on english Wikipedia

你说的曾经没有我的故事 submitted on 2019-12-20 04:50:23
Question: I extracted 145,185,965 sentences (14 GB) from the English Wikipedia dump and I want to train a Doc2Vec model on these sentences. Unfortunately I have "only" 32 GB of RAM and get a MemoryError when trying to train. Even if I set min_count to 50, gensim tells me that it would need over 150 GB of RAM. I don't think further increasing min_count is a good idea, because the resulting model would probably not be very good (just a guess). But anyway, I will try it with 500 to see if …
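
For context, a minimal sketch of streaming the corpus from disk instead of holding a 14 GB list in memory; the file name and the whitespace tokenization are placeholders. Note that in Doc2Vec most of the memory usually goes to one float32 vector per document tag (145 million tags × 100 dimensions is already roughly 58 GB), so raising min_count mainly shrinks the word vocabulary; reducing the number of distinct tags (for example, one tag per article rather than per sentence) is the more direct lever:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    class WikiSentences:
        """Stream TaggedDocuments from a one-sentence-per-line text file."""
        def __init__(self, path):
            self.path = path

        def __iter__(self):
            with open(self.path, encoding="utf-8") as handle:
                for i, line in enumerate(handle):
                    yield TaggedDocument(words=line.split(), tags=[i])

    corpus = WikiSentences("wiki_sentences.txt")   # hypothetical extracted dump
    model = Doc2Vec(vector_size=100, min_count=50, workers=4)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)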