word2vec

How do I create a Keras Embedding layer from a pre-trained word embedding dataset?

Submitted by 十年热恋 on 2019-12-06 07:23:02
Question: How do I load a pre-trained word embedding into a Keras Embedding layer? I downloaded glove.6B.50d.txt (from the glove.6B.zip file at https://nlp.stanford.edu/projects/glove/) and I'm not sure how to add it to a Keras Embedding layer. See: https://keras.io/layers/embeddings/

Answer 1: You will need to pass an embeddingMatrix to the Embedding layer as follows:

    Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

vocabLen: number of tokens in your vocabulary
embDim: embedding dimension
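A minimal sketch of how such an embeddingMatrix could be built from glove.6B.50d.txt, assuming the file sits in the working directory; word_index is a toy stand-in for your tokenizer's word-to-id mapping, and all names here are illustrative:

    import numpy as np
    from keras.layers import Embedding

    embDim = 50
    word_index = {"the": 1, "cat": 2, "sat": 3}          # toy vocabulary: token -> integer id

    embeddings = {}
    with open("glove.6B.50d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    vocabLen = len(word_index) + 1                       # +1 so id 0 stays free for padding
    embeddingMatrix = np.zeros((vocabLen, embDim))
    for word, i in word_index.items():
        if word in embeddings:
            embeddingMatrix[i] = embeddings[word]        # words missing from GloVe keep an all-zero row

    embedding_layer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=False)

Setting trainable=False keeps the GloVe vectors frozen; pass trainable=True if you want them fine-tuned along with the rest of the network.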

Implementing a Word2Vec model in Python

Submitted by ♀尐吖头ヾ on 2019-12-06 06:48:17
    import gensim, logging, os
    import nltk

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    corpus = nltk.corpus.brown.sents()
    fname = 'brown_skipgram.model'
    if os.path.exists(fname):
        # load the file if it has already been trained, to save repeating the slow training step below
        model = gensim.models.Word2Vec.load(fname)
    else:
        # can take a few minutes, grab a cuppa
        model = gensim.models.Word2Vec(corpus, size=100, min_count=5, workers=2, iter=50)
        model.save(fname)

    words = "woman women man girl boy green blue".split()
    for w1 in words:
        for w2 in words:
            print(w1, w2, model.wv.similarity(w1, w2))  # pairwise cosine similarity

AttributeError: module 'boto' has no attribute 'plugin'

Submitted by 青春壹個敷衍的年華 on 2019-12-06 06:45:35
I'm running a VM on Google Cloud Platform, using a Jupyter notebook with word2vec models. I have the following code snippet:

    from gensim.models import Word2Vec
    amazon_word2vec = Word2Vec(model, min_count=1, size=100)

It results in the error:

    AttributeError: module 'boto' has no attribute 'plugin'

What is the solution to the above problem?

Answer: Install google-compute-engine (pip install google-compute-engine), restart your VM, and it works fine.

Source: https://stackoverflow.com/questions/52414249/attributeerror-module-boto-has-no-attribute-plugin

[NLP] [5] Word2Vec in gensim

Submitted by 倖福魔咒の on 2019-12-06 06:40:59
[1] Overview of the overall workflow: under the hood, gensim wraps the C interface of Google's original Word2Vec and implements word2vec on top of it. The gensim API is very convenient to use, and the overall workflow is:

1. Data preprocessing (tokenized data)
2. Data loading
3. Model definition and training
4. Model saving and loading
5. Model usage (similarity computation, retrieving word vectors)

[2] Main word2vec features provided by gensim

[3] Examples of using the gensim API

1. Tokenize with jieba. Text data: the full text of the novel "In the Name of the People" is used as the corpus. Baidu cloud drive: https://pan.baidu.com/s/1ggA4QwN

    # -*- coding:utf-8 -*-
    import jieba

    def preprocess_in_the_name_of_people():
        with open("in_the_name_of_people.txt", mode='rb') as f:
            doc = f.read()
            doc_cut = jieba.cut(doc)
            result = ' '.join(doc_cut)
            result = result.encode('utf-8')
        with open("in_the_name_of_people_cut.txt", mode='wb') as f2:
            f2.write(result)

2. Train word vectors on the original text8.zip corpus, as in the sketch below.
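A minimal sketch of this step with the gensim 3.x API, assuming text8.zip has been unzipped to a local file named text8; the file names and parameters are illustrative, not the original post's code:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    sentences = Text8Corpus("text8")                 # streams the corpus in fixed-size chunks
    model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
    model.save("text8.model")                        # step 4: save for later loading
    print(model.wv.most_similar("king", topn=5))     # step 5: similarity queries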

Word2vec fine tuning

Submitted by 喜欢而已 on 2019-12-06 06:38:09
Question: I am new to working with word2vec. I need to fine-tune my word2vec model. I have two datasets: data1 and data2. What I have done so far is:

    model = gensim.models.Word2Vec(data1, size=size_v, window=size_w,
                                   min_count=min_c, workers=work)
    model.train(data1, total_examples=len(data1), epochs=epochs)
    model.train(data2, total_examples=len(data2), epochs=epochs)

Is this correct? Do I need to store the learned weights somewhere? I checked this answer and this one, but I couldn't understand how it's done. Can
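A hedged sketch of one common continuation-training pattern in gensim 3.x (my illustration, not the linked answers), under the assumption that data1 and data2 are lists of tokenized sentences: extend the vocabulary with the new data before training on it.

    import gensim

    # toy stand-ins for the question's data1 / data2 (lists of tokenized sentences)
    data1 = [["the", "cat", "sat", "on", "the", "mat"]] * 100
    data2 = [["a", "dog", "ran", "in", "the", "park"]] * 100

    model = gensim.models.Word2Vec(data1, size=100, window=5, min_count=5, workers=4)
    model.build_vocab(data2, update=True)   # add words seen only in data2 to the existing vocab
    model.train(data2, total_examples=len(data2), epochs=model.epochs)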

How to convert gensim Word2Vec model to FastText model?

Submitted by [亡魂溺海] on 2019-12-06 02:57:43
I have a Word2Vec model that was trained on a huge corpus. While using this model for a neural network application, I came across quite a few "out of vocabulary" words. Now I need to find word embeddings for these "out of vocabulary" words. So I did some googling and found that Facebook has recently released a FastText library for this. Now my question is: how can I convert my existing Word2Vec model or KeyedVectors to a FastText model?

Answer: FastText is able to create vectors for subword fragments by including those fragments in the initial training, from the original corpus. Then, when encountering an
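A minimal sketch, assuming the original training corpus (tokenized sentences) is still available: FastText has to learn its subword vectors during training, so one route is to retrain with gensim's FastText class rather than convert the Word2Vec weights directly. This is my illustration, not the thread's accepted answer.

    from gensim.models import FastText

    # toy corpus standing in for the original training data
    sentences = [["machine", "learning", "is", "fun"],
                 ["deep", "learning", "uses", "neural", "networks"]] * 100

    ft_model = FastText(sentences, size=100, window=5, min_count=1, workers=4)
    print(ft_model.wv["learnings"])   # OOV word: vector assembled from character n-grams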

How to handle words that are not in word2vec's vocab optimally

Submitted by 大憨熊 on 2019-12-06 02:16:38
I have a list of ~10 million sentences, where each of them contains up to 70 words. I'm running gensim word2vec on every word, and then taking the simple average of each sentence. The problem is that I use min_count=1000, so a lot of words are not in the vocab. To solve that, I intersect the vocab array (which contains about 10,000 words) with every sentence, and if there's at least one element left in that intersection, it returns the simple average; otherwise, it returns a vector of zeros. The issue is that calculating every average takes a very long time when I run it on the whole dataset
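A hedged sketch of one way to compute the per-sentence average without the explicit vocab intersection (my own suggestion, with illustrative names): membership tests against the KeyedVectors object are cheap, so vectors can be looked up directly and averaged with numpy.

    import numpy as np

    def sentence_vector(tokens, wv, dim=100):
        # keep only the words that survived min_count and are in the model's vocab
        vectors = [wv[w] for w in tokens if w in wv]
        if not vectors:
            return np.zeros(dim)          # no known words: fall back to a zero vector
        return np.mean(vectors, axis=0)

    # usage: sentence_vector("the quick brown fox".split(), model.wv)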

Python: What is the “size” parameter in Gensim Word2vec model class

Submitted by 穿精又带淫゛_ on 2019-12-06 02:15:26
Question: I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec. From the Gensim documentation, size is the dimensionality of the vector. Now, as far as my knowledge goes, word2vec creates, for each word, a vector of the probability of closeness with the other words in the sentence. So, suppose my vocab size is 30: how does it create a vector with a dimension greater than 30? Can anyone please brief me on the optimal value of the Word2Vec size? Thank you.

Answer 1:
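A small sketch (my illustration, not the answer above) showing that size sets the dimensionality of each learned dense vector and is independent of the vocabulary size; the toy corpus and numbers are made up.

    from gensim.models import Word2Vec

    sents = [["red", "green", "blue"], ["cat", "dog", "bird"]] * 50
    model = Word2Vec(sents, size=50, min_count=1)   # 6-word vocab, 50-dimensional vectors
    print(len(model.wv.vocab))                      # 6
    print(model.wv["cat"].shape)                    # (50,)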

What are doc2vec training iterations?

Submitted by 為{幸葍}努か on 2019-12-06 02:14:49
I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code that uses Gensim. As intended, I get a trained model and document vectors for the two documents. However, I would like to know the benefits of retraining the model over several epochs and how to do that in Gensim. Can we do it using the iter or alpha parameters, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs. Also, I am interested in knowing whether multiple training iterations are needed for the word2vec model as well.
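A hedged sketch of how the number of training passes can be set in gensim's Doc2Vec, either via the constructor's epochs argument or in an explicit train() call; the two toy documents are illustrative and not the question's original code.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
                 TaggedDocument(words=["the", "dog", "lay", "by", "the", "door"], tags=[1])]

    model = Doc2Vec(vector_size=100, min_count=1, epochs=20)   # 20 passes over the corpus
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

    print(model.docvecs[0])   # vector for the first document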

Injecting pre-trained word2vec vectors into TensorFlow seq2seq

Submitted by 烈酒焚心 on 2019-12-06 01:15:19
Question: I was trying to inject pre-trained word2vec vectors into an existing TensorFlow seq2seq model. Following this answer, I produced the code below. But it doesn't seem to improve performance as it should, although the values in the variable are updated. In my understanding, the error might be due to the fact that EmbeddingWrapper or embedding_attention_decoder creates embeddings independently of the vocabulary order? What would be the best way to load pre-trained vectors into a TensorFlow model?
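A hedged sketch of the generic TensorFlow 1.x pattern for overwriting an embedding variable with pre-trained vectors (not the answer linked in the question); variable names and shapes are illustrative, and the rows of the pre-trained matrix must follow the model's own vocabulary ids.

    import numpy as np
    import tensorflow as tf

    vocab_size, emb_dim = 10000, 300
    pretrained = np.random.rand(vocab_size, emb_dim).astype(np.float32)  # stand-in for real word2vec vectors

    embedding_var = tf.get_variable("embedding", shape=[vocab_size, emb_dim])
    embedding_ph = tf.placeholder(tf.float32, shape=[vocab_size, emb_dim])
    assign_op = embedding_var.assign(embedding_ph)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(assign_op, feed_dict={embedding_ph: pretrained})  # row i must be the vector for vocab id i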