word2vec

word2vec principles (Part 1): Fundamentals of the CBOW and Skip-Gram models

冷暖自知 submitted on 2019-12-04 17:42:36
word2vec serves as the input to a neural probabilistic language model, yet it is itself a by-product of that model: an intermediate result produced while training a neural network to learn a particular language model. Concretely, "a particular language model" refers to CBOW and Skip-gram. Training uses one of two approximations that reduce computational complexity: Hierarchical Softmax or Negative Sampling. Two models times two methods gives four implementations in total.

1. CBOW
   1.1 Single-word context
   1.2 Parameter updates
   1.3 Multi-word context
2. Skip-gram
   2.1 Network structure
   2.2 Parameter updates
3. Optimization

The original CBOW and Skip-Gram models are far too expensive to compute. When computing the network output, the model must compute an error term for each output unit: the CBOW model needs $V$ error terms (where $V$ is the vocabulary size), and the Skip-Gram model needs $CV$ error terms (where $C$ is the number of context words). Moreover, each error term involves a softmax whose normalizer $\sum_{j=1}^{V} \exp(u_j)$ requires $O(V)$ operations. Every gradient update requires computing the network output. With a vocabulary of 1 million words and 100 training iterations, the amount of computation exceeds 100 million operations. Although the input vector is also high-dimensional, only one of its components is 1 and the rest are 0, so the overall cost on the input side is small. The main idea behind word2vec's optimizations is to limit the number of output units that must be updated. In fact, among the millions of output units
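
To make the $O(V)$ softmax cost concrete, here is a minimal sketch (not from the original article); the vocabulary size, vectors, and variable names are toy placeholders.

```python
import numpy as np

# Toy illustration of why the naive softmax is expensive: for one training
# example, the normalizer sums exp(u_j) over every word in the vocabulary,
# so each gradient step touches all V output units.
V, d = 50_000, 100                 # vocabulary size (real corpora reach millions), embedding dim
h = np.random.rand(d)              # hidden layer: the projected context vector
W_out = np.random.rand(V, d)       # output weight matrix, one row per vocabulary word

u = W_out @ h                      # V scores: one dot product per vocabulary word
p = np.exp(u - u.max())            # shift by the max for numerical stability
p /= p.sum()                       # the normalizer sum_{j=1}^{V} exp(u_j) is O(V)
```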

How do I create a Keras Embedding layer from a pre-trained word embedding dataset?

孤者浪人 submitted on 2019-12-04 16:41:29
How do I load a pre-trained word embedding into a Keras Embedding layer? I downloaded glove.6B.50d.txt (from the glove.6B.zip file at https://nlp.stanford.edu/projects/glove/) and I'm not sure how to add it to a Keras Embedding layer. See: https://keras.io/layers/embeddings/

You will need to pass an embeddingMatrix to the Embedding layer as follows:

Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

- vocabLen: number of tokens in your vocabulary
- embDim: embedding vector dimension (50 in your example)
- embeddingMatrix: embedding matrix built from glove.6B.50d.txt
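
A minimal sketch of how such an embeddingMatrix might be built from the GloVe text file, assuming a word_index dict (e.g. from a Keras Tokenizer) maps each vocabulary token to an integer index; the function name and arguments are illustrative, not part of the original answer.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

def build_glove_embedding_layer(word_index, glove_path="glove.6B.50d.txt",
                                emb_dim=50, trainable=False):
    """Build a Keras Embedding layer initialized from a GloVe text file."""
    # Parse the GloVe file: each line is a word followed by emb_dim floats.
    embeddings = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Row 0 is reserved for padding; words missing from GloVe keep zero vectors.
    vocab_len = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_len, emb_dim))
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            embedding_matrix[idx] = vector

    return Embedding(vocab_len, emb_dim,
                     weights=[embedding_matrix],
                     trainable=trainable)
```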

Load gensim Word2Vec computed in Python 2, in Python 3

。_饼干妹妹 submitted on 2019-12-04 16:05:45
I have a gensim Word2Vec model computed in Python 2 like this:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(LineSentence('enwiki.txt'), size=100, window=5, min_count=5, workers=15)
model.save('w2v.model')

However, I need to use it in Python 3. If I try to load it,

import gensim
from gensim.models import Word2Vec

model = Word2Vec.load('w2v.model')

it results in an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf9 in position 0: ordinal not in range(128)

I suppose the problem is in differences in encoding between Python 2 and Python 3.
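
The answer is cut off above; as a hedged workaround (an assumption, not the accepted answer), one option is to export only the vectors from Python 2 in the language-agnostic word2vec binary format and reload them in Python 3 with KeyedVectors. This keeps the vectors but drops the full trainable model state.

```python
# In Python 2, export just the word vectors (the method's location varies by
# gensim version; in older releases it lives directly on the model object):
#   model.wv.save_word2vec_format('w2v_vectors.bin', binary=True)

# In Python 3, reload the exported vectors:
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('w2v_vectors.bin', binary=True)
print(vectors.most_similar('king'))
```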

How can I access the output embedding (output vector) in gensim word2vec?

北城以北 submitted on 2019-12-04 13:49:35
Question: I want to use the output embeddings of word2vec, as in this paper (Improving Document Ranking with Dual Word Embeddings). I know the input vectors are stored in syn0, and the output vectors in syn1 (or syn1neg when negative sampling is used). But when I calculate most_similar with an output vector, I get the same results in some ranges because syn1 or syn1neg has been removed. Here is what I got:

IN[1]: model = Word2Vec.load('test_model.model')
IN[2]: model.most_similar([model.syn1neg[0]])
OUT[2]: [('of', -0.04402521997690201), (
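
One way to query against the output matrix explicitly (a sketch under the assumption that the model still holds syn1neg, i.e. it was trained with negative sampling and not trimmed; attribute names vary across gensim releases):

```python
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('test_model.model')

in_vecs = model.wv.syn0            # IN (input) embeddings
out_vecs = model.syn1neg           # OUT (output) embeddings from negative sampling

# IN-OUT similarity: compare the IN vector of a query word against every OUT vector.
query = in_vecs[model.wv.vocab['of'].index]
scores = out_vecs @ query / (np.linalg.norm(out_vecs, axis=1)
                             * np.linalg.norm(query) + 1e-9)

top = np.argsort(-scores)[:10]
print([(model.wv.index2word[i], float(scores[i])) for i in top])
```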

Word2vec fine tuning

怎甘沉沦 submitted on 2019-12-04 12:28:24
I am new to working with word2vec. I need to fine-tune my word2vec model. I have 2 datasets: data1 and data2. What I did so far is:

model = gensim.models.Word2Vec(data1, size=size_v, window=size_w, min_count=min_c, workers=work)
model.train(data1, total_examples=len(data1), epochs=epochs)
model.train(data2, total_examples=len(data2), epochs=epochs)

Is this correct? Do I need to store the learned weights somewhere? I checked this answer and this one, but I couldn't understand how it's done. Can someone explain to me the steps to follow? Thank you in advance.

Note you don't need to call train() with data1 if you already supplied it when instantiating the model: in that case the vocabulary is built and a training pass runs automatically.
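
A hedged sketch of the incremental-training pattern gensim supports (whether this counts as true fine-tuning is debatable, and results vary); the data and hyperparameters below are toy stand-ins for the question's variables, and the parameter names assume gensim 3.x:

```python
import gensim

# Toy stand-ins for the question's data1/data2 and hyperparameters.
data1 = [["hello", "world"], ["machine", "learning", "is", "fun"]]
data2 = [["deep", "learning"], ["word", "embeddings", "are", "useful"]]
size_v, size_w, min_c, work = 100, 5, 1, 4

# Passing data1 to the constructor already builds the vocabulary and trains on it,
# so a separate train(data1) call is not needed.
model = gensim.models.Word2Vec(
    data1, size=size_v, window=size_w, min_count=min_c, workers=work)

# Extend the vocabulary with data2, then continue training on it.
model.build_vocab(data2, update=True)
model.train(data2, total_examples=len(data2), epochs=model.epochs)
model.save('finetuned_w2v.model')
```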

Merging pretrained models in Word2Vec?

北战南征 submitted on 2019-12-04 11:46:36
Question: I have downloaded the Google News pretrained vector file, trained on about 100 billion words. On top of that I am also training on my own 3 GB of data, producing another pretrained vector file. Both have 300 feature dimensions and are more than 1 GB in size. How do I merge these two huge sets of pre-trained vectors? Or how do I train a new model and update vectors on top of another? I see that the C-based word2vec does not support batch training. I am looking to compute word analogies from these two models. I believe that vectors learned from
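
The answer is cut off, so the following is an assumption rather than the original reply: instead of merging the two vector files, one route suggested for older gensim versions is to seed a fresh model with the GoogleNews vectors via intersect_word2vec_format and keep training on your own corpus; the file name and the placeholder corpus below are illustrative.

```python
from gensim.models import Word2Vec

# Placeholder corpus standing in for the 3 GB of custom data.
my_sentences = [["stock", "market", "news"], ["sports", "scores", "update"]]

model = Word2Vec(size=300, min_count=1, workers=4)
model.build_vocab(my_sentences)

# Copy pretrained GoogleNews vectors for words already in the new vocabulary;
# lockf=1.0 leaves them free to keep adjusting during training.
model.intersect_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, lockf=1.0)

model.train(my_sentences, total_examples=model.corpus_count, epochs=model.epochs)
```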

How to find synonyms based on word2vec

耗尽温柔 submitted on 2019-12-04 11:22:01
I'm working on a word2vec model using gensim in Python, but I found that the results are words sharing the same theme; synonyms are only part of the result. Can I find synonyms of a word based on the work I have done? Any replies will be appreciated!

Word2vec tends to indicate similar words – but as you've probably seen, the kind of similarity it learns includes more than just pure synonyms. For example, word2vec similarities include words that appear in similar contexts, such as alternatives, including even opposites. (After all, 'hot' and 'cold' are very similar words in many ways – both

Python: What is the “size” parameter in Gensim Word2vec model class

和自甴很熟 submitted on 2019-12-04 08:45:25
I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec. From the Gensim documentation, size is the dimensionality of the vector. Now, as far as my knowledge goes, word2vec creates for each word a vector of the probability of closeness to the other words in the sentence. So, suppose my vocab size is 30: how does it then create a vector with a dimension greater than 30? Can anyone please brief me on the optimal value for the Word2Vec size? Thank you.

size is, as you note, the dimensionality of the vector. Word2Vec needs large, varied text examples to create
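
A tiny illustration of the point (toy corpus; assumes gensim 3.x, where the parameter is still named size — in gensim 4 it became vector_size): size fixes the length of each learned vector and is independent of the vocabulary size.

```python
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"], ["a", "lazy", "dog", "sleeps"]]
model = Word2Vec(sentences, size=10, min_count=1)

# Every word vector has 10 dimensions even though the vocabulary has only 8 words.
print(model.wv["fox"].shape)   # (10,)
print(len(model.wv.vocab))     # 8
```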

How to concatenate word vectors to form a sentence vector

╄→гoц情女王★ submitted on 2019-12-04 05:03:01
Question: I have read in some papers (Tomas Mikolov...) that a better way of forming the vector for a sentence is to concatenate the word vectors. But due to my clumsiness in mathematics, I am still not sure about the details. For example, suppose that the dimension of a word vector is m and that a sentence has n words. What will be the correct result of the concatenation operation? Is it a row vector of 1 x (m*n)? Or a matrix of m x n? Please advise, thanks.

Answer 1: There are at least three common ways to
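
A small numeric illustration of the two shapes being asked about (placeholder random vectors, not real embeddings):

```python
import numpy as np

m, n = 4, 3                                      # word-vector dimension, sentence length
word_vectors = [np.random.rand(m) for _ in range(n)]

concatenated = np.concatenate(word_vectors)      # shape (m*n,): a 1 x 12 row vector
stacked = np.stack(word_vectors)                 # shape (n, m): a 3 x 4 matrix

print(concatenated.shape)   # (12,)
print(stacked.shape)        # (3, 4)
```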

Gensim Word2Vec select minor set of word vectors from pretrained model

痞子三分冷 submitted on 2019-12-04 03:56:15
Question: I have a large pretrained Word2Vec model in gensim, and I want to use its pretrained word vectors for an embedding layer in my Keras model. The problem is that the embedding size is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer. Is there a way to keep just the desired word vectors (including the corresponding indices!), based on a whitelist of words?

Answer 1: Thanks to this
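
The linked answer is cut off above; as an assumed sketch (the model file name and whitelist are placeholders), one way to keep only whitelisted vectors is to copy them into a smaller matrix together with a fresh word-to-index mapping for the Keras layer:

```python
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('big_pretrained.model')
whitelist = ["apple", "banana", "cherry"]

kept = [w for w in whitelist if w in model.wv]          # drop words the model lacks
word_index = {w: i + 1 for i, w in enumerate(kept)}     # row 0 reserved for padding

embedding_matrix = np.zeros((len(kept) + 1, model.vector_size))
for word, idx in word_index.items():
    embedding_matrix[idx] = model.wv[word]

# embedding_matrix (and word_index) can now back a much smaller Keras Embedding layer.
```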