word2vec

How to use vector representations of words (as obtained from Word2Vec, etc.) as features for a classifier?

一曲冷凌霜 submitted on 2019-12-03 10:32:14
I am familiar with using BOW features for text classification: we first find the size of the vocabulary for the corpus, which becomes the size of our feature vector. For each sentence/document, and for all its constituent words, we then put 0/1 depending on the absence/presence of that word in that sentence/document. However, now that I am trying to use a vector representation of each word, is creating a global vocabulary essential? Suppose the size of the vectors is N (usually between 50 and 500). The naive way of generalizing the traditional BOW is just replacing the 0/1 bit …
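A common answer to this question is that no global 0/1 vocabulary vector is needed: each sentence can be represented by the average of its word vectors, giving a fixed N-dimensional feature regardless of sentence length. A minimal sketch (the `average_vector` helper and the toy embedding dict are my own illustration, not part of any library):

```python
import numpy as np

def average_vector(words, vectors, dim):
    """Average the vectors of the known words in a sentence.

    `vectors` is a hypothetical dict mapping word -> np.ndarray of length
    `dim` (e.g. extracted from a trained Word2Vec model); words missing
    from the vocabulary are simply skipped.
    """
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.mean(found, axis=0)

# toy 3-dimensional "embeddings" standing in for real word vectors
vecs = {"good": np.array([1.0, 0.0, 1.0]),
        "movie": np.array([0.0, 1.0, 1.0])}
feat = average_vector(["good", "movie", "unseen"], vecs, 3)
```

The resulting `feat` can be fed to any standard classifier in place of the BOW vector; unseen words contribute nothing, just as out-of-vocabulary words do in BOW.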

How can I access the output embedding (output vector) in gensim word2vec?

♀尐吖头ヾ submitted on 2019-12-03 08:55:36
I want to use the output embeddings of word2vec, as in this paper (Improving Document Ranking with Dual Word Embeddings). I know the input vectors are in syn0 and the output vectors are in syn1 (or syn1neg with negative sampling). But when I calculated most_similar with an output vector, I got the same result in some ranges because syn1 or syn1neg had been removed. Here is what I got: IN[1]: model = Word2Vec.load('test_model.model') IN[2]: model.most_similar([model.syn1neg[0]]) OUT[2]: [('of', -0.04402521997690201), ('has', -0.16387106478214264), ('in', -0.16650712490081787), ('is', -0.18117375671863556), ('by', -0…
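The dual-embedding (IN-OUT) similarity from that paper can be computed by hand once both matrices are in view, without going through `most_similar` (which always ranks against the input vectors). A sketch with random stand-in matrices; in gensim the real ones would be the input-vector matrix (`syn0`) and `model.syn1neg`, and the `in_out_similar` helper is my own:

```python
import numpy as np

# Hypothetical IN (syn0) and OUT (syn1neg) matrices for a 4-word, 5-dim vocab.
rng = np.random.default_rng(0)
IN = rng.normal(size=(4, 5))
OUT = rng.normal(size=(4, 5))

def in_out_similar(word_idx, in_mat, out_mat, topn=2):
    """Rank all words by cosine similarity between one IN vector and every
    OUT vector (the IN-OUT score of the dual-embedding paper)."""
    q = in_mat[word_idx]
    sims = out_mat @ q / (np.linalg.norm(out_mat, axis=1) * np.linalg.norm(q))
    return list(np.argsort(-sims)[:topn])   # indices of the top-scoring words

ranking = in_out_similar(0, IN, OUT)
```

Note that `syn1`/`syn1neg` are discarded by gensim's memory-trimming steps (e.g. after `init_sims(replace=True)` or loading a KeyedVectors-only file), which is why the lookup above only works on a fully loaded, trainable model.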

Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Anonymous (unverified) submitted on 2019-12-03 08:44:33
Question: I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I am able to create the model and store it to disk, but when I try to load the file back I get the error below. -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py Traceback (most recent call last): File "prog_w2v.py", line 7, in <module> models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec…
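This error usually means the loader and the file format do not match: a file written by `model.save()` is a pickle (recent pickle protocols start with the byte 0x80, exactly the byte in the traceback) and must be read back with `Word2Vec.load()`, while `load_word2vec_format()` expects the word2vec text/binary export, which starts with an ASCII "vocab_size dim" header. A small sniffing sketch (the helper name and return labels are my own, not gensim API):

```python
def sniff_w2v_file(header_bytes):
    """Guess how a saved word2vec file should be loaded from its first bytes.

    - pickle (from model.save())      -> load with Word2Vec.load(path)
    - word2vec text/binary export     -> load with load_word2vec_format(path)
    """
    if header_bytes[:1] == b"\x80":          # pickle protocol 2+ marker
        return "gensim-save"
    first_line = header_bytes.split(b"\n", 1)[0]
    try:
        parts = first_line.decode("ascii").split()
    except UnicodeDecodeError:
        return "unknown"
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "word2vec-format"             # "<vocab_size> <dim>" header
    return "unknown"
```

In the traceback above, a `.txt` file is being loaded with `binary=True`; if that file was actually written by `model.save()`, switching to `Word2Vec.load(...)` is the likely fix.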

Load gensim Word2Vec computed in Python 2, in Python 3

Anonymous (unverified) submitted on 2019-12-03 07:50:05
Question: I have a gensim Word2Vec model computed in Python 2 like this: from gensim.models import Word2Vec from gensim.models.word2vec import LineSentence model = Word2Vec(LineSentence('enwiki.txt'), size=100, window=5, min_count=5, workers=15) model.save('w2v.model') However, I need to use it in Python 3. If I try to load it, import gensim from gensim.models import Word2Vec model = Word2Vec.load('w2v.model') it results in an error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xf9 in position 0: ordinal not in range(128) I suppose the problem…
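The error comes from unpickling: Python 3's unpickler decodes Python-2 byte strings as ASCII by default, and a latin-1 decoding (which maps every byte to the same codepoint, so nothing can fail) is the usual escape hatch; for gensim models specifically, the portable route is to re-export in Python 2 with `save_word2vec_format()` and load that file in Python 3. A minimal sketch of the underlying mechanism using the `pickle` module directly (the hand-built pickle bytes are my own illustration, not gensim internals):

```python
import pickle

# A Python-2-style protocol-0 pickle whose string payload contains the
# non-ASCII byte 0xf9, mimicking what Python 2's model.save() can produce.
py2_pickle = b"S'\\xf9'\np0\n."

try:
    pickle.loads(py2_pickle)            # Python 3 defaults to ASCII decoding
    failed = False
except UnicodeDecodeError:
    failed = True                       # 'ascii' codec can't decode byte 0xf9

# latin-1 maps each byte straight to a codepoint, so the load succeeds
recovered = pickle.loads(py2_pickle, encoding="latin1")
```

Whether gensim's own `Word2Vec.load()` exposes such a knob depends on the gensim version, which is why re-saving to the word2vec format in Python 2 is the safer cross-version answer.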

AttributeError: module 'tensorflow.models.embedding.gen_word2vec' has no attribute 'skipgram_word2vec'

Anonymous (unverified) submitted on 2019-12-03 07:50:05
Question: I am new to TensorFlow and I am running the word2vec embedding tutorial code (https://github.com/tensorflow/models/tree/master/tutorials/embedding) on TensorFlow (CPU-only), OS X 10.11.6. I installed TensorFlow via pip install. Running word2vec_basic.py reaches the expected result, but when it turns to word2vec.py and word2vec_optimized.py, the following error is displayed: Answer 1: You'll need to use bazel to build the directory, since the op 'skipgram_word2vec' is defined in C++ and not in Python.

Merging pretrained models in Word2Vec?

淺唱寂寞╮ submitted on 2019-12-03 07:13:50
I have downloaded the 100-billion-word Google News pretrained vector file. On top of that, I am also training on my own 3 GB of data, producing another pretrained vector file. Both have 300 feature dimensions and are more than 1 GB in size. How do I merge these two huge sets of pretrained vectors? Or how do I train a new model and update its vectors on top of another? I see that the C-based word2vec does not support batch training. I am looking to compute word analogies from these two models, and I believe that vectors learned from the two sources will produce pretty good results. There's no straightforward way to merge the end results…
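The reason there is no straightforward merge is that two independently trained models live in unrelated coordinate systems, so averaging their vectors word-by-word is meaningless without first learning a mapping between the spaces. A common workaround is to keep both models and fall back from one vocabulary to the other at lookup time. A sketch with toy dicts standing in for the two loaded models (the `lookup` helper and the tiny 2-dim vectors are my own illustration):

```python
import numpy as np

# Two hypothetical vocabularies with their vectors (real ones would be 300-dim).
google = {"king": np.array([1.0, 0.0]), "queen": np.array([0.9, 0.1])}
custom = {"king": np.array([0.0, 1.0]), "gensim": np.array([0.2, 0.8])}

def lookup(word, primary, fallback):
    """Return the word's vector from `primary`, else `fallback`, else None.

    Vectors from the two models must never be mixed in one similarity
    computation, since the spaces are not aligned.
    """
    if word in primary:
        return primary[word]
    return fallback.get(word)
```

Properly fusing the two spaces would require learning a linear transform (e.g. Procrustes alignment) on their shared vocabulary, which is a separate project in itself.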

Why does word2Vec use cosine similarity?

北城以北 submitted on 2019-12-03 07:03:39
Question: I have been reading the papers on Word2Vec (e.g. this one), and I think I understand training the vectors to maximize the probability of other words found in the same contexts. However, I do not understand why cosine is the correct measure of word similarity. Cosine similarity says that two vectors point in the same direction, but they could have different magnitudes. For example, cosine similarity makes sense when comparing bags-of-words for documents: two documents might be of different lengths,…
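The magnitude-invariance the questioner describes is easy to see numerically: scaling a vector by any positive constant leaves its cosine similarity to everything unchanged, which is exactly why it suits comparisons where "direction" (word usage pattern) matters and "length" (roughly, frequency effects) should not. A short self-contained demonstration:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the cosine of the angle between u and v,
    independent of either vector's magnitude."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
same_direction = cosine(a, 10 * a)                      # scaling changes nothing
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Here `same_direction` is 1.0 despite the 10x magnitude difference, while `orthogonal` is 0.0 for perpendicular vectors.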

Use LSTM tutorial code to predict next word in a sentence?

邮差的信 submitted on 2019-12-03 04:18:44
Question: I've been trying to understand the sample code at https://www.tensorflow.org/tutorials/recurrent, which you can find at https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py (using TensorFlow 1.3.0). I've summarized (what I think are) the key parts for my question below: size = 200 vocab_size = 10000 layers = 2 # input_.input_data is a 2D tensor [batch_size, num_steps] of # word ids, from 1 to 10000 cell = tf.contrib.rnn.MultiRNNCell( [tf.contrib.rnn…
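Once the PTB model's final-step logits are in hand (a [batch, vocab_size] tensor), predicting the next word is just softmax followed by argmax over the vocabulary. A framework-free numpy sketch of that last step (the `predict_next` helper and the 4-word toy vocabulary are my own, not tutorial code):

```python
import numpy as np

def predict_next(logits, id_to_word):
    """Pick the highest-probability next word from one step's logits.

    `logits` is a 1-D array of length vocab_size; softmax then argmax
    yields the predicted word id.
    """
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return id_to_word[int(np.argmax(probs))]

vocab = {0: "<eos>", 1: "the", 2: "cat", 3: "sat"}
word = predict_next(np.array([0.1, 2.5, 0.3, 1.0]), vocab)
```

In the tutorial this means running the session for one more step on the fed-in prefix and applying the above to the returned logits row rather than computing the training perplexity.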

How to Train GloVe algorithm on my own corpus

浪子不回头ぞ submitted on 2019-12-03 02:54:49
I tried to follow this, but somehow I wasted a lot of time and ended up with nothing useful. I just want to train a GloVe model on my own corpus (a ~900 MB corpus.txt file). I downloaded the files provided in the link above and compiled them using cygwin (after editing the demo.sh file and changing it to VOCAB_FILE=corpus.txt; should I leave CORPUS=text8 unchanged?). The output was: cooccurrence.bin cooccurrence.shuf.bin text8 corpus.txt vectors.txt How can I use those files to load it as a GloVe model in Python? You can do it using the GloVe library: Install it: pip install glove_python Then: from glove…
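(For the demo.sh question: CORPUS should point at your own corpus file and VOCAB_FILE at an *output* vocabulary path; leaving CORPUS=text8 trains on the text8 sample instead.) The vectors.txt produced by the C tool can also be read without any extra library, since each line is a word followed by its space-separated floats. A minimal parser sketch (the `load_glove_text` name is my own):

```python
import numpy as np

def load_glove_text(lines):
    """Parse GloVe's vectors.txt format: "word f1 f2 ... fN" per line."""
    vecs = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

# toy two-line stand-in for an open('vectors.txt') file handle
sample = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
glove = load_glove_text(sample)
```

The resulting dict can feed the averaging or nearest-neighbor code elsewhere on this page, or be converted to gensim KeyedVectors if that API is preferred.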

How to find the closest word to a vector using word2vec

浪子不回头ぞ submitted on 2019-12-03 02:28:36
I have just started using Word2vec, and I was wondering how we can find the closest word to a given vector. I have this vector, which is the average of a set of vectors: array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32) Is there a straightforward way to find the word in my training data most similar to this vector? Or is the only solution to calculate the cosine similarity between this vector and the vectors of each word in my training data, then select the closest one? Thanks. For the gensim implementation of word2vec there is the most_similar() function that lets you find…
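In gensim, passing the raw vector works directly, e.g. `model.wv.most_similar(positive=[avg_vector])` (older versions accept `model.most_similar([avg_vector])`, as in the answer above). Under the hood this is exactly the brute-force cosine scan the questioner describes; a self-contained sketch with a toy 2-word vocabulary (the `closest_word` helper is my own):

```python
import numpy as np

def closest_word(query, vocab_vectors):
    """Brute-force nearest-word lookup: cosine of `query` against every
    word vector, returning the best-scoring word."""
    best, best_sim = None, -2.0                     # cosine is always >= -1
    q = query / np.linalg.norm(query)
    for word, vec in vocab_vectors.items():
        sim = float(np.dot(q, vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

vocab = {"up": np.array([0.0, 1.0]), "right": np.array([1.0, 0.0])}
nearest = closest_word(np.array([0.1, 0.9]), vocab)
```

gensim vectorizes the same scan as one matrix multiply against pre-normalized vectors, so for real vocabularies `most_similar()` is both simpler and much faster than a Python loop.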