word2vec

How does word2vec give one hot word vector from the embedding vector?

…衆ロ難τιáo~ · Submitted on 2019-12-12 03:42:53

Question: I understand how word2vec works. I want to use word2vec (skip-gram) as input for an RNN. The input is an embedding word vector, and the output is also an embedding word vector generated by the RNN. Here's my question: how can I convert the output vector to a one-hot word vector? I would need the inverse matrix of the embeddings, but I don't have one! Answer 1: The output of an RNN is not an embedding. We convert the output from the last layer in an RNN cell into a vector of vocabulary_size by multiplying with an appropriate matrix. Take a look at
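The projection the answer describes can be sketched with plain NumPy. Everything here is illustrative: the sizes, the random hidden state, and the matrix names (`W_out`, `b_out`) are made up, standing in for the learned output layer of an RNN.

```python
import numpy as np

np.random.seed(0)
hidden_size, vocab_size = 8, 12                    # toy dimensions (hypothetical)

h = np.random.randn(hidden_size)                   # final RNN hidden state
W_out = np.random.randn(hidden_size, vocab_size)   # learned projection matrix
b_out = np.zeros(vocab_size)

logits = h @ W_out + b_out                         # shape: (vocab_size,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # softmax over the vocabulary

one_hot = np.zeros(vocab_size)
one_hot[probs.argmax()] = 1.0                      # hard one-hot prediction
```

The point is that no inverse of the embedding matrix is needed: a separate output matrix maps the hidden state to vocabulary logits, and the argmax of the softmax gives the one-hot word.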

How to Cluster words and phrases with pre-trained model on Gensim

你。 · Submitted on 2019-12-11 19:45:26

Question: What I want exactly is to cluster words and phrases, e.g. knitting/knit loom/loom knitting/weaving loom/rainbow loom/home decoration accessories/loom knit/knitting loom/... I don't have a corpus; I only have the words/phrases. Could I use a pre-trained model, like the one from GoogleNews/Wikipedia/..., to achieve this? I am now trying to use Gensim to load the GoogleNews pre-trained model to get phrase similarities. I've been told that the GoogleNews model includes vectors of phrases and words.
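The basic mechanics can be sketched without the real GoogleNews model: represent each phrase as the mean of its word vectors and compare phrases by cosine similarity. The toy 4-d vectors below are invented for illustration; in practice they would come from a model loaded with gensim's `KeyedVectors.load_word2vec_format`.

```python
import numpy as np

# Toy vectors standing in for pre-trained embeddings (hypothetical values).
vec = {
    "knit":    np.array([0.9, 0.1, 0.0, 0.0]),
    "loom":    np.array([0.8, 0.2, 0.1, 0.0]),
    "weaving": np.array([0.7, 0.3, 0.0, 0.1]),
    "home":    np.array([0.0, 0.1, 0.9, 0.2]),
    "decor":   np.array([0.1, 0.0, 0.8, 0.3]),
}

def phrase_vector(phrase):
    """Average the vectors of a phrase's in-vocabulary words (simplest composition)."""
    words = [vec[w] for w in phrase.split() if w in vec]
    return np.mean(words, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_knit = cosine(phrase_vector("loom knit"), phrase_vector("weaving loom"))
sim_home = cosine(phrase_vector("loom knit"), phrase_vector("home decor"))
# Knitting-related phrases score higher with each other than with
# "home decor" -- exactly the signal a clustering step would exploit.
```

A clustering algorithm (e.g. k-means over the phrase vectors) then only needs this similarity structure, not a corpus.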

Ignore out-of-vocabulary words when averaging vectors in Spacy

人盡茶涼 · Submitted on 2019-12-11 19:09:59

Question: I would like to use a pre-trained word2vec model in Spacy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of the word embeddings. To do this I use the following code:

    import spacy
    nlp = spacy.load('myspacy.bioword2vec.model')
    sentence = "I love Stack Overflow butitsalsodistractive"
    avg_vector = nlp(sentence).vector

where nlp(sentence).vector (1) tokenizes my sentence with white-space splitting, and (2) vectorizes each word according to the dictionary provided
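The OOV-skipping average the question is after can be sketched in plain NumPy. The toy `vocab` dict below is a stand-in for the pre-trained spaCy/word2vec vocabulary, with invented 3-d vectors:

```python
import numpy as np

# Hypothetical lookup standing in for a loaded pre-trained model.
vocab = {
    "i":        np.array([0.1, 0.2, 0.3]),
    "love":     np.array([0.4, 0.5, 0.6]),
    "stack":    np.array([0.7, 0.8, 0.9]),
    "overflow": np.array([1.0, 1.1, 1.2]),
}

def mean_vector(sentence, dim=3):
    """Average only the in-vocabulary tokens; OOV tokens contribute nothing."""
    in_vocab = [vocab[t] for t in sentence.lower().split() if t in vocab]
    if not in_vocab:                 # every token was OOV
        return np.zeros(dim)
    return np.mean(in_vocab, axis=0)

v = mean_vector("I love Stack Overflow butitsalsodistractive")
```

Here "butitsalsodistractive" is simply filtered out before averaging, so it does not drag the mean toward a zero vector.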

Triplet loss on text embeddings with keras

≡放荡痞女 · Submitted on 2019-12-11 16:08:28

Question: I'll start by saying I'm quite new to Keras and machine learning in general. I'm trying to build an "experimental" model consisting of two parts: an "encoder" which takes a string (containing a long series of attributes; I'm using the DBLP-ACM dataset), builds an embedding of the words of this string (word2vec), and encodes them in a vector (bidirectional LSTM); and a trainable model which takes 3 vectors as input (the result of model 1) and uses the triplet loss as its loss function (I already defined it,
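The triplet loss itself is compact enough to sketch in NumPy. This is a generic hinge-on-distances formulation, not the asker's own definition; the margin value and the toy 2-d vectors are arbitrary:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on Euclidean distances:
    max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])    # close to the anchor
n = np.array([1.0, 0.0])    # far from the anchor

easy = triplet_loss(a, p, n)   # satisfied triplet: loss clips to 0
hard = triplet_loss(a, n, p)   # violated triplet: positive loss
```

In a Keras setup the same expression would be written with backend ops over the three encoder outputs, but the arithmetic is identical.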

Word2Vec is it for word only in a sentence or for features as well?

前提是你 · Submitted on 2019-12-11 15:44:53

Question: I would like to ask more about Word2Vec: I am currently trying to build a program that checks the embedding vectors for a sentence. At the same time, I am also building a feature extraction using scikit-learn to extract lemma 0, lemma 1, and lemma 2 from the sentence. From my understanding: 1) feature extraction: lemma 0, lemma 1, lemma 2; 2) word embedding: vectors are embedded for each character (this can be achieved by using gensim word2vec; I have tried it). More explanation: Sentence

Why can I only retrieve Array[Float] word vectors but have to pass mllib.linalg.Vector to w2v model?

故事扮演 · Submitted on 2019-12-11 11:32:56

Question: I have trained a word vector model and now I'd like to do some operations on those vectors. Currently I am trying to figure out how to, e.g., add up some vectors like below and then get some synonyms for the resulting vector. The problem is that model.findSynonyms(org.apache.spark.mllib.linalg.Vector, Int) is causing problems, since I only get Array[Float] from my model. This is why I try to create a DenseVector, which itself needs Array[Double], and the chaos is perfect - but take a look
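The question itself is about Spark's Scala API, but the shape of the fix can be sketched in Python/NumPy as an analogue: widen the `float32` arrays to `float64` before combining them, then do a brute-force cosine lookup in place of `findSynonyms`. All vectors and words below are invented:

```python
import numpy as np

# Stand-in for the model's word -> Array[Float] (float32) lookup.
vectors_f32 = {
    "king":  np.array([0.90, 0.10], dtype=np.float32),
    "queen": np.array([0.80, 0.20], dtype=np.float32),
    "crown": np.array([0.85, 0.15], dtype=np.float32),
    "road":  np.array([0.00, 1.00], dtype=np.float32),
}

# float32 -> float64 widening mirrors the Array[Float] -> Array[Double]
# conversion needed before constructing an mllib DenseVector.
combined = (vectors_f32["king"].astype(np.float64)
            + vectors_f32["queen"].astype(np.float64))

def nearest(query, exclude=()):
    """Brute-force cosine lookup playing the role of findSynonyms."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    candidates = {w: v for w, v in vectors_f32.items() if w not in exclude}
    return max(candidates,
               key=lambda w: cos(candidates[w].astype(np.float64), query))

best = nearest(combined, exclude=("king", "queen"))
```

In Scala the same idea is `arrayFloat.map(_.toDouble)` fed into `Vectors.dense(...)`.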

How to save TensorFlow's word2vec in a text/binary file for later use of kNN output?

别等时光非礼了梦想. · Submitted on 2019-12-11 07:25:33

Question: I have trained a word2vec model in TensorFlow, but when I save the session, it only outputs model.ckpt.data / .index / .meta files. I was thinking of implementing a kNN method for retrieving the nearest words. I saw answers suggesting gensim, but how can I first save my TensorFlow word2vec model as .txt? Answer 1: Simply evaluate the embeddings matrix into a numpy array and write it to a file along with the resolved words. Sample code:

    vocabulary_size = 50000
    embedding_size = 128
    # Assume your word to
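The answer's idea can be sketched end to end with toy data. The names `final_embeddings` (the evaluated embedding matrix) and `reverse_dictionary` (index → word) are the conventions of the TensorFlow word2vec tutorial; the tiny matrix and three-word vocabulary below are made up, and an in-memory buffer stands in for the output file:

```python
import io
import numpy as np

# Toy stand-ins for the training artifacts.
vocabulary_size, embedding_size = 3, 4
final_embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)
reverse_dictionary = {0: "the", 1: "quick", 2: "fox"}

buf = io.StringIO()                                   # swap in open("vectors.txt", "w")
buf.write(f"{vocabulary_size} {embedding_size}\n")    # word2vec text-format header
for i in range(vocabulary_size):
    row = " ".join(f"{x:.6f}" for x in final_embeddings[i])
    buf.write(f"{reverse_dictionary[i]} {row}\n")

text = buf.getvalue()
```

A file in this "header line, then one `word v1 v2 ...` line per word" layout can then be loaded with gensim's `KeyedVectors.load_word2vec_format` for the kNN queries.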

'Word2Vec' object has no attribute 'index2word'

不羁的心 · Submitted on 2019-12-11 07:05:12

Question: I'm getting the error "AttributeError: 'Word2Vec' object has no attribute 'index2word'" in the following Python code. Does anyone know how I can solve it? Actually, tfidf_weighted_averaged_word_vectorizer throws the error. "obli.csv" contains lines of sentences. Thank you.

    from feature_extractors import tfidf_weighted_averaged_word_vectorizer
    dataset = get_data2()
    corpus, labels = dataset.data, dataset.target
    corpus, labels = remove_empty_docs(corpus, labels)
    # print('Actual class label:',

Gensim equivalent of training steps

江枫思渺然 · Submitted on 2019-12-11 07:01:35

Question: Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps? The TensorFlow script includes this section:

    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print('Initialized')
        average_loss = 0
        for step in xrange(num_steps):
            batch_inputs, batch_labels = generate
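The rough correspondence between the two knobs can be worked out with arithmetic: TensorFlow counts batches ("steps"), while gensim counts full passes over the corpus (`iter`, later renamed `epochs`). The corpus size below is an assumption (roughly the size of the text8 corpus the tutorial uses); batch size and step count follow the tutorial's defaults:

```python
# Assumed figures, for illustration only.
corpus_words = 17_000_000     # total training tokens (approx. text8 size)
batch_size   = 128            # target words consumed per TensorFlow step
num_steps    = 100_001        # steps run by the tutorial's training loop

words_seen = num_steps * batch_size
approx_epochs = words_seen / corpus_words     # passes those steps amount to

# gensim fixes the number of passes directly, e.g. Word2Vec(iter=5), so the
# equivalent number of "steps" would be epochs * corpus_words / batch_size.
equivalent_steps = 5 * corpus_words / batch_size
```

So under these assumptions the tutorial's 100,001 steps amount to well under one full epoch, while gensim's default of 5 passes corresponds to far more steps.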

ValueError: cannot reshape array of size 3800 into shape (1,200)

狂风中的少年 · Submitted on 2019-12-11 06:47:41

Question: I am trying to apply word embeddings to tweets. I was trying to create a vector for each tweet by taking the average of the vectors of the words present in the tweet, as follows:

    def word_vector(tokens, size):
        vec = np.zeros(size).reshape((1, size))
        count = 0.
        for word in tokens:
            try:
                vec += model_w2v[word].reshape((1, size))
                count += 1.
            except KeyError:  # handling the case where the token is not in vocabulary
                continue
        if count != 0:
            vec /= count
        return vec

Next, when I try to prepare word2vec
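The error in the title is informative on its own: 3800 = 19 × 200, so the array being reshaped holds 19 stacked 200-d vectors rather than a single one, and `reshape((1, 200))` must fail. A minimal NumPy reproduction, with the 19-vector interpretation shown as one plausible fix:

```python
import numpy as np

flat = np.zeros(3800)            # 3800 = 19 * 200 values, not 200

try:
    flat.reshape((1, 200))       # cannot fit 3800 values into 1 x 200
except ValueError as e:
    msg = str(e)

# If the 3800 values really are 19 concatenated 200-d word vectors, the
# averaging approach wants them reshaped per word and then averaged:
word_vecs = flat.reshape((19, 200))    # 19 word vectors of size 200
tweet_vec = word_vecs.mean(axis=0)     # a single 200-d tweet vector
```

The practical check is therefore where the 3800-element array comes from: each `model_w2v[word]` lookup should already be 200-d, so something upstream is concatenating vectors before the reshape.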