可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words.
e.g.
trained_model.similarity('woman', 'man') 0.73723527
However, the word2vec model fails to predict the sentence similarity. I find out the LSI model with sentence similarity in gensim, but, which doesn't seem that can be combined with word2vec model. The length of corpus of each sentence I have is not very long (shorter than 10 words). So, are there any simple ways to achieve the goal?
回答1:
This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. "he walked to the store yesterday" and "yesterday, he walked to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurences / relationships in lots of real textual examples, etc.
The simplest thing you could try -- though I don't know how well this would perform and it would certainly not give you the optimal results -- would be to first remove all "stop" words (words like "the", "an", etc. that don't add much meaning to the sentence) and then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you'll at least not be subject to word order. That being said, this will fail in lots of ways and isn't a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).
So, short answer is, no, there's no easy way to do this (at least not to do it well).
回答2:
Since you're using gensim, you should probably use it's doc2vec implementation. doc2vec is an extension of word2vec to the phrase-, sentence-, and document-level. It's a pretty simple extension, described here
http://cs.stanford.edu/~quocle/paragraph_vector.pdf
Gensim is nice because it's intuitive, fast, and flexible. What's great is that you can grab the pretrained word embeddings from the official word2vec page and the syn0 layer of gensim's Doc2Vec model is exposed so that you can seed the word embeddings with these high quality vectors!
GoogleNews-vectors-negative300.bin.gz
I think gensim is definitely the easiest (and so far for me, the best) tool for embedding a sentence in a vector space.
There exist other sentence-to-vector techniques than the one proposed in Le & Mikolov's paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work has been based on the principle of compositionally - semantics of the sentence come from:
1. semantics of the words 2. rules for how these words interact and combine into phrases
They've proposed a few such models (getting increasingly more complex) for how to use compositionality to build sentence-level representations.
2011 - unfolding recursive autoencoder (very comparatively simple. start here if interested)
2012 - matrix-vector neural network
2013(?) - neural tensor network
2015 - Tree LSTM
his papers are all available at socher.org. Some of these models are available, but I'd still recommend gensim's doc2vec. For one, the 2011 URAE isn't particularly powerful. In addition, it comes pretrained with weights suited for paraphrasing news-y data. The code he provides does not allow you to retrain the network. You also can't swap in different word vectors, so you're stuck with 2011's pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec's or GloVe's.
Haven't worked with the Tree LSTM yet, but it seems very promising!
tl;dr Yeah, use gensim's doc2vec. But other methods do exist!
回答3:
If you are using word2vec, you need to calculate the average vector for all words in every sentence/document and use cosine similarity between vectors:
import numpy as np from scipy import spatial index2word_set = set(model.index2word) def avg_feature_vector(sentence, model, num_features, index2word_set): words = sentence.split() feature_vec = np.zeros((num_features, ), dtype='float32') n_words = 0 for word in words: if word in index2word_set: n_words += 1 feature_vec = np.add(feature_vec, model[word]) if (n_words > 0): feature_vec = np.divide(feature_vec, n_words) return feature_vec
Calculate similarity:
s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set) s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set) sim = 1 - spatial.distance.cosine(s1_afv, s2_afv) print(sim) > 0.915479828613
回答4:
Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the diff. The cosine can be computed by taking the dot product of the two vectors normalized. Thus, the word count is not a factor.
回答5:
you can use Word Mover's Distance algorithm. here is an easy description about WMD.
#load word2vec model, here GoogleNews is used model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True) #two sample sentences s1 = 'the first sentence' s2 = 'the second text' #calculate distance between two sentences using WMD algorithm distance = model.wmdistance(s1, s2) print ('distance = %.3f' % distance)
P.s.: if you face an error about import pyemd library, you can install it using following command:
pip install pyemd
回答6:
I am using the following method and it works well. You first need to run a POSTagger and then filter your sentence to get rid of the stop words (determinants, conjunctions, ...). I recommend TextBlob APTagger. Then you build a word2vec by taking the mean of each word vector in the sentence. The n_similarity method in Gemsim word2vec does exactly that by allowing to pass two sets of words to compare.
回答7:
There are extensions of Word2Vec intended to solve the problem of comparing longer pieces of text like phrases or sentences. One of them is paragraph2vec or doc2vec.
"Distributed Representations of Sentences and Documents" http://cs.stanford.edu/~quocle/paragraph_vector.pdf
http://rare-technologies.com/doc2vec-tutorial/
回答8:
I would like to update the existing solution to help the people who are going to calculate the semantic similarity of sentences.
Step 1:
Load the suitable model using gensim and calculate the word vectors for words in the sentence and store them as a word list
Step 2 : Computing the sentence vector
The calculation of semantic similarity between sentences was difficult before but recently a paper named "A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS" was proposed which suggests a simple approach by computing the weighted average of word vectors in the sentence and then remove the projections of the average vectors on their first principal component.Here the weight of a word w is a/(a + p(w)) with a being a parameter and p(w) the (estimated) word frequency called smooth inverse frequency.this method performing significantly better.
A simple code to calculate the sentence vector using SIF(smooth inverse frequency) the method proposed in the paper has been given here
Step 3: using sklearn cosine_similarity load two vectors for the sentences and compute the similarity.
This is the most simple and efficient method to compute the sentence similarity.
回答9:
I have tried the methods provided by the previous answers. It works, but the main drawback of it is that the longer the sentences the larger similarity will be(to calculate the similarity I use the cosine score of the two mean embeddings of any two sentences) since the more the words the more positive semantic effects will be added to the sentence.
I thought I should change my mind and use the sentence embedding instead as studied in this paper and this.