How to get vector for a sentence from the word2vec of tokens in sentence

鱼传尺愫 2020-12-02 04:18

I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of

9 Answers
  • 2020-12-02 04:39

    There are several ways to get a vector for a sentence. Each approach has advantages and shortcomings. Choosing one depends on the task you want to perform with your vectors.

    First, you can simply average the vectors from word2vec. According to Le and Mikolov, this approach performs poorly for sentiment analysis tasks, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm". On the other hand, according to Kenter et al. 2016, "simply averaging word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks. A variant would be to weight word vectors with their TF-IDF to decrease the influence of the most common words.
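    As a minimal sketch of the averaging and TF-IDF-weighted variants described above, the snippet below uses plain Python lists so it has no dependencies. The toy vectors, the tiny corpus, and the helper names (`mean_vector`, `idf_weights`, `tfidf_weighted_vector`) are assumptions made for illustration; in practice the vectors would come from your trained word2vec model and the IDF statistics from your own corpus.

```python
import math
from collections import Counter

# Toy 3-dimensional vectors standing in for trained word2vec embeddings (made-up values).
word_vectors = {
    "the": [0.1, 0.0, 0.1],
    "cat": [0.9, 0.2, 0.0],
    "sat": [0.0, 0.8, 0.3],
    "dog": [0.8, 0.3, 0.1],
}

# Tiny tokenized corpus used only to estimate IDF weights (hypothetical data).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog"]]


def mean_vector(tokens, vectors):
    """Unweighted average of the vectors of all in-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token in the sentence has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def idf_weights(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    df = Counter(token for doc in docs for token in set(doc))
    n = len(docs)
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def tfidf_weighted_vector(tokens, vectors, idf):
    """Average of word vectors, each weighted by term frequency times IDF."""
    tf = Counter(t for t in tokens if t in vectors and t in idf)
    if not tf:
        raise ValueError("no token in the sentence has both a vector and an IDF weight")
    total = sum(count * idf[t] for t, count in tf.items())
    dim = len(next(iter(vectors.values())))
    return [
        sum(count * idf[t] * vectors[t][i] for t, count in tf.items()) / total
        for i in range(dim)
    ]


sentence = ["the", "cat", "sat"]
plain = mean_vector(sentence, word_vectors)
weighted = tfidf_weighted_vector(sentence, word_vectors, idf_weights(corpus))
```

    In this toy corpus "the" appears in every document, so its IDF weight is the lowest and the weighted vector leans toward "cat" and "sat" more than the plain average does.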

    A more sophisticated approach developed by Socher et al. is to combine word vectors in an order given by a parse tree of a sentence, using matrix-vector operations. Because it depends on parsing, this method operates on individual sentences, and it works for sentence-level sentiment analysis.
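    To make the idea of composing vectors along a parse tree concrete, here is a deliberately simplified sketch. It is not the matrix-vector model of Socher et al.: it uses a single hand-written composition matrix `W`, made-up 2-dimensional word vectors, and a hand-written binary parse given as nested tuples, all of which are assumptions for illustration. A real model would learn its parameters and take the tree from a parser.

```python
import math

# Toy 2-dimensional word vectors (made-up values, not trained embeddings).
WORD_VECTORS = {
    "not": [-0.5, 0.2],
    "very": [0.1, 0.9],
    "good": [0.8, 0.3],
}

# Hypothetical composition matrix of shape (2, 4): maps [left; right] to a parent vector.
# A real model would learn this; here it is fixed by hand.
W = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]


def compose(left, right):
    """Combine two child vectors into one parent vector: tanh(W @ [left; right])."""
    stacked = left + right  # list concatenation gives a 4-dimensional vector
    return [math.tanh(sum(w * x for w, x in zip(row, stacked))) for row in W]


def encode(tree):
    """Recursively encode a binary parse tree given as nested tuples of tokens."""
    if isinstance(tree, str):
        return WORD_VECTORS[tree]
    left, right = tree
    return compose(encode(left), encode(right))


# Hand-written parse of "not very good": (not (very good)).
sentence_vector = encode(("not", ("very", "good")))

# A different bracketing of the same words yields a different vector,
# which is how tree-based composition keeps structural information.
other_vector = encode((("not", "very"), "good"))
```

    Unlike plain averaging, the result depends on how the words are bracketed, so the same bag of words can map to different sentence vectors.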

  • 2020-12-02 04:46

    It depends on the usage:

    1) If you only need sentence vectors for some known data, check out paragraph vectors in these papers:

    Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. Eprint Arxiv, 4:1188–1196.

    A. M. Dai, C. Olah, and Q. V. Le. 2015. Document Embedding with Paragraph Vectors. ArXiv e-prints, July.

    2) If you want a model that estimates sentence vectors for unseen (test) sentences with an unsupervised approach:

    You could check out this paper:

    Steven Du and Xi Zhang. 2016. Aicyber at SemEval-2016 Task 4: i-vector based sentence representation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, US

    3) Researchers are also using the output of a certain layer in an RNN or LSTM network as the sentence vector; a recent example is:

    http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195

    4) Many researchers could not get good results with gensim's doc2vec. To overcome this problem, the following paper uses doc2vec based on pre-trained word vectors:

    Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.

    5) tweet2vec or sent2vec.

    Facebook has the SentEval project for evaluating the quality of sentence vectors.

    https://github.com/facebookresearch/SentEval

    6) There is more information in the following paper:

    Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering


    Nowadays you can also use BERT:

    Google has released the source code as well as pretrained models.

    https://github.com/google-research/bert

    And here is an example of running BERT as a service:

    https://github.com/hanxiao/bert-as-service

  • 2020-12-02 04:54

    It is possible, but not from word2vec. Composing word vectors to obtain higher-level representations for sentences (and further for paragraphs and documents) is a very active research topic. There is no single best solution; it really depends on the task to which you want to apply these vectors. You can try concatenation, simple summation, pointwise multiplication, convolution, etc. There are several publications on this that you can learn from, but ultimately you need to experiment and see what fits your task best.
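    Three of the simple composition operators mentioned above can be sketched as follows. The vectors are made-up values for a three-token sentence and the function names are hypothetical; convolution is left out because it needs a filter whose choice depends on the model.

```python
from functools import reduce

# Made-up 2-dimensional word vectors for a 3-token sentence.
vectors = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]]


def concatenate(vs):
    """Join all word vectors end to end; the result grows with sentence length."""
    return [x for v in vs for x in v]


def summation(vs):
    """Element-wise sum; the result keeps the word-vector dimensionality."""
    return [sum(column) for column in zip(*vs)]


def pointwise_product(vs):
    """Element-wise (Hadamard) product across all word vectors."""
    return reduce(lambda a, b: [x * y for x, y in zip(a, b)], vs)


print(concatenate(vectors))        # [1.0, 2.0, 3.0, 4.0, 0.5, -1.0]
print(summation(vectors))          # [4.5, 5.0]
print(pointwise_product(vectors))  # [1.5, -8.0]
```

    Note that concatenation produces vectors whose length varies with the number of tokens, so sentences usually need padding or truncation before such vectors can be compared, while summation and the pointwise product always return a fixed-size vector.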
