How to get vector for a sentence from the word2vec of tokens in sentence

鱼传尺愫 2020-12-02 04:18

I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of

9 Answers
  • 2020-12-02 04:30

    A deep averaging network (DAN) can provide sentence embeddings: the embeddings of the words and bi-grams in a sentence are averaged and passed through a feedforward deep neural network (DNN).

    Transfer learning with sentence embeddings has been found to tend to outperform word-level transfer, as it preserves the semantic relationship.

    You don't need to start training from scratch; pretrained DAN models are available (check the Universal Sentence Encoder module on TensorFlow Hub).
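
    To make the averaging step concrete, here is a minimal sketch of a DAN-style forward pass in plain Python. The embeddings, layer sizes, and weights are all random, untrained placeholders chosen for illustration (they are assumptions, not values from the Universal Sentence Encoder), so the output only shows the shape of the computation: average the input embeddings, then pass the result through feedforward layers.

```python
import math
import random


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dense(x, weights, bias):
    """One feedforward layer with a tanh activation; weights has one row per output unit."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def dan_forward(token_vectors, layers):
    """Average the input embeddings, then apply each feedforward layer in turn."""
    hidden = average(token_vectors)
    for weights, bias in layers:
        hidden = dense(hidden, weights, bias)
    return hidden


rng = random.Random(0)
dim, hidden_units = 4, 3

# Hypothetical embeddings for the unigrams and the bi-gram of the sentence "i am".
tokens = ["i", "am", "i am"]
embeddings = {t: [rng.uniform(-1, 1) for _ in range(dim)] for t in tokens}

# Two untrained layers with random weights and zero biases (placeholders only).
layers = [
    ([[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden_units)],
     [0.0] * hidden_units),
    ([[rng.uniform(-1, 1) for _ in range(hidden_units)] for _ in range(hidden_units)],
     [0.0] * hidden_units),
]

sentence_vector = dan_forward([embeddings[t] for t in tokens], layers)
print(sentence_vector)
```

    In a real DAN the embeddings and layer weights are learned jointly on a downstream task; the pretrained models mentioned above ship with those weights already trained.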

  • 2020-12-02 04:32

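    Suppose this is the current sentence:

    import gensim

    # Load pretrained vectors stored in the binary word2vec format.
    model = gensim.models.KeyedVectors.load_word2vec_format('path of your training dataset', binary=True)

    strr = 'i am'
    strr2 = strr.split()
    print(strr2)
    # model[strr2] returns one vector per token (it raises KeyError for unknown tokens).
    word_vectors = model[strr2]
    # Averaging the token vectors gives a single sentence embedding.
    sentence_vector = word_vectors.mean(axis=0)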
    
  • 2020-12-02 04:33

    There are different methods to get sentence vectors:

    1. Doc2Vec: you can train a Doc2Vec model on your dataset and then use its sentence vectors.
    2. Average of Word2Vec vectors: take the average of all the word vectors in a sentence. This average vector represents the sentence.
    3. Average of Word2Vec vectors with TF-IDF: this is one of the best approaches, and the one I recommend. Multiply each word vector by the word's TF-IDF score, then take the average; the result represents the sentence.
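
    Methods 2 and 3 can be sketched in plain Python as follows. The word vectors and the tiny corpus are made-up placeholders, the helper names are hypothetical, and the smoothed IDF formula and the normalisation by the total weight are just one reasonable choice among several; with a real model you would look the vectors up in your trained Word2Vec vocabulary instead.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors standing in for a trained Word2Vec model.
word_vectors = {
    "the": [0.1, 0.0, 0.1],
    "cat": [0.9, 0.2, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "sat": [0.0, 0.7, 0.5],
    "ran": [0.1, 0.6, 0.8],
}

# Toy tokenised corpus used only to estimate document frequencies.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]


def idf_table(documents):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def mean_vector(tokens, vectors):
    """Method 2: plain average of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of the sentence is in the vocabulary
    return [sum(column) / len(known) for column in zip(*known)]


def tfidf_weighted_vector(tokens, vectors, idf):
    """Method 3: average of the token vectors, weighted by TF-IDF."""
    term_freq = Counter(t for t in tokens if t in vectors and t in idf)
    if not term_freq:
        return None
    weights = {t: (count / len(tokens)) * idf[t] for t, count in term_freq.items()}
    total = sum(weights.values())
    dim = len(next(iter(vectors.values())))
    return [sum(w * vectors[t][i] for t, w in weights.items()) / total for i in range(dim)]


sentence = ["the", "cat", "sat"]
idf = idf_table(corpus)
plain = mean_vector(sentence, word_vectors)
weighted = tfidf_weighted_vector(sentence, word_vectors, idf)
print(plain)
print(weighted)
```

    Compared with the plain mean, the weighted version pulls the sentence vector towards rarer words such as "sat" and away from words like "the" that occur in every document.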
  • 2020-12-02 04:35

    I've had good results from:

    1. Summing the word vectors (with tf-idf weighting). This ignores word order, but it is sufficient for many applications (especially for short documents).
    2. FastSent
  • 2020-12-02 04:36

    You can get vector representations of sentences during the training phase (join the test and train sentences in a single file and run the word2vec code obtained from the following link).

    Code for sentence2vec has been shared by Tomas Mikolov here. It assumes the first word of each line is the sentence ID. Compile the code using

    gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops
    

    and run it using

    ./word2vec -train alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
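
    Since the tool expects the first word of every line to be a sentence ID, the input file has to be prepared accordingly. Below is a small sketch of that step and of reading the sentence vectors back from the text output. The "_*" ID prefix, the helper names, and the sample output text are assumptions made for illustration; the only format relied on is the usual word2vec text layout (a header line with the count and dimension, then one "token v1 v2 ..." line per entry).

```python
import tempfile
from pathlib import Path


def write_with_sentence_ids(sentences, path, prefix="_*"):
    """Write one sentence per line, each preceded by a unique ID token."""
    lines = [f"{prefix}{i} {sentence.strip()}" for i, sentence in enumerate(sentences)]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def read_sentence_vectors(text, prefix="_*"):
    """Collect the vectors whose token is a sentence ID from word2vec text output."""
    vectors = {}
    for line in text.splitlines()[1:]:  # skip the "count dimension" header line
        parts = line.split()
        if parts and parts[0].startswith(prefix):
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


sentences = ["the cat sat on the mat", "the dog ran in the park"]
input_path = Path(tempfile.gettempdir()) / "alldata-id.txt"
lines = write_with_sentence_ids(sentences, input_path)
print(lines)

# Made-up sample of what a text-format vectors file could look like.
sample_output = "4 3\nthe 0.1 0.2 0.3\n_*0 0.5 0.1 0.0\ncat 0.3 0.3 0.3\n_*1 0.4 0.2 0.9\n"
sentence_vectors = read_sentence_vectors(sample_output)
print(sentence_vectors)
```

    Any prefix works as long as it cannot collide with a real word in the corpus, because the IDs end up in the same output file as ordinary tokens.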
    

    EDIT

    Gensim (development version) seems to have a method to infer vectors of new sentences. Check out the model.infer_vector(NewDocument) method in https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py

  • 2020-12-02 04:37

    Google's Universal Sentence Encoder embeddings are an updated solution to this problem. They don't use Word2vec, but they are a competitive alternative.

    Here is a walk-through with TFHub and Keras.
