What does tf.nn.embedding_lookup function do?

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)

I cannot understand what this function does. Is it like a lookup table?

8 Answers
  •  再見小時候
    2020-12-02 04:43

    Yes, the purpose of the tf.nn.embedding_lookup() function is to perform a lookup in the embedding matrix and return the embeddings (in simple terms, the vector representations) of words.

    A simple embedding matrix (of shape vocabulary_size x embedding_dimension) would look like the one below, i.e. each word is represented by a vector of numbers (hence the name word2vec).


    Embedding Matrix

    the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862
    like 0.36808 0.20834 -0.22319 0.046283 0.20098 0.27515 -0.77127 -0.76804
    between 0.7503 0.71623 -0.27033 0.20059 -0.17008 0.68568 -0.061672 -0.054638
    did 0.042523 -0.21172 0.044739 -0.19248 0.26224 0.0043991 -0.88195 0.55184
    just 0.17698 0.065221 0.28548 -0.4243 0.7499 -0.14892 -0.66786 0.11788
    national -1.1105 0.94945 -0.17078 0.93037 -0.2477 -0.70633 -0.8649 -0.56118
    day 0.11626 0.53897 -0.39514 -0.26027 0.57706 -0.79198 -0.88374 0.30119
    country -0.13531 0.15485 -0.07309 0.034013 -0.054457 -0.20541 -0.60086 -0.22407
    under 0.13721 -0.295 -0.05916 -0.59235 0.02301 0.21884 -0.34254 -0.70213
    such 0.61012 0.33512 -0.53499 0.36139 -0.39866 0.70627 -0.18699 -0.77246
    second -0.29809 0.28069 0.087102 0.54455 0.70003 0.44778 -0.72565 0.62309 
    

    I split the above embedding matrix: the words go into vocab (our vocabulary) and the corresponding vectors into the emb array.

    import numpy as np

    vocab = ['the','like','between','did','just','national','day','country','under','such','second']
    
    emb = np.array([[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862],
       [0.36808, 0.20834, -0.22319, 0.046283, 0.20098, 0.27515, -0.77127, -0.76804],
       [0.7503, 0.71623, -0.27033, 0.20059, -0.17008, 0.68568, -0.061672, -0.054638],
       [0.042523, -0.21172, 0.044739, -0.19248, 0.26224, 0.0043991, -0.88195, 0.55184],
       [0.17698, 0.065221, 0.28548, -0.4243, 0.7499, -0.14892, -0.66786, 0.11788],
       [-1.1105, 0.94945, -0.17078, 0.93037, -0.2477, -0.70633, -0.8649, -0.56118],
       [0.11626, 0.53897, -0.39514, -0.26027, 0.57706, -0.79198, -0.88374, 0.30119],
       [-0.13531, 0.15485, -0.07309, 0.034013, -0.054457, -0.20541, -0.60086, -0.22407],
       [ 0.13721, -0.295, -0.05916, -0.59235, 0.02301, 0.21884, -0.34254, -0.70213],
       [ 0.61012, 0.33512, -0.53499, 0.36139, -0.39866, 0.70627, -0.18699, -0.77246 ],
       [ -0.29809, 0.28069, 0.087102, 0.54455, 0.70003, 0.44778, -0.72565, 0.62309 ]])
    
    
    emb.shape
    # (11, 8)
    

    Embedding Lookup in TensorFlow

    Now we will see how we can perform an embedding lookup for an arbitrary input sentence.

    In [53]: import tensorflow as tf        # TF 1.x
             sess = tf.InteractiveSession() # needed so that .eval() works below

    In [54]: from collections import OrderedDict
    
    # embedding as TF tensor (for now constant; could be tf.Variable() during training)
    In [55]: tf_embedding = tf.constant(emb, dtype=tf.float32)
    
    # input for which we need the embedding
    In [56]: input_str = "like the country"
    
    # build an index (word -> position in `vocab`) for the input words
    In [57]: word_to_idx = OrderedDict((w, vocab.index(w)) for w in input_str.split() if w in vocab)
    
    # lookup in embedding matrix & return the vectors for the input words
    In [58]: tf.nn.embedding_lookup(tf_embedding, list(word_to_idx.values())).eval()
    Out[58]: 
    array([[ 0.36807999,  0.20834   , -0.22318999,  0.046283  ,  0.20097999,
             0.27515   , -0.77126998, -0.76804   ],
           [ 0.41800001,  0.24968   , -0.41242   ,  0.1217    ,  0.34527001,
            -0.044457  , -0.49687999, -0.17862   ],
           [-0.13530999,  0.15485001, -0.07309   ,  0.034013  , -0.054457  ,
            -0.20541   , -0.60086   , -0.22407   ]], dtype=float32)
    

    Observe how we got the embeddings from our original embedding matrix (with words) using the indices of words in our vocabulary.
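    To make the mechanics concrete, here is a small sketch (reusing the `emb`, `tf_embedding` and `word_to_idx` objects defined above) showing that the lookup is just row selection by index; plain NumPy fancy indexing and tf.gather pick out the same rows that tf.nn.embedding_lookup returned:

    ids = list(word_to_idx.values())     # [1, 0, 7] for "like the country"

    emb[ids]                             # NumPy: rows 1, 0 and 7 of the matrix
    tf.gather(tf_embedding, ids).eval()  # the same rows, via tf.gather
    # Both match the tf.nn.embedding_lookup() output above (up to float32 precision).
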

    Usually, such an embedding lookup is performed by the first layer (called the Embedding layer), which then passes these embeddings to RNN/LSTM/GRU layers for further processing.
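
    If you use Keras (available as tf.keras), that pattern is typically written as an Embedding layer feeding an LSTM. A rough sketch, where vocab_size=11 matches the toy vocabulary above and the other sizes are picked arbitrarily for illustration:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=11, output_dim=8),  # lookup: (batch, seq_len) -> (batch, seq_len, 8)
        tf.keras.layers.LSTM(16),                               # (batch, seq_len, 8) -> (batch, 16)
        tf.keras.layers.Dense(1, activation='sigmoid'),         # e.g. a binary classification head
    ])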


    Side note: usually the vocabulary will also contain a special unk token. If a token from our input sentence is not present in the vocabulary, the index corresponding to unk is looked up in the embedding matrix instead.
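
    A minimal sketch of that fallback, assuming (hypothetically) that an 'unk' entry has been appended to vocab at index 11 and a matching row added to the embedding matrix:

    UNK_IDX = 11  # hypothetical index of the 'unk' row

    def to_ids(sentence, vocab, unk_idx=UNK_IDX):
        """Map each word to its vocabulary index, falling back to unk_idx for OOV words."""
        return [vocab.index(w) if w in vocab else unk_idx for w in sentence.split()]

    to_ids("like the moon", vocab)   # 'moon' is out of vocabulary -> [1, 0, 11]
    # tf.nn.embedding_lookup() on the extended matrix would then return the 'unk' vector for 'moon'.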


    P.S. Note that embedding_dimension is a hyperparameter that one has to tune for their application, but popular models like Word2Vec and GloVe use 300-dimensional vectors to represent each word.
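
    For instance, when training your own embeddings you would create the matrix as a trainable variable with whatever embedding_dimension you chose; a TF 1.x-style sketch with arbitrary sizes:

    vocab_size = 10000          # assumed vocabulary size
    embedding_dimension = 300   # the hyperparameter discussed above

    # Trainable embedding matrix; embedding_lookup selects rows from it, and
    # gradients flow back only into the rows that were looked up.
    embeddings = tf.Variable(
        tf.random_uniform([vocab_size, embedding_dimension], -1.0, 1.0))

    ids = tf.placeholder(tf.int32, shape=[None])        # word indices for a batch
    embedded = tf.nn.embedding_lookup(embeddings, ids)  # shape: (batch, 300)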

    Bonus reading: the word2vec skip-gram model
