Embedding in pytorch

清歌不尽 2021-01-30 08:38

I have checked the PyTorch tutorial and questions similar to this one on Stack Overflow.

I am confused: does the embedding in PyTorch (nn.Embedding) make similar words closer to each other? Or is it just a lookup table over which I still need to build my own model?

4 Answers
  •  感动是毒
    2021-01-30 09:29

    torch.nn.Embedding just creates a lookup table that returns the word embedding for a given word index.

    from collections import Counter
    import torch
    import torch.nn as nn
    
    # Let's say you have 2 sentences (lowercased, punctuation removed):
    sentences = "i am new to pytorch i am having fun"
    
    words = sentences.split(' ')
        
    vocab = Counter(words) # create a dictionary
    vocab = sorted(vocab, key=vocab.get, reverse=True)
    vocab_size = len(vocab)
    
    # map words to unique indices
    word2idx = {word: ind for ind, word in enumerate(vocab)} 
    
    # word2idx = {'i': 0, 'am': 1, 'new': 2, 'to': 3, 'pytorch': 4, 'having': 5, 'fun': 6}
    
    encoded_sentences = [word2idx[word] for word in words]
    
    # encoded_sentences = [0, 1, 2, 3, 4, 0, 1, 5, 6]
    
    # let's say you want embedding dimension to be 3
    emb_dim = 3 
    

    Now, the embedding layer can be initialized as:

    emb_layer = nn.Embedding(vocab_size, emb_dim)
    word_vectors = emb_layer(torch.LongTensor(encoded_sentences))
    

    This initializes the embeddings from a standard normal distribution (that is, zero mean and unit variance). Thus, these word vectors start out with no sense of 'relatedness'.

    word_vectors is a torch tensor of size (9, 3), since there are 9 words in our data.
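
    As a quick sanity check (a minimal sketch continuing the snippet above; the printed shapes are only illustrative), you can confirm that the layer is literally an index into its weight matrix:

    # the output for an index is just the corresponding row of the weight matrix
    idx = torch.LongTensor([word2idx['pytorch']])
    assert torch.equal(emb_layer(idx)[0], emb_layer.weight[word2idx['pytorch']])

    # weights start out as N(0, 1) samples, so any similarity between rows is accidental
    print(emb_layer.weight.shape)  # torch.Size([7, 3]) -- one row per vocabulary word
    print(word_vectors.shape)      # torch.Size([9, 3]) -- one row per token in the data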

    emb_layer has one trainable parameter called weight, which is, by default, set to be trained. You can check it by:

    emb_layer.weight.requires_grad
    

    which returns True. If you don't want to train your embeddings during model training (say, when you are using pre-trained embeddings), you can set it to False by:

    emb_layer.weight.requires_grad = False
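
    A common follow-up (a minimal sketch with a hypothetical model, just for illustration) is to hand the optimizer only the parameters that still require gradients, so the frozen embedding weights are skipped:

    # hypothetical tiny model: a frozen embedding followed by a trainable linear head
    model = nn.Sequential(nn.Embedding(vocab_size, emb_dim), nn.Linear(emb_dim, 2))
    model[0].weight.requires_grad = False

    # only parameters with requires_grad=True are handed to the optimizer
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable_params, lr=0.1)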
    

    If your vocabulary size is 10,000 and you wish to initialize the embeddings with pre-trained vectors, say, from Word2Vec, do it as follows (assuming emb_layer was created as nn.Embedding(10000, 300)):

    emb_layer.load_state_dict({'weight': torch.from_numpy(emb_mat)})
    

    Here, emb_mat is a NumPy matrix of size (10000, 300) containing the 300-dimensional Word2Vec vector for each of the 10,000 words in your vocabulary.

    Now, the embedding layer is loaded with Word2Vec word representations.
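
    An equivalent shortcut is nn.Embedding.from_pretrained, which builds the layer directly from the pre-trained matrix and can freeze it in the same call. A minimal sketch, with a random placeholder standing in for your real (10000, 300) Word2Vec matrix:

    import numpy as np

    # placeholder for the real pre-trained matrix (only so this sketch runs)
    emb_mat = np.random.rand(10000, 300).astype('float32')

    # freeze=True sets weight.requires_grad to False
    emb_layer = nn.Embedding.from_pretrained(torch.from_numpy(emb_mat), freeze=True)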
