I have checked the PyTorch tutorial and similar questions on Stack Overflow.
I get confused: does the embedding in PyTorch (nn.Embedding) make similar words closer to each other, or is it just a lookup table?
torch.nn.Embedding just creates a lookup table that returns the word embedding for a given word index.
from collections import Counter
import torch
import torch.nn as nn
# Let's say you have 2 sentences (lowercased, punctuation removed):
sentences = "i am new to pytorch i am having fun"
words = sentences.split(' ')
vocab = Counter(words)  # count word frequencies
vocab = sorted(vocab, key=vocab.get, reverse=True)  # most frequent words first
vocab_size = len(vocab)
# map words to unique indices
word2idx = {word: ind for ind, word in enumerate(vocab)}
# word2idx = {'i': 0, 'am': 1, 'new': 2, 'to': 3, 'pytorch': 4, 'having': 5, 'fun': 6}
encoded_sentences = [word2idx[word] for word in words]
# encoded_sentences = [0, 1, 2, 3, 4, 0, 1, 5, 6]
# let's say you want the embedding dimension to be 3
emb_dim = 3
Now, the embedding layer can be initialized as:
emb_layer = nn.Embedding(vocab_size, emb_dim)  # weight matrix of shape (vocab_size, emb_dim)
word_vectors = emb_layer(torch.LongTensor(encoded_sentences))  # one vector per token
This initializes the embeddings from a standard normal distribution (that is, zero mean and unit variance). Thus, these word vectors don't have any sense of 'relatedness' yet.
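If you want to convince yourself of this, you can inspect the weight statistics on a larger layer; a quick sketch (the exact numbers will vary from run to run):
# the default init draws each weight from N(0, 1)
big_layer = nn.Embedding(10000, 300)
print(big_layer.weight.data.mean())  # close to 0
print(big_layer.weight.data.std())   # close to 1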
word_vectors is a torch tensor of size (9, 3), since there are 9 words in our data and each one maps to a 3-dimensional vector.
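Since the layer really is just a table lookup, calling it is equivalent to indexing its weight matrix by row; a quick check:
indices = torch.LongTensor(encoded_sentences)
print(word_vectors.shape)  # torch.Size([9, 3])
# calling the layer == fancy-indexing its weight matrix
print(torch.equal(word_vectors, emb_layer.weight[indices]))  # True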
emb_layer has one trainable parameter called weight, which is, by default, set to be trained. You can check this with:
emb_layer.weight.requires_grad
which returns True. If you don't want to train your embeddings during model training (say, when you are using pre-trained embeddings), you can freeze them with:
emb_layer.weight.requires_grad = False
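When the frozen embedding lives inside a larger model, a common follow-up is to hand the optimizer only the parameters that still require gradients. A minimal sketch (the TinyClassifier model here is hypothetical, just for illustration):
import torch.optim as optim

class TinyClassifier(nn.Module):  # hypothetical model for illustration
    def __init__(self, vocab_size, emb_dim, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.emb.weight.requires_grad = False  # frozen embeddings
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, x):
        return self.fc(self.emb(x))

model = TinyClassifier(vocab_size, emb_dim, num_classes=2)
# pass only the still-trainable parameters to the optimizer
optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.1)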
If your vocabulary size is 10,000 and you wish to initialize the embeddings with pre-trained vectors, say, from Word2Vec, do it as:
emb_layer = nn.Embedding(10000, 300)  # shape must match the pre-trained matrix
emb_layer.load_state_dict({'weight': torch.from_numpy(emb_mat)})
Here, emb_mat is a NumPy matrix of size (10000, 300) containing a 300-dimensional Word2Vec vector for each of the 10,000 words in your vocabulary.
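If you're wondering how to build such a matrix, here is one way, assuming gensim and a local copy of the Google News vectors (the file path is just an example, and word2idx is assumed to map your 10,000 vocabulary words to indices as above); out-of-vocabulary words are left as zeros:
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
emb_mat = np.zeros((10000, 300), dtype=np.float32)
for word, ind in word2idx.items():
    if word in kv:  # skip out-of-vocabulary words
        emb_mat[ind] = kv[word]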
Now, the embedding layer is loaded with Word2Vec word representations.
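As a side note, newer PyTorch versions (0.4+) also offer nn.Embedding.from_pretrained, which builds and fills the layer in one step:
emb_layer = nn.Embedding.from_pretrained(
    torch.from_numpy(emb_mat).float(),  # (10000, 300) weight matrix
    freeze=True,  # same effect as weight.requires_grad = False
)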