I have checked the PyTorch tutorial and similar questions on Stack Overflow.
I get confused: does the embedding in PyTorch (nn.Embedding) make similar words closer to each other, or is it just a lookup table?
torch.nn.Embedding just creates a lookup table that returns the word embedding for a given word index.
from collections import Counter
import torch
import torch.nn as nn
# Let's say you have 2 sentences (lowercased, punctuation removed):
sentences = "i am new to pytorch i am having fun"
words = sentences.split(' ')
vocab = Counter(words)  # count word frequencies
vocab = sorted(vocab, key=vocab.get, reverse=True)  # most frequent words first
vocab_size = len(vocab)
# map words to unique indices
word2idx = {word: ind for ind, word in enumerate(vocab)}
# word2idx = {'i': 0, 'am': 1, 'new': 2, 'to': 3, 'pytorch': 4, 'having': 5, 'fun': 6}
encoded_sentences = [word2idx[word] for word in words]
# encoded_sentences = [0, 1, 2, 3, 4, 0, 1, 5, 6]
# let's say you want the embedding dimension to be 3
emb_dim = 3
Now, the embedding layer can be initialized as:
emb_layer = nn.Embedding(vocab_size, emb_dim)  # weight matrix of shape (vocab_size, emb_dim)
word_vectors = emb_layer(torch.LongTensor(encoded_sentences))  # one vector per token
This initializes the embeddings from a standard normal distribution (that is, zero mean and unit variance). Thus, these word vectors don't have any sense of 'relatedness' yet.
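If you want to convince yourself of this, you can inspect the weight statistics on a larger layer; a quick sketch (the exact numbers will vary from run to run):
# the default init draws each weight from N(0, 1)
big_layer = nn.Embedding(10000, 300)
print(big_layer.weight.data.mean())  # close to 0
print(big_layer.weight.data.std())   # close to 1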
word_vectors is a torch tensor of size (9, 3), since there are 9 words in our data and each one maps to a 3-dimensional vector.
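Since the layer really is just a table lookup, calling it is equivalent to indexing its weight matrix by row; a quick check:
indices = torch.LongTensor(encoded_sentences)
print(word_vectors.shape)  # torch.Size([9, 3])
# calling the layer == fancy-indexing its weight matrix
print(torch.equal(word_vectors, emb_layer.weight[indices]))  # True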
emb_layer has one trainable parameter called weight, which is, by default, set to be trained. You can check this with:
emb_layer.weight.requires_grad
which returns True. If you don't want to train your embeddings during model training (say, when you are using pre-trained embeddings), you can freeze them with:
emb_layer.weight.requires_grad = False
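When the frozen embedding lives inside a larger model, a common follow-up is to hand the optimizer only the parameters that still require gradients. A minimal sketch (the TinyClassifier model here is hypothetical, just for illustration):
import torch.optim as optim

class TinyClassifier(nn.Module):  # hypothetical model for illustration
    def __init__(self, vocab_size, emb_dim, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.emb.weight.requires_grad = False  # frozen embeddings
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, x):
        return self.fc(self.emb(x))

model = TinyClassifier(vocab_size, emb_dim, num_classes=2)
# pass only the still-trainable parameters to the optimizer
optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.1)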
If your vocabulary size is 10,000 and you wish to initialize the embeddings with pre-trained vectors, say, from Word2Vec, do it as:
emb_layer = nn.Embedding(10000, 300)  # shape must match the pre-trained matrix
emb_layer.load_state_dict({'weight': torch.from_numpy(emb_mat)})
Here, emb_mat is a NumPy matrix of size (10000, 300) containing a 300-dimensional Word2Vec vector for each of the 10,000 words in your vocabulary.
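If you're wondering how to build such a matrix, here is one way, assuming gensim and a local copy of the Google News vectors (the file path is just an example, and word2idx is assumed to map your 10,000 vocabulary words to indices as above); out-of-vocabulary words are left as zeros:
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
emb_mat = np.zeros((10000, 300), dtype=np.float32)
for word, ind in word2idx.items():
    if word in kv:  # skip out-of-vocabulary words
        emb_mat[ind] = kv[word]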
Now, the embedding layer is loaded with Word2Vec word representations.
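As a side note, newer PyTorch versions (0.4+) also offer nn.Embedding.from_pretrained, which builds and fills the layer in one step:
emb_layer = nn.Embedding.from_pretrained(
    torch.from_numpy(emb_mat).float(),  # (10000, 300) weight matrix
    freeze=True,  # same effect as weight.requires_grad = False
)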