I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.
So my question is, how do I get the embedding weights loaded by gensim into the PyTorch embedding layer.
Thanks in Advance!
I just wanted to report my findings about loading a gensim embedding with PyTorch.
Solution for PyTorch
0.4.0
and newer:
From v0.4.0
there is a new function from_pretrained()
which makes loading an embedding very comfortable.
Here is an example from the documentation.
>> # FloatTensor containing pretrained weights
>> weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
>> embedding = nn.Embedding.from_pretrained(weight)
>> # Get embeddings for index 1
>> input = torch.LongTensor([1])
>> embedding(input)
The weights from gensim can easily be obtained by:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors) # formerly syn0, which is soon deprecated
Solution for PyTorch version
0.3.1
and older:
I'm using version 0.3.1
and from_pretrained()
isn't available in this version.
Therefore I created my own from_pretrained
so I can also use it with 0.3.1
.
Code for from_pretrained
for PyTorch versions 0.3.1
or lower:
def from_pretrained(embeddings, freeze=True):
assert embeddings.dim() == 2, \
'Embeddings parameter is expected to be 2-dimensional'
rows, cols = embeddings.shape
embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
embedding.weight = torch.nn.Parameter(embeddings)
embedding.weight.requires_grad = not freeze
return embedding
The embedding can be loaded then just like this:
embedding = from_pretrained(weights)
I hope this is helpful for someone.
I think it is easy. Just copy the embedding weight from gensim to the corresponding weight in PyTorch embedding layer.
You need to make sure two things are correct: first is that the weight shape has to be correct, second is that the weight has to be converted to PyTorch FloatTensor type.
I had the same question except that I use torchtext library with pytorch as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):
import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab
# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)
# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])
# build vocabulary
text_field.build_vocab(dataset)
# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)
# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
...
embedding(batch.text)
from gensim.models import Word2Vec
model = Word2Vec(reviews,size=100, window=5, min_count=5, workers=4)
#gensim model created
import torch
weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
I had quite some problems in understanding the documentation myself and there aren't that many good examples around. Hopefully this example helps other people. It is a simple classifier, that takes the pretrained embeddings in the matrix_embeddings
. By setting requires_grad
to false we make sure that we are not changing them.
class InferClassifier(nn.Module):
def __init__(self, input_dim, n_classes, matrix_embeddings):
"""initializes a 2 layer MLP for classification.
There are no non-linearities in the original code, Katia instructed us
to use tanh instead"""
super(InferClassifier, self).__init__()
#dimensionalities
self.input_dim = input_dim
self.n_classes = n_classes
self.hidden_dim = 512
#embedding
self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
self.embeddings.requires_grad = False
#creates a MLP
self.classifier = nn.Sequential(
nn.Linear(self.input_dim, self.hidden_dim),
nn.Tanh(), #not present in the original code.
nn.Linear(self.hidden_dim, self.n_classes))
def forward(self, sentence):
"""forward pass of the classifier
I am not sure it is necessary to make this explicit."""
#get the embeddings for the inputs
u = self.embeddings(sentence)
#forward to the classifier
return self.classifier(x)
sentence
is a vector with the indexes of matrix_embeddings
instead of words.
来源:https://stackoverflow.com/questions/49710537/pytorch-gensim-how-to-load-pre-trained-word-embeddings