PyTorch / Gensim - How to load pre-trained word embeddings

甜味超标 2020-12-01 02:44

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

So my question is: how do I get the embedding weights loaded by gensim into the PyTorch embedding layer?

6 answers
  • 2020-12-01 03:19

    I had the same question except that I use torchtext library with pytorch as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):

    import torch
    import torch.nn as nn
    import torchtext.data as data
    import torchtext.vocab as vocab
    
    # use torchtext to define the dataset field containing text
    text_field = data.Field(sequential=True)
    
    # load your dataset using torchtext, e.g.
    dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])
    
    # build vocabulary
    text_field.build_vocab(dataset)
    
    # I use embeddings created with
    # model = gensim.models.Word2Vec(...)
    # model.wv.save_word2vec_format(path_to_embeddings_file)
    
    # load embeddings using torchtext
    vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
    text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
    
    # when defining your network you can then use the method mentioned by blue-phoenox
    embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
    
    # pass data to the layer
    dataset_iter = data.Iterator(dataset, ...)
    for batch in dataset_iter:
        ...
        embedding(batch.text)
    
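    A quick sanity check (a sketch; 'good' is a hypothetical token assumed to be in your dataset's vocabulary): the row the layer returns for an index should match the vector stored in the torchtext vocab.

    idx = text_field.vocab.stoi['good']                      # index assigned by the vocab
    word_vec = embedding(torch.LongTensor([idx]))            # lookup through the PyTorch layer
    assert torch.equal(word_vec.squeeze(0), text_field.vocab.vectors[idx])
    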
  • 2020-12-01 03:20

    I had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"

    I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.

    import os
    from os.path import basename
    
    import torch
    import torchtext.vocab as vocab
    from gensim.models import KeyedVectors
    
    def convert_bin_emb_txt(out_path, emb_file):
        txt_name = basename(emb_file).split(".")[0] +".txt"
        emb_txt_file = os.path.join(out_path,txt_name)
        emb_model = KeyedVectors.load_word2vec_format(emb_file,binary=True)
        emb_model.save_word2vec_format(emb_txt_file,binary=False)
        return emb_txt_file
    
    emb_txt_file = convert_bin_emb_txt(out_path,emb_bin_file)
    custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                      cache='custom_embeddings',
                                      unk_init=torch.Tensor.normal_)
    
    TEXT.build_vocab(train_data,
                     max_size=MAX_VOCAB_SIZE,
                     vectors=custom_embeddings,
                     unk_init=torch.Tensor.normal_)
    

    Tested with PyTorch 1.2.0 and TorchText 0.4.0.

    I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not present in the embeddings using the normal distribution, and make the <unk> and <pad> vectors equal to zero.
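    Once build_vocab has attached the custom vectors, they can be handed to the embedding layer in the same way as in the other answers; a minimal sketch:

    import torch.nn as nn
    
    embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors)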

  • 2020-12-01 03:31

    I had quite a few problems understanding the documentation myself, and there aren't that many good examples around. Hopefully this example helps other people. It is a simple classifier that takes the pre-trained embeddings in matrix_embeddings. By setting requires_grad to False we make sure that we are not changing them.

    import torch
    import torch.nn as nn
    
    class InferClassifier(nn.Module):
      def __init__(self, input_dim, n_classes, matrix_embeddings):
        """initializes a 2 layer MLP for classification.
        There are no non-linearities in the original code, Katia instructed us 
        to use tanh instead"""
    
        super(InferClassifier, self).__init__()
    
        #dimensionalities
        self.input_dim = input_dim
        self.n_classes = n_classes
        self.hidden_dim = 512
    
        #embedding (from_pretrained already freezes the weights by default)
        self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
        self.embeddings.weight.requires_grad = False
    
        #creates a MLP
        self.classifier = nn.Sequential(
                nn.Linear(self.input_dim, self.hidden_dim),
                nn.Tanh(), #not present in the original code.
                nn.Linear(self.hidden_dim, self.n_classes))
    
      def forward(self, sentence):
        """forward pass of the classifier
        I am not sure it is necessary to make this explicit."""
    
        #get the embeddings for the inputs
        u = self.embeddings(sentence)
    
        #forward to the classifier
        return self.classifier(u)
    

    sentence is a tensor of indexes into matrix_embeddings rather than the words themselves.
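
    For completeness, a hypothetical usage sketch (the embedding file path and the number of classes are placeholders; input_dim has to equal the embedding dimension for the shapes to line up):

    import gensim
    
    model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
    matrix_embeddings = torch.FloatTensor(model.vectors)
    
    clf = InferClassifier(input_dim=matrix_embeddings.size(1),
                          n_classes=3,
                          matrix_embeddings=matrix_embeddings)
    
    sentence = torch.LongTensor([5, 23, 1])  # word indexes
    logits = clf(sentence)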

  • 2020-12-01 03:32
    from gensim.models import Word2Vec
    
    # reviews: your tokenized corpus (a list of lists of tokens)
    model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)
    # gensim model created
    
    import torch
    import torch.nn as nn
    
    weights = torch.FloatTensor(model.wv.vectors)
    embedding = nn.Embedding.from_pretrained(weights)
    
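    To look up the row a word maps to, use the index gensim assigned to it ('good' is a hypothetical token from your corpus). With gensim 3.x, as in the snippet above, the index lives in model.wv.vocab; gensim 4.x exposes model.wv.key_to_index instead:

    idx = model.wv.vocab['good'].index   # gensim 4.x: model.wv.key_to_index['good']
    vector = embedding(torch.LongTensor([idx]))
    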
  • 2020-12-01 03:34

    I think it is easy. Just copy the embedding weight from gensim to the corresponding weight in PyTorch embedding layer.

    You need to make sure two things are correct: first, the weight matrix must have the right shape; second, the weights must be converted to PyTorch's FloatTensor type.
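
    A minimal sketch of that idea (the word2vec file path is a placeholder; nn.Embedding.from_pretrained, shown in the other answers, does the same thing in one call):

    import torch
    import torch.nn as nn
    import gensim
    
    model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
    weights = torch.FloatTensor(model.vectors)   # shape: (vocab_size, embedding_dim)
    
    embedding = nn.Embedding(*weights.shape)     # layer with a matching shape
    embedding.weight.data.copy_(weights)         # copy the gensim weights in place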

  • 2020-12-01 03:36

    I just wanted to report my findings about loading a gensim embedding with PyTorch.


    • Solution for PyTorch 0.4.0 and newer:

    From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation.

    import torch
    import torch.nn as nn
    
    # FloatTensor containing pretrained weights
    weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
    embedding = nn.Embedding.from_pretrained(weight)
    # Get embeddings for index 1
    input = torch.LongTensor([1])
    embedding(input)
    
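    By default from_pretrained() freezes the weights. If you want to fine-tune them during training, pass freeze=False:

    embedding = nn.Embedding.from_pretrained(weight, freeze=False)
    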

    The weights from gensim can easily be obtained by:

    import gensim
    import torch
    
    model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
    weights = torch.FloatTensor(model.vectors)  # formerly syn0, which is soon deprecated
    

    As noted by @Guglie: in newer gensim versions the weights can be obtained from model.wv:

    weights = torch.FloatTensor(model.wv.vectors)
    

    • Solution for PyTorch version 0.3.1 and older:

    I'm using version 0.3.1 and from_pretrained() isn't available in this version.

    Therefore I created my own from_pretrained so I can also use it with 0.3.1.

    Code for from_pretrained for PyTorch versions 0.3.1 or lower:

    def from_pretrained(embeddings, freeze=True):
        assert embeddings.dim() == 2, \
             'Embeddings parameter is expected to be 2-dimensional'
        rows, cols = embeddings.shape
        embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
        embedding.weight = torch.nn.Parameter(embeddings)
        embedding.weight.requires_grad = not freeze
        return embedding
    

    The embedding can be loaded then just like this:

    embedding = from_pretrained(weights)
    

    I hope this is helpful for someone.
