PyTorch / Gensim - How to load pre-trained word embeddings

甜味超标 2020-12-01 02:44

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

So my question is, how do I get the embedding weights loaded by gensim into the PyTorch embedding layer?
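For reference, the core mechanic is `nn.Embedding.from_pretrained`, which wraps an existing weight matrix in an embedding layer. A minimal sketch, with a hand-made tensor standing in for the matrix exported from gensim:

```python
import torch
import torch.nn as nn

# stand-in for a (vocab_size, dim) weight matrix exported from gensim
weights = torch.FloatTensor([[0.1, 0.2, 0.3],
                             [0.4, 0.5, 0.6],
                             [0.7, 0.8, 0.9]])

# wrap the matrix in an embedding layer; freeze=True (the default)
# keeps the pre-trained vectors fixed during training
embedding = nn.Embedding.from_pretrained(weights)

# looking up index 1 returns the second row of the matrix
vec = embedding(torch.tensor([1]))
```

Pass `freeze=False` instead if you want the embeddings fine-tuned along with the rest of the network.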

6 Answers
  •  自闭症患者
    2020-12-01 03:19

    I had the same question except that I use torchtext library with pytorch as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):

    import torch
    import torch.nn as nn
    import torchtext.data as data
    import torchtext.vocab as vocab
    
    # use torchtext to define the dataset field containing text
    text_field = data.Field(sequential=True)
    
    # load your dataset using torchtext, e.g.
    dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])
    
    # build vocabulary
    text_field.build_vocab(dataset)
    
    # I use embeddings created with
    # model = gensim.models.Word2Vec(...)
    # model.wv.save_word2vec_format(path_to_embeddings_file)
    
    # load embeddings using torchtext
    vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
    text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
    
    # when defining your network you can then use the method mentioned by blue-phoenox
    embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
    
    # pass data to the layer
    dataset_iter = data.Iterator(dataset, ...)
    for batch in dataset_iter:
        ...
        embedding(batch.text)
    
