I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.
So my question is, how do I get the embedding weights loaded by gensim into a PyTorch embedding layer?
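Roughly, this is what I have in mind (just a sketch, untested; I am assuming the weight matrix is exposed as KeyedVectors.vectors and "word2vec.bin" stands in for my file):

import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# Load the pre-trained binary word2vec file with gensim.
model = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# The full embedding matrix is a NumPy array on the KeyedVectors object
# (.vectors in recent gensim versions, .syn0 in older ones).
weights = torch.FloatTensor(model.vectors)

# Copy the matrix into an embedding layer; freeze=False keeps it trainable.
embedding = nn.Embedding.from_pretrained(weights, freeze=False)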
I had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"
I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.
import os
from os.path import basename

from gensim.models import KeyedVectors

def convert_bin_emb_txt(out_path, emb_file):
    # Convert a binary word2vec file to text format and return the new path.
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file
import torch
from torchtext import vocab

# out_path and emb_bin_file are the output directory and the binary
# embedding file from above.
emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)

# Words missing from the embedding file are initialized from a normal
# distribution via unk_init.
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)
Tested with PyTorch 1.2.0 and TorchText 0.4.0.
I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not in the embeddings using the normal distribution, and set the <unk> and <pad> vectors to zero.
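For completeness, here is a sketch of that last step the way the tutorial does it (model and EMBEDDING_DIM are placeholders for the tutorial's model definition, not part of the code above):

import torch

# Copy the vectors that build_vocab attached to TEXT into the model's
# embedding layer.
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

# Zero the <unk> and <pad> rows so they carry no signal at the start.
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)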