I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.
So my question is, how do I get the embedding weights loaded by gensim into a PyTorch embedding layer?
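Roughly, this is what I have in mind (just a sketch, untested; I am assuming the weight matrix is exposed as KeyedVectors.vectors and "word2vec.bin" stands in for my file):

import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# Load the pre-trained binary word2vec file with gensim.
model = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# The full embedding matrix is a NumPy array on the KeyedVectors object
# (.vectors in recent gensim versions, .syn0 in older ones).
weights = torch.FloatTensor(model.vectors)

# Copy the matrix into an embedding layer; freeze=False keeps it trainable.
embedding = nn.Embedding.from_pretrained(weights, freeze=False)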
I had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"
I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.
import os
from os.path import basename

from gensim.models import KeyedVectors

def convert_bin_emb_txt(out_path, emb_file):
    # Convert a binary word2vec file to text format and return the new path.
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file
import torch
from torchtext import vocab

# out_path and emb_bin_file are the output directory and the binary
# embedding file from above.
emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)

# Words missing from the embedding file are initialized from a normal
# distribution via unk_init.
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)
Tested with PyTorch 1.2.0 and TorchText 0.4.0.
I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not in the embeddings using the normal distribution, and set the <unk> and <pad> vectors to zero.
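For completeness, here is a sketch of that last step the way the tutorial does it (model and EMBEDDING_DIM are placeholders for the tutorial's model definition, not part of the code above):

import torch

# Copy the vectors that build_vocab attached to TEXT into the model's
# embedding layer.
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

# Zero the <unk> and <pad> rows so they carry no signal at the start.
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)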