PyTorch / Gensim - How to load pre-trained word embeddings

甜味超标 2020-12-01 02:44

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

So my question is, how do I get the embedding weights loaded by gensim into the PyTorch embedding layer?
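
For reference, a minimal sketch of the direct route (the file path is a placeholder, and the snippet assumes a word2vec file in the standard binary format): gensim exposes the trained weight matrix as a NumPy array, which nn.Embedding.from_pretrained can wrap directly.

    import torch
    import torch.nn as nn
    from gensim.models import KeyedVectors

    # 'embeddings.bin' is a placeholder path to a word2vec file in binary format.
    kv = KeyedVectors.load_word2vec_format('embeddings.bin', binary=True)

    # kv.vectors is a NumPy array of shape (vocab_size, embedding_dim).
    weights = torch.FloatTensor(kv.vectors)

    # from_pretrained copies the weights into an Embedding layer (frozen by default).
    embedding = nn.Embedding.from_pretrained(weights)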

6 Answers
  •  遥遥无期
    2020-12-01 03:20

    I had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"

    I just saved the file in txt format and then followed the superb tutorial on loading custom word embeddings:

    import os
    from os.path import basename

    import torch
    from gensim.models import KeyedVectors
    from torchtext import vocab

    def convert_bin_emb_txt(out_path, emb_file):
        # Re-save the binary word2vec file in the plain-text format
        # that torchtext's vocab.Vectors can read.
        txt_name = basename(emb_file).split(".")[0] + ".txt"
        emb_txt_file = os.path.join(out_path, txt_name)
        emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
        emb_model.save_word2vec_format(emb_txt_file, binary=False)
        return emb_txt_file

    emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)
    custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                      cache='custom_embeddings',
                                      unk_init=torch.Tensor.normal_)

    TEXT.build_vocab(train_data,
                     max_size=MAX_VOCAB_SIZE,
                     vectors=custom_embeddings,
                     unk_init=torch.Tensor.normal_)
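
    After build_vocab, the aligned embedding matrix lives in TEXT.vocab.vectors and can be copied into the model. A short sketch of that step (model.embedding is a hypothetical attribute name for an nn.Embedding inside your model):

    pretrained_embeddings = TEXT.vocab.vectors
    model.embedding.weight.data.copy_(pretrained_embeddings)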
    

    Tested with PyTorch 1.2.0 and TorchText 0.4.0.

    I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not in the embeddings using the normal distribution, and make the vectors for the <unk> and <pad> tokens equal to zero.
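
    For that last point, a minimal sketch following the usual torchtext pattern (EMBEDDING_DIM and model.embedding are assumed names; the token indices come from the field's vocab):

    UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
    PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

    # Zero out the rows for <unk> and <pad> so they carry no signal initially.
    model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
    model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)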
