Load pretrained GloVe vectors in Python

眼角桃花 2021-01-29 22:12

I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word vector binary file using gensim, but I don't know how to do it when it is in text file format.
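
For reference, loading a binary word2vec file with gensim is a one-liner (a minimal sketch; the .bin file name is illustrative):

    from gensim.models import KeyedVectors

    # Load a pretrained binary word2vec model (file name is illustrative).
    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    print(model['king'])  # access the vector for a word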

10 Answers
  •  误落风尘
    2021-01-29 22:38

    Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes quite a while (147.2 s on my machine).
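
    For reference, the plain text-file loader timed above probably looks something like this (a minimal sketch; the function name is illustrative):

    import numpy as np

    def load_glove_text(path):
        # Parse one "<word> <v1> ... <vN>" line at a time into a dict.
        embeddings = {}
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.rstrip().split(' ')
                embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
        return embeddings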

    What helps is converting the text file first into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file containing the embedding vectors as a NumPy array (e.g. embeddings.npy).

    Once converted, it takes only 4.96 s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as if you had loaded it from the text file: access time is identical and no additional frameworks are required, but loading is much faster.

    With this code you convert your embedding text file into the two new files:

    import codecs
    import numpy as np

    def convert_to_binary(embedding_path):
        # Writes <embedding_path>.vocab (one word per line) and
        # <embedding_path>.npy (the embedding matrix) from <embedding_path>.txt.
        wv = []
        with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
             codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
            for line in f:
                splitlines = line.split()
                vocab_write.write(splitlines[0].strip())
                vocab_write.write("\n")
                wv.append([float(val) for val in splitlines[1:]])

        np.save(embedding_path + ".npy", np.array(wv))
    

    And with this function you load it efficiently into memory:

    def load_word_emb_binary(embedding_file_name_w_o_suffix):
        # Reads <name>.vocab and <name>.npy back into a {word: vector} dict.
        print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))

        with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', encoding='utf-8') as f_in:
            index2word = [line.strip() for line in f_in]

        wv = np.load(embedding_file_name_w_o_suffix + '.npy')
        # Pair each word with its row in the embedding matrix.
        word_embedding_map = {w: wv[i] for i, w in enumerate(index2word)}
        return word_embedding_map
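
    A quick usage sketch (the glove.42B.300d path prefix is just an example; adjust it to your files):

    # One-time conversion: glove.42B.300d.txt -> glove.42B.300d.{vocab,npy}
    convert_to_binary("glove.42B.300d")

    # Fast loading afterwards.
    embeddings = load_word_emb_binary("glove.42B.300d")
    print(embeddings["the"].shape)  # e.g. (300,)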
    

    Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.
