Load Pretrained glove vectors in python

前端未结

关注

 10  670

眼角桃花 2021-01-29 22:12

I have downloaded pretrained glove vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word vector binary file u

10条回答

误落风尘 (楼主)

2021-01-29 22:38
Loading word embedding from a text file (in my case the glove.42B.300d embeddings) takes a bit long (147.2s on my machine).

What helps is converting the text file first into two new files: a text file that contains the words only (e.g. embeddings.vocab) and a binary file which contains the embedding vectors as numpy-structure (e.g. embeddings.npy).

Once converted, it takes me only 4.96s to load the same embeddings into the memory. This approach ends a up with exactly the same dictionary as if you load it from the text file. It is as efficient in access time and does not require any additional frameworks, but a lot faster in loading time.

With this code you convert your embedding text file to the two new files:
```
def convert_to_binary(embedding_path):
    f = codecs.open(embedding_path + ".txt", 'r', encoding='utf-8')
    wv = []

    with codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        count = 0
        for line in f:
            splitlines = line.split()
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])
        count += 1

    np.save(embedding_path + ".npy", np.array(wv))
```
And with this method you load it efficiently into your memory:
```
def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))

    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]

    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]

    return word_embedding_map
```
Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.
0 讨论(0)

查看其它10个回答
发布评论:

提交评论
- 加载中...