I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file, and I am unable to load and access it. It is easy to load and access a word vector binary file u
Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes quite a while (147.2s on my machine).
What helps is first converting the text file into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file that contains the embedding vectors as a NumPy array (e.g. embeddings.npy).
Once converted, it takes me only 4.96s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as loading from the text file. Access is just as efficient and no frameworks beyond NumPy are required, but loading is much faster.
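For comparison, the slow baseline of parsing the text file directly might look like the sketch below. The function name load_word_emb_text is hypothetical (it is not from the original post), and the sketch assumes a GloVe-style layout: one word per line followed by its space-separated vector components, with no whitespace inside the word itself.

```python
import numpy as np

def load_word_emb_text(embedding_txt_path):
    """Parse a GloVe-style text file into a dict of word -> NumPy vector.

    Hypothetical baseline for comparison; every float is parsed from text
    on each load, which is what makes this path slow for large files.
    """
    word_embedding_map = {}
    with open(embedding_txt_path, "r", encoding="utf-8") as f_in:
        for line in f_in:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # Assumption: the first token is the word, the rest is the vector.
            word_embedding_map[parts[0]] = np.array([float(v) for v in parts[1:]])
    return word_embedding_map
```

The dictionary it returns has the same shape as the one produced by the binary route: each word maps to a one-dimensional NumPy array.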
The following function converts your embedding text file into the two new files:
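import codecs
import numpy as np

def convert_to_binary(embedding_path):
    """Split <embedding_path>.txt into <embedding_path>.vocab and <embedding_path>.npy."""
    wv = []
    # Open both files in one with-statement so they are always closed.
    with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
            codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            if not splitlines:
                continue  # skip blank lines instead of failing on splitlines[0]
            # The first token is the word, the remaining tokens are the vector.
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])
    # Save all vectors as one 2-D array; row i belongs to line i of the .vocab file.
    np.save(embedding_path + ".npy", np.array(wv))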
And with this function you load them efficiently into memory:
import codecs
import numpy as np

def load_word_emb_binary(embedding_file_name_w_o_suffix):
    """Load <name>.vocab and <name>.npy into a dict of word -> vector."""
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))
    # One word per line; line i corresponds to row i of the saved array.
    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]
    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]
    return word_embedding_map
Disclaimer: The two functions convert_to_binary and load_word_emb_binary are shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455, but they might help in this thread.
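As for accessing the vectors once they are loaded: the result is a plain Python dict, so a lookup is just word_embedding_map[word]. Here is a minimal sketch of one way to use it. The helper cosine_similarity and the toy vectors are my own illustration and not part of the linked post; they only assume that each value in the dict is a one-dimensional NumPy array.

```python
import numpy as np

def cosine_similarity(word_embedding_map, word_a, word_b):
    """Return the cosine similarity of two words, or None if it cannot be computed."""
    vec_a = word_embedding_map.get(word_a)
    vec_b = word_embedding_map.get(word_b)
    if vec_a is None or vec_b is None:
        return None  # at least one word is out of vocabulary
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if denom == 0.0:
        return None  # similarity is undefined for a zero vector
    return float(np.dot(vec_a, vec_b) / denom)

# Toy stand-in for the dict returned by load_word_emb_binary (values are made up).
toy_map = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([1.0, 1.0]),
}
print(cosine_similarity(toy_map, "cat", "dog"))
print(cosine_similarity(toy_map, "cat", "unicorn"))
```

Using .get instead of direct indexing avoids a KeyError for words that are not in the vocabulary, which happens often with real text.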