I'm currently using the Keras Tokenizer to create a word index and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However …
I would try a different approach. The main problem is that your word_index is based only on your training data, so it cannot cover words that first appear at prediction time. Try this:
import numpy as np

# load the GloVe embeddings into a dict: word -> vector
embeddings_index = {}
dims = 100
glove_data = 'glove.6B.' + str(dims) + 'd.txt'
with open(glove_data) as f:
    for line in f:
        values = line.split()
        word = values[0]
        value = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = value

# index every GloVe word, starting at 1 so that index 0 stays free for padding
word_index = {w: i for i, w in enumerate(embeddings_index.keys(), 1)}
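As a quick sanity check (assuming the standard glove.6B file, which contains the word 'the'):

print(len(embeddings_index))          # number of words loaded from the GloVe file
print(word_index['the'])              # integer index assigned to 'the'
print(embeddings_index['the'].shape)  # (100,) for the 100d vectors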
# create the embedding matrix; row i holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((len(word_index) + 1, dims))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index stay all-zeros
        embedding_matrix[i] = embedding_vector[:dims]
Now your embedding_matrix contains vectors for all the GloVe words, not just the ones that occur in your training data.
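If it helps, here is a minimal sketch of how such a matrix is typically plugged into a frozen Keras Embedding layer (input_length=20 is just an assumed placeholder; use whatever padded length your inputs actually have):

from keras.models import Sequential
from keras.layers import Embedding

max_len = 20  # assumed padded sequence length, adjust to your data
model = Sequential()
model.add(Embedding(input_dim=len(word_index) + 1,
                    output_dim=dims,
                    weights=[embedding_matrix],
                    input_length=max_len,
                    trainable=False))  # keep the pretrained GloVe weights fixed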
To tokenize your texts you can use something like this:
from keras.preprocessing.text import text_to_word_sequence

def texts_to_sequences(texts, word_index):
    for text in texts:
        tokens = text_to_word_sequence(text)
        # drop words that are not in the GloVe vocabulary
        yield [word_index.get(w) for w in tokens if w in word_index]

sequence = texts_to_sequences(['Test sentence'], word_index)
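Note that texts_to_sequences as written is a generator, so you still have to materialize and pad the sequences before feeding them to a model; a small sketch, assuming the max_len placeholder from above:

from keras.preprocessing.sequence import pad_sequences

sequences = list(texts_to_sequences(['Test sentence'], word_index))
padded = pad_sequences(sequences, maxlen=max_len)  # shape: (num_texts, max_len)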