I'm currently using the Keras Tokenizer to create a word index and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However ...
I had the same problem. In fact, GloVe covered about 90 percent of my data before it was tokenized.
What I did was create a list of the words from my text column in the pandas DataFrame and then build a dictionary from them with enumerate (just like what the Tokenizer in Keras does, but without changing the words or ordering them by frequency).
Then I checked each word against GloVe and, whenever the word was in the GloVe dictionary, added its GloVe vector to my initial weights matrix.
I hope the explanation is clear. This is the code for further explanation:
# creating a vocab of my data
vocab_of_text = set(" ".join(df_concat.text).lower().split())
# pairing each word in the vocab with an index, starting from 1
vocab_of_text = list(enumerate(vocab_of_text, 1))
# creating a dictionary that maps each word to its index
indexed_vocab = {word: idx for idx, word in vocab_of_text}
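The weights-matrix code below relies on embedding_index, a dictionary mapping each GloVe word to its vector. If you do not already have it, here is a minimal sketch of loading it from a standard 100-dimensional GloVe text file (the file name glove.6B.100d.txt and its location are my assumptions; point it to whatever GloVe file you downloaded):

import numpy as np

# building embedding_index: a dictionary of 'word': vector from the GloVe file
# (the path 'glove.6B.100d.txt' is an assumption; use your own GloVe file)
embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs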
Then we use GloVe to fill our weights matrix:
import numpy as np

# creating a matrix for the initial weights (100-dimensional GloVe vectors)
vocab_matrix = np.zeros((len(indexed_vocab) + 1, 100))
# searching for vectors in GloVe
# embedding_index is a dictionary built from GloVe,
# with the shape of 'word': vector
for word, i in indexed_vocab.items():
    vector = embedding_index.get(word)
    if vector is not None:
        vocab_matrix[i] = vector
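If you want to check how much of your vocabulary GloVe actually covers (the roughly 90 percent I mentioned above), a quick sketch that counts the matches:

# counting how many vocab words were found in GloVe (coverage check)
found = sum(1 for word in indexed_vocab if word in embedding_index)
coverage = found / len(indexed_vocab)
print(f"GloVe coverage: {coverage:.1%} ({found} of {len(indexed_vocab)} words)")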
And then, to make the text ready for the embedding layer:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def text_to_sequence(text, word_index):
    tokens = text.lower().split()
    return [word_index.get(token) for token in tokens if word_index.get(token) is not None]

# giving ids to the words of each row
df_concat['sequences'] = df_concat.text.apply(lambda x: text_to_sequence(x, indexed_vocab))

max_len_seq = 34
# padding (and truncating) every sequence to the same length
padded = pad_sequences(df_concat['sequences'],
                       maxlen=max_len_seq, padding='post',
                       truncating='post')
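Finally, here is a sketch of how vocab_matrix and padded could be wired into a model. The Embedding layer arguments below (and using weights=[...] with trainable=False) are my assumptions for a classic Keras 2-style setup; newer Keras versions may prefer embeddings_initializer=Constant(vocab_matrix) instead:

from tensorflow.keras.layers import Embedding

# the embedding layer gets the GloVe-initialised matrix as fixed weights
embedding_layer = Embedding(input_dim=vocab_matrix.shape[0],   # vocab size + 1
                            output_dim=100,                    # GloVe vector size
                            weights=[vocab_matrix],
                            input_length=max_len_seq,
                            trainable=False)
# 'padded' (shape: num_texts x max_len_seq) is what you feed into the model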
Also, thanks to @spadarian for his answer. I came up with this after reading and implementing his idea.