Using the Keras Tokenizer for new words not in the training set

没有蜡笔的小新 2020-12-28 08:45

I'm currently using the Keras Tokenizer to create a word index and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However, the word index only covers words seen in my training data, so new words encountered at test time are dropped.

3 Answers
  •  夕颜 (OP)
     2020-12-28 09:13

    I would try a different approach. The main problem is that your word_index is based on your training data. Try this:

    #load the GloVe embedding into a dict mapping word -> vector
    import numpy as np
    
    embeddings_index = {}
    dims = 100
    glove_data = 'glove.6B.' + str(dims) + 'd.txt'
    with open(glove_data) as f:
        for line in f:
            values = line.split()
            word = values[0]
            value = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = value
    
    # index every GloVe word, reserving index 0 for padding
    word_index = {w: i for i, w in enumerate(embeddings_index.keys(), 1)}
    
    #create embedding matrix; row 0 stays all zeros and serves as the padding index
    embedding_matrix = np.zeros((len(word_index) + 1, dims))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector[:dims]
    

    Now your embedding_matrix contains all the GloVe words.
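
    To actually use the matrix in a model, you can pass it as the initial weights of an Embedding layer. This is a minimal sketch rather than part of the original answer; maxlen is an assumed padded sequence length, and freezing the layer with trainable=False is one reasonable choice, not a requirement:

    from keras.layers import Embedding
    
    maxlen = 50  # assumed length of your padded input sequences
    
    embedding_layer = Embedding(input_dim=len(word_index) + 1,
                                output_dim=dims,
                                weights=[embedding_matrix],  # initialise with the GloVe vectors
                                input_length=maxlen,
                                trainable=False)             # keep the pretrained vectors fixed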

    To tokenize your texts, you can use something like this:

    from keras.preprocessing.text import text_to_word_sequence
    
    def texts_to_sequences(texts, word_index):
        # split each text with Keras' tokenizer rules and keep only
        # the words that exist in the GloVe-based word_index
        for text in texts:
            tokens = text_to_word_sequence(text)
            yield [word_index.get(w) for w in tokens if w in word_index]
    
    sequences = list(texts_to_sequences(['Test sentence'], word_index))
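
    Because the resulting sequences have different lengths, they usually need to be padded before going into the Embedding layer. A short sketch using Keras' pad_sequences, with maxlen again being the assumed length from above:

    from keras.preprocessing.sequence import pad_sequences
    
    # pad/truncate every sequence to the length the Embedding layer expects
    data = pad_sequences(sequences, maxlen=maxlen)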
    
