Using Keras Tokenizer for new words not in the training set


I'm currently using the Keras Tokenizer to create a word index and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However…
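(For context, a minimal sketch of the setup the question describes; the `texts` list and the tiny `embeddings_index` dict here are placeholders standing in for real training data and the loaded GloVe file.)

    import numpy as np
    from keras.preprocessing.text import Tokenizer

    texts = ["some training sentences", "another training sentence"]
    dims = 100
    # stand-in for the GloVe dict loaded from disk ({word: vector})
    embeddings_index = {"training": np.random.rand(dims)}

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    word_index = tokenizer.word_index  # {word: index}, built from the training data only

    embedding_matrix = np.zeros((len(word_index) + 1, dims))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector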

3 Answers
  • 2020-12-28 09:07

    I had the same problem. In fact, GloVe covered about 90 percent of my data before it was tokenized.

    What I did was create a list of the words from the text column in my pandas DataFrame and then build a dictionary from them with enumerate.

    (This is just like what the Keras Tokenizer does, but without altering the words or ordering them by frequency.)

    Then I checked each word against GloVe and, whenever the word was in the GloVe dictionary, added its GloVe vector to my initial weights matrix.

    I hope the explanation is clear. Here is the code:

    # creating a vocab of my data
    vocab_of_text = set(" ".join(df_concat.text).lower().split())

    # pairing each word with an index (starting from 1)
    vocab_of_text = list(enumerate(vocab_of_text, 1))

    # flipping the pairs so the words become the keys: {word: index}
    indexed_vocab = {word: index for index, word in dict(vocab_of_text).items()}
    

    Then we use GloVe to build the initial weights matrix:

    import numpy as np

    # creating a matrix for the initial weights
    vocab_matrix = np.zeros((len(indexed_vocab) + 1, 100))

    # searching for vectors in GloVe
    # embedding_index is a dictionary built from GloVe, of the shape {'word': vector}
    for word, i in indexed_vocab.items():
        vector = embedding_index.get(word)
        if vector is not None:
            vocab_matrix[i] = vector
    

    And then, to get the text ready for the embedding layer:

    from keras.preprocessing.sequence import pad_sequences

    def text_to_sequence(text, word_index):
        tokens = text.lower().split()
        return [word_index.get(token) for token in tokens if word_index.get(token) is not None]

    # giving ids
    df_concat['sequences'] = df_concat.text.apply(lambda x: text_to_sequence(x, indexed_vocab))

    max_len_seq = 34

    # padding
    padded = pad_sequences(df_concat['sequences'],
                           maxlen=max_len_seq, padding='post',
                           truncating='post')
    
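    Not part of the original answer, but as a rough sketch of the next step: vocab_matrix and padded can be plugged into a Keras Embedding layer. The LSTM/Dense head and the sizes below are purely illustrative, and labels is assumed to be whatever target column you have:

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    model = Sequential([
        Embedding(input_dim=len(indexed_vocab) + 1,
                  output_dim=100,               # must match the GloVe dimensionality
                  weights=[vocab_matrix],       # the GloVe-initialised matrix built above
                  input_length=max_len_seq,
                  trainable=False),             # keep the pretrained vectors frozen
        LSTM(64),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    # model.fit(padded, labels, epochs=...)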

    Thanks also to @spadarian for his answer; I came up with this after reading and implementing his idea.

  • 2020-12-28 09:13

    I would try a different approach. The main problem is that your word_index is based only on your training data. Try this:

    import numpy as np

    # load the GloVe embeddings into a dict
    embeddings_index = {}
    dims = 100
    glove_data = 'glove.6B.' + str(dims) + 'd.txt'
    with open(glove_data) as f:
        for line in f:
            values = line.split()
            word = values[0]
            value = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = value
    
    word_index = {w: i for i, w in enumerate(embeddings_index.keys(), 1)}
    
    #create embedding matrix
    embedding_matrix = np.zeros((len(word_index) + 1, dims))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector[:dims]
    

    Now your embedding_matrix contains all the GloVe words.

    To tokenize your texts you can use something like this:

    from keras.preprocessing.text import text_to_word_sequence
    
    def texts_to_sequences(texts, word_index):
        for text in texts:
            tokens = text_to_word_sequence(text)
            yield [word_index.get(w) for w in tokens if w in word_index]
    
    sequence = texts_to_sequences(['Test sentence'], word_index)
    
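    Not from the original answer, but a small follow-up sketch: since texts_to_sequences is a generator, materialise it before padding, and the embedding_matrix can then be handed to a (frozen) Embedding layer. maxlen=10 is just an illustrative value:

    from keras.preprocessing.sequence import pad_sequences
    from keras.layers import Embedding

    # materialise the generator before padding
    padded = pad_sequences(list(sequence), maxlen=10, padding='post')

    # plug the GloVe-based matrix into an Embedding layer
    embedding_layer = Embedding(input_dim=len(word_index) + 1,
                                output_dim=dims,
                                weights=[embedding_matrix],
                                trainable=False)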
  • 2020-12-28 09:30

    The Keras Tokenizer has an oov_token parameter. Just pick a token for it and unknown words will be mapped to that one.

    from keras.preprocessing.text import Tokenizer

    tokenizer_a = Tokenizer(oov_token=1)
    tokenizer_b = Tokenizer()
    tokenizer_a.fit_on_texts(["Hello world"])
    tokenizer_b.fit_on_texts(["Hello world"])
    

    Outputs

    In [26]: tokenizer_a.texts_to_sequences(["Hello cruel world"])
    Out[26]: [[2, 1, 3]]
    
    In [27]: tokenizer_b.texts_to_sequences(["Hello cruel world"])
    Out[27]: [[1, 2]]
    
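    As a side note (not in the original answer), a string OOV token such as '<OOV>' is the more common choice, and inspecting word_index shows where it lands:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(oov_token='<OOV>')
    tokenizer.fit_on_texts(["Hello world"])
    print(tokenizer.word_index)
    # e.g. {'<OOV>': 1, 'hello': 2, 'world': 3}
    print(tokenizer.texts_to_sequences(["Hello cruel world"]))
    # [[2, 1, 3]] -- the unseen word 'cruel' is mapped to the OOV index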