Tensorflow: Word2vec CBOW model

余生分开走 2020-12-30 07:25

I am new to TensorFlow and to word2vec. I just studied word2vec_basic.py, which trains the model using the Skip-Gram algorithm. Now I want to train using CBOW.

3 Answers
  •  猫巷女王i
    2020-12-30 07:54

    For CBOW, you only need to change a few parts of word2vec_basic.py; the overall training structure and method stay the same.

    Which parts should I change in word2vec_basic.py?

    1) The way training data pairs are generated, because in CBOW you predict the center word from its context words, not the other way around.

    The new version of generate_batch will be:

    def generate_batch(batch_size, bag_window):
      # Uses the corpus `data` (a list of word ids) and the global cursor
      # `data_index`, both already defined in word2vec_basic.py.
      global data_index
      span = 2 * bag_window + 1 # [ bag_window target bag_window ]
      batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
      labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
      buffer = collections.deque(maxlen=span)
      for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      for i in range(batch_size):
        # copy the window, pop out the center word as the label and keep
        # the surrounding context words as the input
        buffer_list = list(buffer)
        labels[i, 0] = buffer_list.pop(bag_window)
        batch[i] = buffer_list
        # slide the window one word to the right
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      return batch, labels
    

    The new training data for CBOW would then be:

    data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the']
    
    #with bag_window = 1:
        batch: [['anarchism', 'as'], ['originated', 'a'], ['as', 'term'], ['a', 'of']]
        labels: ['originated', 'as', 'a', 'term']
    

    compared to Skip-gram's data

    #with num_skips = 2 and skip_window = 1:
        batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term', 'of', 'of', 'abuse', 'abuse', 'first', 'first', 'used', 'used']
        labels: ['as', 'anarchism', 'originated', 'a', 'term', 'as', 'a', 'of', 'term', 'abuse', 'of', 'first', 'used', 'abuse', 'against', 'first']
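
    The CBOW listing above can be reproduced with a quick sanity check (just a sketch; it assumes data, reverse_dictionary and the CBOW generate_batch above are defined as in word2vec_basic.py):

    data_index = 0
    batch, labels = generate_batch(batch_size=4, bag_window=1)
    # decode the word ids back into words for inspection
    print('batch:', [[reverse_dictionary[i] for i in row] for row in batch])
    print('labels:', [reverse_dictionary[i] for i in labels.reshape(-1)])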
    

    2) Therefore you also need to change the shape of the input placeholder from

    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    

    to

    train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
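
    With this placeholder, tf.nn.embedding_lookup returns one embedding vector per context word, which is why the loss in step 3 reduces over axis 1. A small sketch of the shapes involved (assuming the embeddings variable from word2vec_basic.py):

    # embed: [batch_size, 2 * bag_window, embedding_size]
    # -- one embedding vector for every context word in the window
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # summing over axis 1 collapses the window into a single
    # [batch_size, embedding_size] context vector for the sampled softmax
    context_vector = tf.reduce_sum(embed, 1)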
    

    3) Finally, the loss function:

     loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
         weights=softmax_weights,
         biases=softmax_biases,
         inputs=tf.reduce_sum(embed, 1),
         labels=train_labels,
         num_sampled=num_sampled,
         num_classes=vocabulary_size))
    

    Notice inputs = tf.reduce_sum(embed, 1), which sums the embeddings of all context words, as Zichen Wang mentioned.
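
    As a side note, the original CBOW formulation averages the context vectors rather than summing them. If you prefer that behavior, an optional variant (not part of word2vec_basic.py) is to pass the mean instead:

    # optional CBOW variant: average the context embeddings instead of
    # summing them, then feed this as inputs to sampled_softmax_loss above;
    # with a fixed window size the two differ only by a constant scale
    averaged_context = tf.reduce_mean(embed, 1)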

    This is it!
