Tensorflow: Word2vec CBOW model

余生分开走 2020-12-30 07:25

I am new to TensorFlow and to word2vec. I just studied word2vec_basic.py, which trains the model using the Skip-Gram algorithm. Now I want to train using CBOW.

3 Answers
  •  猫巷女王i
    2020-12-30 07:54

    For CBOW, you only need to change a few parts of word2vec_basic.py; the overall training structure and method stay the same.

    Which parts should I change in word2vec_basic.py?

    1) The way training data pairs are generated, because in CBOW you predict the center word from its context words, not the other way around.

    The new version of generate_batch will be:

    def generate_batch(batch_size, bag_window):
      # Uses the corpus `data` (a list of word ids) and the global cursor
      # `data_index`, both already defined in word2vec_basic.py.
      global data_index
      span = 2 * bag_window + 1 # [ bag_window target bag_window ]
      batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
      labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
      buffer = collections.deque(maxlen=span)
      for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      for i in range(batch_size):
        # copy the window, pop out the center word as the label and keep
        # the surrounding context words as the input
        buffer_list = list(buffer)
        labels[i, 0] = buffer_list.pop(bag_window)
        batch[i] = buffer_list
        # slide the window one word to the right
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      return batch, labels
    

    The new training data for CBOW would then be:

    data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the']
    
    #with bag_window = 1:
        batch: [['anarchism', 'as'], ['originated', 'a'], ['as', 'term'], ['a', 'of']]
        labels: ['originated', 'as', 'a', 'term']
    

    compared to Skip-gram's data

    #with num_skips = 2 and skip_window = 1:
        batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term', 'of', 'of', 'abuse', 'abuse', 'first', 'first', 'used', 'used']
        labels: ['as', 'anarchism', 'originated', 'a', 'term', 'as', 'a', 'of', 'term', 'abuse', 'of', 'first', 'used', 'abuse', 'against', 'first']
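
    The CBOW listing above can be reproduced with a quick sanity check (just a sketch; it assumes data, reverse_dictionary and the CBOW generate_batch above are defined as in word2vec_basic.py):

    data_index = 0
    batch, labels = generate_batch(batch_size=4, bag_window=1)
    # decode the word ids back into words for inspection
    print('batch:', [[reverse_dictionary[i] for i in row] for row in batch])
    print('labels:', [reverse_dictionary[i] for i in labels.reshape(-1)])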
    

    2) Therefore you also need to change the shape of the input placeholder from

    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    

    to

    train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
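
    With this placeholder, tf.nn.embedding_lookup returns one embedding vector per context word, which is why the loss in step 3 reduces over axis 1. A small sketch of the shapes involved (assuming the embeddings variable from word2vec_basic.py):

    # embed: [batch_size, 2 * bag_window, embedding_size]
    # -- one embedding vector for every context word in the window
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # summing over axis 1 collapses the window into a single
    # [batch_size, embedding_size] context vector for the sampled softmax
    context_vector = tf.reduce_sum(embed, 1)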
    

    3) Finally, the loss function:

     loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
         weights=softmax_weights,
         biases=softmax_biases,
         inputs=tf.reduce_sum(embed, 1),
         labels=train_labels,
         num_sampled=num_sampled,
         num_classes=vocabulary_size))
    

    Notice inputs = tf.reduce_sum(embed, 1), which sums the embeddings of all context words, as Zichen Wang mentioned.
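
    As a side note, the original CBOW formulation averages the context vectors rather than summing them. If you prefer that behavior, an optional variant (not part of word2vec_basic.py) is to pass the mean instead:

    # optional CBOW variant: average the context embeddings instead of
    # summing them, then feed this as inputs to sampled_softmax_loss above;
    # with a fixed window size the two differ only by a constant scale
    averaged_context = tf.reduce_mean(embed, 1)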

    This is it!
