Image Caption Walkthrough Series (Part 2): "Show, Attend and Tell_Neural Image Caption"

https://blog.csdn.net/shenxiaolu1984/article/details/51493673

3. Model Structure

Only the LSTM part is changed; everything else is the same as in NIC.

4. Code Analysis

 reshaped_conv5_3_feats = tf.reshape(conv5_3_feats,[config.batch_size, 196, 512])
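To make the reshape concrete, here is a small NumPy sketch. It assumes the conv5_3 feature map has shape (batch_size, 14, 14, 512); the 14 x 14 grid is inferred from 196 = 14 * 14 and is not stated in the excerpt above. The tiny batch size and the variable names are for illustration only.

```python
import numpy as np

batch_size = 2  # hypothetical small batch for illustration

# Assumed conv5_3 output: a 14x14 spatial grid with 512 channels per location.
conv5_3_feats = np.arange(batch_size * 14 * 14 * 512, dtype=np.float32)
conv5_3_feats = conv5_3_feats.reshape(batch_size, 14, 14, 512)

# Flatten the spatial grid into 196 region vectors per image.
reshaped = conv5_3_feats.reshape(batch_size, 196, 512)

# With row-major flattening, region k corresponds to grid cell (k // 14, k % 14).
row, col = 3, 5
k = row * 14 + col
same = np.array_equal(reshaped[0, k], conv5_3_feats[0, row, col])
```

Each of the 196 rows is therefore the 512-dimensional feature of one image region, which is what the attention weights are later applied to.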

    context_mean = tf.reduce_mean(self.conv_feats, axis = 1)        # mean image feature serves as the initial context, (batch_size, 512)
    initial_memory, initial_output = self.initialize(context_mean)  # two fully connected layers give the initial memory (c) and output (o)
    initial_state = initial_memory, initial_output                  # initial input state
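The body of initialize() is not shown here, so the following is only a sketch of the idea in NumPy: average the 196 region features, then map the mean through small fully connected networks to get the initial memory and output. The helper two_layer_fc, the tanh activation, the use of two layers per branch, and the random weights are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_ctx, dim_ctx, num_units = 2, 196, 512, 512

conv_feats = rng.standard_normal((batch_size, num_ctx, dim_ctx))

# Average over the 196 regions: one 512-d summary vector per image.
context_mean = conv_feats.mean(axis=1)              # (batch_size, 512)

def two_layer_fc(x, w1, b1, w2, b2):
    """Hypothetical stand-in for the fully connected layers inside initialize()."""
    hidden = np.tanh(x @ w1 + b1)
    return hidden @ w2 + b2

def random_params():
    # Randomly initialised weights, for illustration only.
    return (rng.standard_normal((dim_ctx, num_units)) * 0.01, np.zeros(num_units),
            rng.standard_normal((num_units, num_units)) * 0.01, np.zeros(num_units))

initial_memory = two_layer_fc(context_mean, *random_params())   # c, (batch_size, 512)
initial_output = two_layer_fc(context_mean, *random_params())   # o, (batch_size, 512)
initial_state = (initial_memory, initial_output)
```

The point of the design is that the LSTM starts from a state that already summarizes the whole image, instead of starting from zeros.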

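α_t has dimension L = 196 and records how much attention each pixel location of the annotation a receives.

The weights α_t can be obtained by passing the system's hidden variable h_t from the previous step through several fully connected layers. The encoding e_t stores the information from the previous step. Gray indicates that a module contains parameters to be optimized.

"Where to look" depends not only on the actual image but also on what has been seen before.

In the first step, the weights are determined entirely by the image features a: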
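The dependence on both the image and the previous hidden state can be sketched in NumPy as follows. This assumes an additive scoring function followed by a softmax over the 196 regions; the weight names w_ctx, w_out, v and the attention size A are hypothetical, and the actual attend() in the code may be structured differently.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(contexts, last_output, w_ctx, w_out, v):
    """Score each region against the previous hidden output, then normalise.

    contexts:    (batch_size, L, D) region features a
    last_output: (batch_size, H) hidden output from the previous step
    returns:     (batch_size, L) attention weights alpha
    """
    proj_ctx = contexts @ w_ctx                      # (batch, L, A)
    proj_out = (last_output @ w_out)[:, None, :]     # (batch, 1, A), broadcast over regions
    scores = np.tanh(proj_ctx + proj_out) @ v        # (batch, L)
    return softmax(scores, axis=1)

rng = np.random.default_rng(0)
batch_size, L, D, H, A = 2, 196, 512, 512, 64
contexts = rng.standard_normal((batch_size, L, D))
last_output = rng.standard_normal((batch_size, H))
w_ctx = rng.standard_normal((D, A)) * 0.01   # illustrative random weights
w_out = rng.standard_normal((H, A)) * 0.01
v = rng.standard_normal(A)

alpha = attend(contexts, last_output, w_ctx, w_out, v)
```

Each row of alpha is non-negative and sums to 1, so it can be read as a distribution over the 196 regions.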

    alpha = self.attend(contexts, last_output)                # attention mechanism: weights over the 196 regions, (batch_size, 196)
    context = tf.reduce_sum(contexts*tf.expand_dims(alpha, 2),
                            axis = 1)                         # weighted context, (batch_size, 512)
    if self.is_train:
        tiled_masks = tf.tile(tf.expand_dims(masks[:, idx], 1),
                              [1, self.num_ctx])              # (batch_size, 196); masks[:, idx] is the mask of the whole batch at time step idx
        masked_alpha = alpha * tiled_masks                    # where the mask is 0, the weights become 0 as well
        alphas.append(tf.reshape(masked_alpha, [-1]))         # masked_alpha: (batch_size, 196)

Store the weights of the current time step in a list.

    alphas.append(tf.reshape(masked_alpha, [-1]))  # masked_alpha: (batch_size, 196), flattened to (batch_size*196,) before being stored
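The weighted sum and the masking can be checked on tiny arrays. The sketch below mirrors the TensorFlow calls with NumPy, using 4 regions and 3 channels instead of 196 and 512; all values are made up for illustration.

```python
import numpy as np

batch_size, num_ctx, dim_ctx = 2, 4, 3   # tiny sizes instead of 196 and 512
contexts = np.arange(batch_size * num_ctx * dim_ctx, dtype=float)
contexts = contexts.reshape(batch_size, num_ctx, dim_ctx)
alpha = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.25, 0.25, 0.25, 0.25]])

# Weighted sum over regions: alpha is expanded to (batch, L, 1) and broadcast over channels.
context = (contexts * alpha[:, :, None]).sum(axis=1)      # (batch_size, dim_ctx)

# Mask for the current time step: 1 for a real word, 0 for padding.
mask_t = np.array([1.0, 0.0])
tiled_masks = np.tile(mask_t[:, None], (1, num_ctx))      # (batch_size, num_ctx)
masked_alpha = alpha * tiled_masks                         # padded samples get all-zero weights
```

With uniform weights (the second sample) the weighted context is simply the mean of the region features, and the padded sample contributes zero attention weights to the stored list.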

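    current_input = tf.concat([context, word_embed], 1)  # the input at this time step is the weighted context concatenated with the word embedding, (batch_size, 1024)
    output, state = lstm(current_input, last_state)      # (batch_size, 512)
    memory, _ = state                                    # the rest is the same as Show and Tell; memory and output are both (batch_size, 512)

The resulting output and the weighted context are used to compute the probability of the next word and make the prediction.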
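To trace the shapes through this step, here is a NumPy sketch of one step of a basic LSTM cell fed with the concatenated input. It is a generic LSTM step written for illustration, not the cell used in the code; the function lstm_step and its random parameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, state, w, b):
    """One step of a basic LSTM cell; state is the pair (memory c, output h)."""
    c, h = state
    gates = np.concatenate([x, h], axis=1) @ w + b    # (batch, 4 * units)
    i, g, f, o = np.split(gates, 4, axis=1)            # input, candidate, forget, output
    new_c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    new_h = sigmoid(o) * np.tanh(new_c)
    return new_h, (new_c, new_h)

rng = np.random.default_rng(0)
batch_size, dim_ctx, dim_embed, num_units = 2, 512, 512, 512

context = rng.standard_normal((batch_size, dim_ctx))
word_embed = rng.standard_normal((batch_size, dim_embed))
last_state = (np.zeros((batch_size, num_units)), np.zeros((batch_size, num_units)))

# Hypothetical randomly initialised cell parameters.
w = rng.standard_normal((dim_ctx + dim_embed + num_units, 4 * num_units)) * 0.01
b = np.zeros(4 * num_units)

current_input = np.concatenate([context, word_embed], axis=1)   # (batch_size, 1024)
output, state = lstm_step(current_input, last_state, w, b)       # output: (batch_size, 512)
memory, _ = state                                                # memory: (batch_size, 512)
```

The only difference from a plain Show and Tell step is that the input carries the attended context alongside the word embedding, which doubles its width to 1024.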

    logits = self.decode(expanded_output)  # (batch_size, 5000)
    probs = tf.nn.softmax(logits)
    prediction = tf.argmax(logits, 1)

    last_output = output
    last_memory = memory
    last_state = state
    last_word = sentences[:, idx]  # move on to the next word
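Note that the prediction is taken from the logits rather than from probs. Since softmax is monotonic, both give the same index, as this small NumPy check shows (a toy vocabulary of 5 words instead of 5000, with made-up logits):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Made-up logits for a batch of 2 over a toy vocabulary of 5 words.
logits = np.array([[2.0, 0.5, -1.0, 3.0, 0.0],
                   [0.1, 4.0, 0.3, -2.0, 1.0]])

probs = softmax(logits, axis=1)            # each row sums to 1
prediction = logits.argmax(axis=1)         # index of the most likely word per sample
prediction_from_probs = probs.argmax(axis=1)
```

Here prediction and prediction_from_probs are identical, so skipping the softmax before the argmax saves computation without changing the predicted word.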
