Looking back at the earlier examples, you will notice that building deep neural networks relies on the core module tensorflow.nn. Let's dig into its source code to see exactly what it provides.
1 # Copyright 2015 Google Inc. All Rights Reserved.
2 #
3 # Licensed under the Apache License, Version 2.0 (the "License");
4 # you may not use this file except in compliance with the License.
5 # You may obtain a copy of the License at
6 #
7 # http://www.apache.org/licenses/LICENSE-2.0
8 #
9 # Unless required by applicable law or agreed to in writing, software
10 # distributed under the License is distributed on an "AS IS" BASIS,
11 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 # See the License for the specific language governing permissions and
13 # limitations under the License.
14 # ==============================================================================
15
16 # pylint: disable=unused-import,g-bad-import-order
17 """## Activation Functions
18
19 The activation ops provide different types of nonlinearities for use in neural
20 networks. These include smooth nonlinearities (`sigmoid`, `tanh`, `elu`,
21 `softplus`, and `softsign`), continuous but not everywhere differentiable
22 functions (`relu`, `relu6`, and `relu_x`), and random regularization
23 (`dropout`).
24
25 All activation ops apply componentwise, and produce a tensor of the same
26 shape as the input tensor.
27
28 @@relu
29 @@relu6
30 @@elu
31 @@softplus
32 @@softsign
33 @@dropout
34 @@bias_add
35 @@sigmoid
36 @@tanh
37
38 ## Convolution
39
40 The convolution ops sweep a 2-D filter over a batch of images, applying the
41 filter to each window of each image of the appropriate size. The different
42 ops trade off between generic vs. specific filters:
43
44 * `conv2d`: Arbitrary filters that can mix channels together.
45 * `depthwise_conv2d`: Filters that operate on each channel independently.
46 * `separable_conv2d`: A depthwise spatial filter followed by a pointwise filter.
47
48 Note that although these ops are called "convolution", they are strictly
49 speaking "cross-correlation" since the filter is combined with an input window
50 without reversing the filter. For details, see [the properties of
51 cross-correlation](https://en.wikipedia.org/wiki/Cross-correlation#Properties).
52
53 The filter is applied to image patches of the same size as the filter and
54 strided according to the `strides` argument. `strides = [1, 1, 1, 1]` applies
55 the filter to a patch at every offset, `strides = [1, 2, 2, 1]` applies the
56 filter to every other image patch in each dimension, etc.
57
58 Ignoring channels for the moment, assume that the 4-D `input` has shape
59 `[batch, in_height, in_width, ...]` and the 4-D `filter` has shape
60 `[filter_height, filter_width, ...]`, then the spatial semantics of the
61 convolution ops are as follows: first, according to the padding scheme chosen
62 as `'SAME'` or `'VALID'`, the output size and the padding pixels are computed.
63 For the `'SAME'` padding, the output height and width are computed as:
64
65 out_height = ceil(float(in_height) / float(strides[1]))
66 out_width = ceil(float(in_width) / float(strides[2]))
67
68 and the padding on the top and left are computed as:
69
70 pad_along_height = ((out_height - 1) * strides[1] +
71 filter_height - in_height)
72 pad_along_width = ((out_width - 1) * strides[2] +
73 filter_width - in_width)
74 pad_top = pad_along_height / 2
75 pad_left = pad_along_width / 2
76
77 Note that the division by 2 means that there might be cases when the padding on
78 both sides (top vs bottom, right vs left) are off by one. In this case, the
79 bottom and right sides always get the one additional padded pixel. For example,
80 when `pad_along_height` is 5, we pad 2 pixels at the top and 3 pixels at the
81 bottom. Note that this is different from existing libraries such as cuDNN and
82 Caffe, which explicitly specify the number of padded pixels and always pad the
83 same number of pixels on both sides.
84
85 For the `'VALID'` padding, the output height and width are computed as:
86
87 out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
88 out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))
89
90 and the padding values are always zero. The output is then computed as
91
92 output[b, i, j, :] =
93 sum_{di, dj} input[b, strides[1] * i + di - pad_top,
94 strides[2] * j + dj - pad_left, ...] *
95 filter[di, dj, ...]
96
97 where any value outside the original input image region is considered zero (
98 i.e. we pad zero values around the border of the image).
99
100 Since `input` is 4-D, each `input[b, i, j, :]` is a vector. For `conv2d`, these
101 vectors are multiplied by the `filter[di, dj, :, :]` matrices to produce new
102 vectors. For `depthwise_conv_2d`, each scalar component `input[b, i, j, k]`
103 is multiplied by a vector `filter[di, dj, k]`, and all the vectors are
104 concatenated.
105
106 @@conv2d
107 @@depthwise_conv2d
108 @@separable_conv2d
109 @@conv2d_transpose
110
111 ## Pooling
112
113 The pooling ops sweep a rectangular window over the input tensor, computing a
114 reduction operation for each window (average, max, or max with argmax). Each
115 pooling op uses rectangular windows of size `ksize` separated by offset
116 `strides`. For example, if `strides` is all ones every window is used, if
117 `strides` is all twos every other window is used in each dimension, etc.
118
119 In detail, the output is
120
121 output[i] = reduce(value[strides * i:strides * i + ksize])
122
123 where the indices also take into consideration the padding values. Please refer
124 to the `Convolution` section for details about the padding calculation.
125
126 @@avg_pool
127 @@max_pool
128 @@max_pool_with_argmax
129
130 ## Normalization
131
132 Normalization is useful to prevent neurons from saturating when inputs may
133 have varying scale, and to aid generalization.
134
135 @@l2_normalize
136 @@local_response_normalization
137 @@sufficient_statistics
138 @@normalize_moments
139 @@moments
140
141 ## Losses
142
143 The loss ops measure error between two tensors, or between a tensor and zero.
144 These can be used for measuring accuracy of a network in a regression task
145 or for regularization purposes (weight decay).
146
147 @@l2_loss
148
149 ## Classification
150
151 TensorFlow provides several operations that help you perform classification.
152
153 @@sigmoid_cross_entropy_with_logits
154 @@softmax
155 @@log_softmax
156 @@softmax_cross_entropy_with_logits
157 @@sparse_softmax_cross_entropy_with_logits
158 @@weighted_cross_entropy_with_logits
159
160 ## Embeddings
161
162 TensorFlow provides library support for looking up values in embedding
163 tensors.
164
165 @@embedding_lookup
166 @@embedding_lookup_sparse
167
168 ## Evaluation
169
170 The evaluation ops are useful for measuring the performance of a network.
171 Since they are nondifferentiable, they are typically used at evaluation time.
172
173 @@top_k
174 @@in_top_k
175
176 ## Candidate Sampling
177
178 Do you want to train a multiclass or multilabel model with thousands
179 or millions of output classes (for example, a language model with a
180 large vocabulary)? Training with a full Softmax is slow in this case,
181 since all of the classes are evaluated for every training example.
182 Candidate Sampling training algorithms can speed up your step times by
183 only considering a small randomly-chosen subset of contrastive classes
184 (called candidates) for each batch of training examples.
185
186 See our [Candidate Sampling Algorithms Reference]
187 (../../extras/candidate_sampling.pdf)
188
189 ### Sampled Loss Functions
190
191 TensorFlow provides the following sampled loss functions for faster training.
192
193 @@nce_loss
194 @@sampled_softmax_loss
195
196 ### Candidate Samplers
197
198 TensorFlow provides the following samplers for randomly sampling candidate
199 classes when using one of the sampled loss functions above.
200
201 @@uniform_candidate_sampler
202 @@log_uniform_candidate_sampler
203 @@learned_unigram_candidate_sampler
204 @@fixed_unigram_candidate_sampler
205
206 ### Miscellaneous candidate sampling utilities
207
208 @@compute_accidental_hits
209
210 """
211 from __future__ import absolute_import
212 from __future__ import division
213 from __future__ import print_function
214
215 from six.moves import xrange # pylint: disable=redefined-builtin
216
217 from tensorflow.python.framework import dtypes
218 from tensorflow.python.framework import ops
219 from tensorflow.python.framework import tensor_shape
220 from tensorflow.python.ops import array_ops
221 from tensorflow.python.ops import candidate_sampling_ops
222 from tensorflow.python.ops import constant_op
223 from tensorflow.python.ops import control_flow_ops
224 from tensorflow.python.ops import embedding_ops
225 from tensorflow.python.ops import init_ops
226 from tensorflow.python.ops import math_ops
227 from tensorflow.python.ops import nn_grad
228 from tensorflow.python.ops import nn_ops
229 from tensorflow.python.ops import numerics
230 from tensorflow.python.ops import random_ops
231 from tensorflow.python.ops import rnn_cell
232 from tensorflow.python.ops import seq2seq
233 from tensorflow.python.ops import sparse_ops
234 from tensorflow.python.ops import variable_scope as vs
235 from tensorflow.python.ops.math_ops import sigmoid
236 from tensorflow.python.ops.math_ops import tanh
237 from tensorflow.python.util.all_util import make_all
238
239 # Bring more nn-associated functionality into this package.
240 # go/tf-wildcard-import
241 # pylint: disable=wildcard-import
242 from tensorflow.python.ops.nn_ops import *
243 from tensorflow.python.ops.candidate_sampling_ops import *
244 from tensorflow.python.ops.embedding_ops import *
245 from tensorflow.python.ops.rnn import *
246 # pylint: enable=wildcard-import
247
248
249 def sigmoid_cross_entropy_with_logits(logits, targets, name=None):
250 """Computes sigmoid cross entropy given `logits`.
251
252 Measures the probability error in discrete classification tasks in which each
253 class is independent and not mutually exclusive. For instance, one could
254 perform multilabel classification where a picture can contain both an elephant
255 and a dog at the same time.
256
257 For brevity, let `x = logits`, `z = targets`. The logistic loss is
258
259 z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
260 = z * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
261 = z * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
262 = z * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
263 = (1 - z) * x + log(1 + exp(-x))
264 = x - x * z + log(1 + exp(-x))
265
266 To ensure stability and avoid overflow, the implementation uses
267
268 max(x, 0) - x * z + log(1 + exp(-abs(x)))
269
270 `logits` and `targets` must have the same type and shape.
271
272 Args:
273 logits: A `Tensor` of type `float32` or `float64`.
274 targets: A `Tensor` of the same type and shape as `logits`.
275 name: A name for the operation (optional).
276
277 Returns:
278 A `Tensor` of the same shape as `logits` with the componentwise
279 logistic losses.
280
281 Raises:
282 ValueError: If `logits` and `targets` do not have the same shape.
283 """
284 with ops.op_scope([logits, targets], name, "logistic_loss") as name:
285 logits = ops.convert_to_tensor(logits, name="logits")
286 targets = ops.convert_to_tensor(targets, name="targets")
287 try:
288 targets.get_shape().merge_with(logits.get_shape())
289 except ValueError:
290 raise ValueError(
291 "logits and targets must have the same shape (%s vs %s)"
292 % (logits.get_shape(), targets.get_shape()))
293
294 # The logistic loss formula from above is
295 # x - x * z + log(1 + exp(-x))
296 # For x < 0, a more numerically stable formula is
297 # -x * z + log(1 + exp(x))
298 # To avoid branching, we use the combined version
299 # max(x, 0) - x * z + log(1 + exp(-abs(x)))
300 return math_ops.add(nn_ops.relu(logits) - logits * targets,
301 math_ops.log(1 + math_ops.exp(-math_ops.abs(logits))),
302 name=name)
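As a quick sanity check of the numerically stable formula used above, the following NumPy sketch (an illustration, not part of the TensorFlow code) compares it against the naive logistic loss; the two agree wherever the naive form does not overflow.

import numpy as np

def naive_logistic_loss(x, z):
    # z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)); overflows for large |x|.
    s = 1.0 / (1.0 + np.exp(-x))
    return -z * np.log(s) - (1 - z) * np.log(1 - s)

def stable_logistic_loss(x, z):
    # max(x, 0) - x * z + log(1 + exp(-abs(x))), the form used by the op above.
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
z = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(np.allclose(naive_logistic_loss(x, z), stable_logistic_loss(x, z)))  # True
print(stable_logistic_loss(np.array([1000.0]), np.array([0.0])))           # [1000.], no overflow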
303
304
305 def weighted_cross_entropy_with_logits(logits, targets, pos_weight,
306 name=None):
307 """Computes a weighted cross entropy.
308
309 This is like `sigmoid_cross_entropy_with_logits()` except that `pos_weight`
310 allows one to trade off recall and precision by up- or down-weighting the
311 cost of a positive error relative to a negative error.
312
313 The usual cross-entropy cost is defined as:
314
315 targets * -log(sigmoid(logits)) + (1 - targets) * -log(1 - sigmoid(logits))
316
317 The argument `pos_weight` is used as a multiplier for the positive targets:
318
319 targets * -log(sigmoid(logits)) * pos_weight +
320 (1 - targets) * -log(1 - sigmoid(logits))
321
322 For brevity, let `x = logits`, `z = targets`, `q = pos_weight`.
323 The loss is:
324
325 qz * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
326 = qz * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
327 = qz * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
328 = qz * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
329 = (1 - z) * x + (qz + 1 - z) * log(1 + exp(-x))
330 = (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(-x))
331
332 Setting `l = (1 + (q - 1) * z)`, to ensure stability and avoid overflow,
333 the implementation uses
334
335 (1 - z) * x + l * (log(1 + exp(-abs(x))) + max(-x, 0))
336
337 `logits` and `targets` must have the same type and shape.
338
339 Args:
340 logits: A `Tensor` of type `float32` or `float64`.
341 targets: A `Tensor` of the same type and shape as `logits`.
342 pos_weight: A coefficient to use on the positive examples.
343 name: A name for the operation (optional).
344
345 Returns:
346 A `Tensor` of the same shape as `logits` with the componentwise
347 weighted logistic losses.
348
349 Raises:
350 ValueError: If `logits` and `targets` do not have the same shape.
351 """
352 with ops.op_scope([logits, targets], name, "logistic_loss") as name:
353 logits = ops.convert_to_tensor(logits, name="logits")
354 targets = ops.convert_to_tensor(targets, name="targets")
355 try:
356 targets.get_shape().merge_with(logits.get_shape())
357 except ValueError:
358 raise ValueError(
359 "logits and targets must have the same shape (%s vs %s)"
360 % (logits.get_shape(), targets.get_shape()))
361
362 # The logistic loss formula from above is
363 # (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(-x))
364 # For x < 0, a more numerically stable formula is
365 # (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(x)) - l * x
366 # To avoid branching, we use the combined version
367 # (1 - z) * x + l * (log(1 + exp(-abs(x))) + max(-x, 0))
368 log_weight = 1 + (pos_weight - 1) * targets
369 return math_ops.add(
370 (1 - targets) * logits,
371 log_weight * (math_ops.log(1 + math_ops.exp(-math_ops.abs(logits))) +
372 nn_ops.relu(-logits)),
373 name=name)
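The same kind of check works for the weighted form. The short NumPy sketch below (again an illustration, with arbitrary example values) shows that pos_weight only scales the loss of positive targets.

import numpy as np

def weighted_logistic_loss(x, z, q):
    # (1 - z) * x + (1 + (q - 1) * z) * (log(1 + exp(-abs(x))) + max(-x, 0))
    l = 1 + (q - 1) * z
    return (1 - z) * x + l * (np.log1p(np.exp(-np.abs(x))) + np.maximum(-x, 0))

x = np.array([2.0, 2.0])
z = np.array([1.0, 0.0])                     # one positive target, one negative target
print(weighted_logistic_loss(x, z, q=1.0))   # same as the unweighted logistic loss
print(weighted_logistic_loss(x, z, q=5.0))   # only the positive example's loss grows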
374
375
376 def relu_layer(x, weights, biases, name=None):
377 """Computes Relu(x * weight + biases).
378
379 Args:
380 x: a 2D tensor. Dimensions typically: batch, in_units
381 weights: a 2D tensor. Dimensions typically: in_units, out_units
382 biases: a 1D tensor. Dimensions: out_units
383 name: A name for the operation (optional). If not specified
384 "nn_relu_layer" is used.
385
386 Returns:
387 A 2-D Tensor computing relu(matmul(x, weights) + biases).
388 Dimensions typically: batch, out_units.
389 """
390 with ops.op_scope([x, weights, biases], name, "relu_layer") as name:
391 x = ops.convert_to_tensor(x, name="x")
392 weights = ops.convert_to_tensor(weights, name="weights")
393 biases = ops.convert_to_tensor(biases, name="biases")
394 xw_plus_b = nn_ops.bias_add(math_ops.matmul(x, weights), biases)
395 return nn_ops.relu(xw_plus_b, name=name)
396
397
398 def l2_normalize(x, dim, epsilon=1e-12, name=None):
399 """Normalizes along dimension `dim` using an L2 norm.
400
401 For a 1-D tensor with `dim = 0`, computes
402
403 output = x / sqrt(max(sum(x**2), epsilon))
404
405 For `x` with more dimensions, independently normalizes each 1-D slice along
406 dimension `dim`.
407
408 Args:
409 x: A `Tensor`.
410 dim: Dimension along which to normalize.
411 epsilon: A lower bound value for the norm. Will use `sqrt(epsilon)` as the
412 divisor if `norm < sqrt(epsilon)`.
413 name: A name for this operation (optional).
414
415 Returns:
416 A `Tensor` with the same shape as `x`.
417 """
418 with ops.op_scope([x], name, "l2_normalize") as name:
419 x = ops.convert_to_tensor(x, name="x")
420 square_sum = math_ops.reduce_sum(math_ops.square(x), [dim], keep_dims=True)
421 x_inv_norm = math_ops.rsqrt(math_ops.maximum(square_sum, epsilon))
422 return math_ops.mul(x, x_inv_norm, name=name)
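A NumPy equivalent of the computation above, shown purely for illustration, makes the epsilon floor easy to see:

import numpy as np

def l2_normalize_np(x, dim, epsilon=1e-12):
    # x * rsqrt(max(sum(x**2, dim), epsilon)), matching the op above.
    square_sum = np.sum(np.square(x), axis=dim, keepdims=True)
    return x / np.sqrt(np.maximum(square_sum, epsilon))

x = np.array([[3.0, 4.0], [0.0, 0.0]])
print(l2_normalize_np(x, dim=1))
# [[0.6 0.8]
#  [0.  0. ]]  -- the all-zero row stays at zero thanks to the epsilon floor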
423
424
425 def zero_fraction(value, name=None):
426 """Returns the fraction of zeros in `value`.
427
428 If `value` is empty, the result is `nan`.
429
430 This is useful in summaries to measure and report sparsity. For example,
431
432 z = tf.nn.relu(...)
433 summ = tf.scalar_summary('sparsity', tf.nn.zero_fraction(z))
434
435 Args:
436 value: A tensor of numeric type.
437 name: A name for the operation (optional).
438
439 Returns:
440 The fraction of zeros in `value`, with type `float32`.
441 """
442 with ops.op_scope([value], name, "zero_fraction"):
443 value = ops.convert_to_tensor(value, name="value")
444 zero = constant_op.constant(0, dtype=value.dtype, name="zero")
445 return math_ops.reduce_mean(math_ops.cast(math_ops.equal(value, zero),
446 dtypes.float32))
447
448
449 def depthwise_conv2d(input, filter, strides, padding, name=None):
450 """Depthwise 2-D convolution.
451
452 Given an input tensor of shape `[batch, in_height, in_width, in_channels]`
453 and a filter tensor of shape
454 `[filter_height, filter_width, in_channels, channel_multiplier]`
455 containing `in_channels` convolutional filters of depth 1, `depthwise_conv2d`
456 applies a different filter to each input channel (expanding from 1 channel
457 to `channel_multiplier` channels for each), then concatenates the results
458 together. The output has `in_channels * channel_multiplier` channels.
459
460 In detail,
461
462 output[b, i, j, k * channel_multiplier + q] =
463 sum_{di, dj} input[b, strides[1] * i + di, strides[2] * j + dj, k] *
464 filter[di, dj, k, q]
465
466 Must have `strides[0] = strides[3] = 1`. For the most common case of the
467 same horizontal and vertical strides, `strides = [1, stride, stride, 1]`.
468
469 Args:
470 input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
471 filter: 4-D with shape
472 `[filter_height, filter_width, in_channels, channel_multiplier]`.
473 strides: 1-D of size 4. The stride of the sliding window for each
474 dimension of `input`.
475 padding: A string, either `'VALID'` or `'SAME'`. The padding algorithm.
476 name: A name for this operation (optional).
477
478 Returns:
479 A 4-D `Tensor` of shape
480 `[batch, out_height, out_width, in_channels * channel_multiplier].`
481 """
482 with ops.op_scope([input, filter], name, "depthwise") as name:
483 input = ops.convert_to_tensor(input, name="tensor_in")
484 filter = ops.convert_to_tensor(filter, name="filter_in")
485 # A shape is required to statically compute the number of separable filters.
486 if filter.get_shape().ndims is not None:
487 assert len(filter.get_shape()) == 4
488 in_channels = filter.get_shape()[2]
489 # Sanity checks, if shape information is available for the inputs.
490 if input.get_shape().ndims is not None:
491 assert len(input.get_shape()) == 4
492 assert input.get_shape()[3] == in_channels, (
493 "Mismatched input depth %d and number of depthwise filters %d." % (
494 input.get_shape()[3].value, in_channels))
495 else:
496 assert input.get_shape().ndims is not None, (
497 "Either tensor must provide static shape information.")
498 assert input.get_shape().ndims == 4
499 in_channels = input.get_shape()[3]
500
501 if in_channels == 1:
502 return nn_ops.conv2d(input, filter, strides, padding, name=name)
503 else:
504 # Create one separate convolution per channel.
505 convs = []
506 for channel in xrange(in_channels):
507 with ops.name_scope("depth%d" % channel) as channel_scope:
508 t_in = array_ops.slice(input, [0, 0, 0, channel], [-1, -1, -1, 1],
509 name="slice_inputs")
510 f_in = array_ops.slice(filter, [0, 0, channel, 0], [-1, -1, 1, -1],
511 name="slice_params")
512 convs.append(nn_ops.conv2d(t_in, f_in,
513 strides, padding, name=channel_scope))
514 # Concatenate the per-channel convolutions along the channel dimension.
515 return array_ops.concat(3, convs, name=name)
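To make the indexing formula in the docstring concrete, here is a naive NumPy version of a stride-1, VALID depthwise convolution. It is an illustrative sketch only; the function name and the example shapes are made up here, and no attempt is made at efficiency.

import numpy as np

def depthwise_conv2d_naive(inp, filt):
    """Naive VALID, stride-1 depthwise convolution mirroring the formula above.

    inp:  [batch, in_height, in_width, in_channels]
    filt: [filter_height, filter_width, in_channels, channel_multiplier]
    """
    b, ih, iw, ic = inp.shape
    fh, fw, _, cm = filt.shape
    out = np.zeros((b, ih - fh + 1, iw - fw + 1, ic * cm))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = inp[:, i:i + fh, j:j + fw, :]                # [b, fh, fw, ic]
            # output[b, i, j, k*cm + q] = sum_{di,dj} patch[b, di, dj, k] * filt[di, dj, k, q]
            out[:, i, j, :] = np.einsum('bxyk,xykq->bkq', patch, filt).reshape(b, ic * cm)
    return out

inp = np.random.rand(2, 5, 5, 3)     # 3 input channels
filt = np.random.rand(3, 3, 3, 2)    # channel_multiplier = 2
print(depthwise_conv2d_naive(inp, filt).shape)   # (2, 3, 3, 6): 3 * 2 = 6 output channels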
516
517
518 def separable_conv2d(input, depthwise_filter, pointwise_filter, strides,
519 padding,
520 name=None):
521 """2-D convolution with separable filters.
522
523 Performs a depthwise convolution that acts separately on channels followed by
524 a pointwise convolution that mixes channels. Note that this is separability
525 between dimensions `[1, 2]` and `3`, not spatial separability between
526 dimensions `1` and `2`.
527
528 In detail,
529
530 output[b, i, j, k] = sum_{di, dj, q, r}
531 input[b, strides[1] * i + di, strides[2] * j + dj, q] *
532 depthwise_filter[di, dj, q, r] *
533 pointwise_filter[0, 0, q * channel_multiplier + r, k]
534
535 `strides` controls the strides for the depthwise convolution only, since
536 the pointwise convolution has implicit strides of `[1, 1, 1, 1]`. Must have
537 `strides[0] = strides[3] = 1`. For the most common case of the same
538 horizontal and vertical strides, `strides = [1, stride, stride, 1]`.
539
540 Args:
541 input: 4-D `Tensor` with shape `[batch, in_height, in_width, in_channels]`.
542 depthwise_filter: 4-D `Tensor` with shape
543 `[filter_height, filter_width, in_channels, channel_multiplier]`.
544 Contains `in_channels` convolutional filters of depth 1.
545 pointwise_filter: 4-D `Tensor` with shape
546 `[1, 1, channel_multiplier * in_channels, out_channels]`. Pointwise
547 filter to mix channels after `depthwise_filter` has convolved spatially.
548 strides: 1-D of size 4. The strides for the depthwise convolution for
549 each dimension of `input`.
550 padding: A string, either `'VALID'` or `'SAME'`. The padding algorithm.
551 name: A name for this operation (optional).
552
553 Returns:
554 A 4-D `Tensor` of shape `[batch, out_height, out_width, out_channels]`.
555 """
556 with ops.op_scope([input, depthwise_filter, pointwise_filter],
557 name, "separable_conv2d") as name:
558 input = ops.convert_to_tensor(input, name="tensor_in")
559 depthwise_filter = ops.convert_to_tensor(depthwise_filter,
560 name="depthwise_filter")
561 pointwise_filter = ops.convert_to_tensor(pointwise_filter,
562 name="pointwise_filter")
563
564 if pointwise_filter.get_shape().ndims is not None:
565 assert len(pointwise_filter.get_shape()) == 4
566 assert pointwise_filter.get_shape()[0] == 1
567 assert pointwise_filter.get_shape()[1] == 1
568 if depthwise_filter.get_shape().ndims and input.get_shape().ndims:
569 channel_multiplier = depthwise_filter.get_shape()[3]
570 in_channels = input.get_shape()[3]
571 out_channels = pointwise_filter.get_shape()[3]
572 # This would mean the separable convolution is over-parametrized.
573 assert channel_multiplier * in_channels < out_channels
574 # The layout of the ops in the graph are expected to be as follows:
575 # separable_conv2d // Conv2D op corresponding to the pointwise conv.
576 # separable_conv2d/depthwise // Concat op for the depthwise outputs.
577 # separable_conv2d/depthwise/depth0 // Conv2D op for depth 0
578 # separable_conv2d/depthwise/depth1 // Conv2D op for depth 1
579 # separable_conv2d/depthwise/depth2 // Conv2D op for depth 2
580 depthwise = depthwise_conv2d(input, depthwise_filter, strides,
581 padding, name="depthwise")
582 return nn_ops.conv2d(depthwise, pointwise_filter, [1, 1, 1, 1],
583 padding="VALID", name=name)
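A rough parameter count shows why this factorization is attractive. The numbers below are arbitrary example sizes, not anything from the TensorFlow source:

# Rough parameter-count comparison (illustrative arithmetic only).
filter_h = filter_w = 3
in_channels, channel_multiplier, out_channels = 32, 1, 64

full_conv_params = filter_h * filter_w * in_channels * out_channels         # 18432
depthwise_params = filter_h * filter_w * in_channels * channel_multiplier   # 288
pointwise_params = 1 * 1 * in_channels * channel_multiplier * out_channels  # 2048
separable_params = depthwise_params + pointwise_params                      # 2336

print(full_conv_params, separable_params)  # the separable factorization is roughly 8x smaller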
584
585
586 def sufficient_statistics(x, axes, shift=True, keep_dims=False, name=None):
587 """Calculate the sufficient statistics for the mean and variance of `x`.
588
589 These sufficient statistics are computed using the one pass algorithm on
590 an input that's optionally shifted using the value of the 1st element in `x`.
591 See:
592 https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
593
594 Args:
595 x: A `Tensor`.
596 axes: Array of ints. Axes along which to compute mean and variance.
597 shift: If true, shift the data to provide more numerically stable results.
598 keep_dims: produce statistics with the same dimensionality as the input.
599 name: Name used to scope the operations that compute the sufficient stats.
600
601 Returns:
602 Four `Tensor` objects of the same type as `x`:
603 * the count (number of elements to average over).
604 * the (possibly shifted) sum of the elements in the array.
605 * the (possibly shifted) sum of squares of the elements in the array.
606 * the shift by which the mean must be corrected or None if `shift` is False.
607 """
608 with ops.op_scope([x, axes], name, "sufficient_statistics"):
609 x = ops.convert_to_tensor(x, name="x")
610 x_shape = x.get_shape()
611 if x_shape.is_fully_defined():
612 counts = 1
613 m_shape = []
614 for d in xrange(x_shape.ndims):
615 dim = x_shape[d].value
616 if d in set(axes):
617 counts *= dim
618 dim = 1
619 m_shape.append(dim)
620 counts = constant_op.constant(counts, dtype=x.dtype)
621 else: # shape needs to be inferred at runtime.
622 x_shape = array_ops.shape(x)
623 select_axes = sparse_ops.sparse_to_dense(axes, array_ops.shape(x_shape),
624 True, False)
625 m_shape = math_ops.select(select_axes, array_ops.ones_like(x_shape),
626 x_shape)
627 counts = math_ops.cast(
628 math_ops.reduce_prod(x_shape / m_shape),
629 x.dtype,
630 name="count")
631 if shift:
632 shift_value = array_ops.slice(x, array_ops.zeros_like(m_shape), m_shape)
633 m_ss = math_ops.sub(x, shift_value)
634 v_ss = math_ops.squared_difference(x, shift_value)
635 if keep_dims:
636 shift_value = array_ops.identity(shift_value, name="shift")
637 else:
638 shift_value = array_ops.squeeze(shift_value,
639 squeeze_dims=axes,
640 name="shift")
641 else: # not shift.
642 m_ss = x
643 v_ss = math_ops.square(x)
644 shift_value = None
645 m_ss = math_ops.reduce_sum(m_ss, axes, keep_dims=keep_dims, name="mean_ss")
646 v_ss = math_ops.reduce_sum(v_ss, axes, keep_dims=keep_dims, name="var_ss")
647 return counts, m_ss, v_ss, shift_value
648
649
650 def normalize_moments(counts, mean_ss, variance_ss, shift, name=None):
651 """Calculate the mean and variance of based on the sufficient statistics.
652
653 Args:
654 counts: A `Tensor` containing the total count of the data (one value).
655 mean_ss: A `Tensor` containing the mean sufficient statistics: the (possibly
656 shifted) sum of the elements to average over.
657 variance_ss: A `Tensor` containing the variance sufficient statistics: the
658 (possibly shifted) squared sum of the data to compute the variance over.
659 shift: A `Tensor` containing the value by which the data is shifted for
660 numerical stability, or `None` if no shift was performed.
661 name: Name used to scope the operations that compute the moments.
662
663 Returns:
664 Two `Tensor` objects: `mean` and `variance`.
665 """
666 with ops.op_scope([counts, mean_ss, variance_ss, shift], name, "normalize"):
667 divisor = math_ops.inv(counts, name="divisor")
668 if shift is not None:
669 shifted_mean = math_ops.mul(mean_ss, divisor, name="shifted_mean")
670 mean = math_ops.add(shifted_mean, shift, name="mean")
671 else: # no shift.
672 shifted_mean = math_ops.mul(mean_ss, divisor, name="mean")
673 mean = shifted_mean
674 variance = math_ops.sub(
675 math_ops.mul(variance_ss, divisor),
676 math_ops.square(shifted_mean),
677 name="variance")
678 return (mean, variance)
679
680
681 def moments(x, axes, name=None, keep_dims=False):
682 """Calculate the mean and variance of `x`.
683
684 The mean and variance are calculated by aggregating the contents of `x`
685 across `axes`. If `x` is 1-D and `axes = [0]` this is just the mean
686 and variance of a vector.
687
688 When using these moments for batch normalization (see
689 `tf.nn.batch_normalization`):
690 * for so-called "global normalization", used with convolutional filters with
691 shape `[batch, height, width, depth]`, pass `axes=[0, 1, 2]`.
692 * for simple batch normalization pass `axes=[0]` (batch only).
693
694 Args:
695 x: A `Tensor`.
696 axes: array of ints. Axes along which to compute mean and
697 variance.
698 keep_dims: produce moments with the same dimensionality as the input.
699 name: Name used to scope the operations that compute the moments.
700
701 Returns:
702 Two `Tensor` objects: `mean` and `variance`.
703 """
704 with ops.op_scope([x, axes], name, "moments"):
705 counts, m_ss, v_ss, shift = sufficient_statistics(x,
706 axes,
707 keep_dims=keep_dims,
708 name=name)
709 return normalize_moments(counts, m_ss, v_ss, shift, name=name)
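The shifted one-pass computation used by sufficient_statistics and normalize_moments can be checked against NumPy directly. The sketch below uses made-up data and mirrors the arithmetic of the two functions above rather than calling them:

import numpy as np

x = np.random.rand(4, 5).astype(np.float64)
axes = (0,)                    # compute moments over the batch axis
shift = x[:1, :]               # shift by the first element along the reduced axes

counts = x.shape[0]
mean_ss = np.sum(x - shift, axis=axes)             # (possibly shifted) sum
var_ss = np.sum(np.square(x - shift), axis=axes)   # (possibly shifted) sum of squares

shifted_mean = mean_ss / counts
mean = shifted_mean + shift.squeeze(0)             # undo the shift
variance = var_ss / counts - np.square(shifted_mean)

print(np.allclose(mean, np.mean(x, axis=0)))       # True
print(np.allclose(variance, np.var(x, axis=0)))    # True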
710
711
712 def batch_normalization(x,
713 mean,
714 variance,
715 offset,
716 scale,
717 variance_epsilon,
718 name=None):
719 """Batch normalization.
720
721 As described in http://arxiv.org/abs/1502.03167.
722 Normalizes a tensor by `mean` and `variance`, and applies (optionally) a
723 `scale` \\\\(\gamma\\\\) to it, as well as an `offset` \\\\(\\beta\\\\):
724
725 \\\\(\\frac{\gamma(x-\mu)}{\sigma}+\\beta\\\\)
726
727 `mean`, `variance`, `offset` and `scale` are all expected to be of one of two
728 shapes:
729 * In all generality, they can have the same number of dimensions as the
730 input `x`, with identical sizes as `x` for the dimensions that are not
731 normalized over (the 'depth' dimension(s)), and dimension 1 for the
732 others which are being normalized over.
733 `mean` and `variance` in this case would typically be the outputs of
734 `tf.nn.moments(..., keep_dims=True)` during training, or running averages
735 thereof during inference.
736 * In the common case where the 'depth' dimension is the last dimension in
737 the input tensor `x`, they may be one dimensional tensors of the same
738 size as the 'depth' dimension.
739 This is the case for example for the common `[batch, depth]` layout of
740 fully-connected layers, and `[batch, height, width, depth]` for
741 convolutions.
742 `mean` and `variance` in this case would typically be the outputs of
743 `tf.nn.moments(..., keep_dims=False)` during training, or running averages
744 thereof during inference.
745
746 Args:
747 x: Input `Tensor` of arbitrary dimensionality.
748 mean: A mean `Tensor`.
749 variance: A variance `Tensor`.
750 offset: An offset `Tensor`, often denoted \\\\(\\beta\\\\) in equations, or
751 None. If present, will be added to the normalized tensor.
752 scale: A scale `Tensor`, often denoted \\\\(\gamma\\\\) in equations, or
753 `None`. If present, the scale is applied to the normalized tensor.
754 variance_epsilon: A small float number to avoid dividing by 0.
755 name: A name for this operation (optional).
756
757 Returns:
758 the normalized, scaled, offset tensor.
759 """
760 with ops.op_scope([x, mean, variance, scale, offset], name, "batchnorm"):
761 inv = math_ops.rsqrt(variance + variance_epsilon)
762 if scale is not None:
763 inv *= scale
764 return x * inv + (
765 offset - mean * inv if offset is not None else -mean * inv)
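The arithmetic above is just the textbook batch-normalization formula rearranged around a single rsqrt. A small NumPy illustration, with arbitrary example shapes and constants:

import numpy as np

x = np.random.rand(8, 4)                        # [batch, depth]
mean, variance = x.mean(axis=0), x.var(axis=0)  # roughly what tf.nn.moments(x, [0]) computes
offset, scale, eps = 0.1, 2.0, 1e-3

# Same arithmetic as the function above: inv = rsqrt(var + eps), optionally scaled.
inv = 1.0 / np.sqrt(variance + eps)
inv *= scale
y = x * inv + (offset - mean * inv)

# Equivalent textbook form: gamma * (x - mean) / sqrt(var + eps) + beta.
y_ref = scale * (x - mean) / np.sqrt(variance + eps) + offset
print(np.allclose(y, y_ref))   # True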
766
767
768 def batch_norm_with_global_normalization(t,
769 m,
770 v,
771 beta,
772 gamma,
773 variance_epsilon,
774 scale_after_normalization,
775 name=None):
776 """Batch normalization.
777
778 This op is deprecated. See `tf.nn.batch_normalization`.
779
780 Args:
781 t: A 4D input Tensor.
782 m: A 1D mean Tensor with size matching the last dimension of t.
783 This is the first output from tf.nn.moments,
784 or a saved moving average thereof.
785 v: A 1D variance Tensor with size matching the last dimension of t.
786 This is the second output from tf.nn.moments,
787 or a saved moving average thereof.
788 beta: A 1D beta Tensor with size matching the last dimension of t.
789 An offset to be added to the normalized tensor.
790 gamma: A 1D gamma Tensor with size matching the last dimension of t.
791 If "scale_after_normalization" is true, this tensor will be multiplied
792 with the normalized tensor.
793 variance_epsilon: A small float number to avoid dividing by 0.
794 scale_after_normalization: A bool indicating whether the resulted tensor
795 needs to be multiplied with gamma.
796 name: A name for this operation (optional).
797
798 Returns:
799 A batch-normalized `t`.
800 """
801 return batch_normalization(t, m, v, beta, gamma if scale_after_normalization
802 else None, variance_epsilon, name)
803
804
805 def _sum_rows(x):
806 """Returns a vector summing up each row of the matrix x."""
807 # _sum_rows(x) is equivalent to math_ops.reduce_sum(x, 1) when x is
808 # a matrix. The gradient of _sum_rows(x) is more efficient than
809 # reduce_sum(x, 1)'s gradient in today's implementation. Therefore,
810 # we use _sum_rows(x) in the nce_loss() computation since the loss
811 # is mostly used for training.
812 cols = array_ops.shape(x)[1]
813 ones_shape = array_ops.pack([cols, 1])
814 ones = array_ops.ones(ones_shape, x.dtype)
815 return array_ops.reshape(math_ops.matmul(x, ones), [-1])
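The matmul-with-ones trick in _sum_rows is easy to verify with NumPy (illustrative only):

import numpy as np

x = np.arange(6.0).reshape(2, 3)
ones = np.ones((x.shape[1], 1))
print((x @ ones).reshape(-1))   # [ 3. 12.] -- same as summing each row
print(x.sum(axis=1))            # [ 3. 12.]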
816
817
818 def _compute_sampled_logits(weights, biases, inputs, labels, num_sampled,
819 num_classes, num_true=1,
820 sampled_values=None,
821 subtract_log_q=True,
822 remove_accidental_hits=False,
823 partition_strategy="mod",
824 name=None):
825 """Helper function for nce_loss and sampled_softmax_loss functions.
826
827 Computes sampled output training logits and labels suitable for implementing
828 e.g. noise-contrastive estimation (see nce_loss) or sampled softmax (see
829 sampled_softmax_loss).
830
831 Note: In the case where num_true > 1, we assign to each target class
832 the target probability 1 / num_true so that the target probabilities
833 sum to 1 per-example.
834
835 Args:
836 weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
837 objects whose concatenation along dimension 0 has shape
838 `[num_classes, dim]`. The (possibly-partitioned) class embeddings.
839 biases: A `Tensor` of shape `[num_classes]`. The class biases.
840 inputs: A `Tensor` of shape `[batch_size, dim]`. The forward
841 activations of the input network.
842 labels: A `Tensor` of type `int64` and shape `[batch_size,
843 num_true]`. The target classes. Note that this format differs from
844 the `labels` argument of `nn.softmax_cross_entropy_with_logits`.
845 num_sampled: An `int`. The number of classes to randomly sample per batch.
846 num_classes: An `int`. The number of possible classes.
847 num_true: An `int`. The number of target classes per training example.
848 sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
849 `sampled_expected_count`) returned by a `*_candidate_sampler` function.
850 (if None, we default to `log_uniform_candidate_sampler`)
851 subtract_log_q: A `bool`. whether to subtract the log expected count of
852 the labels in the sample to get the logits of the true labels.
853 Default is True. Turn off for Negative Sampling.
854 remove_accidental_hits: A `bool`. whether to remove "accidental hits"
855 where a sampled class equals one of the target classes. Default is
856 False.
857 partition_strategy: A string specifying the partitioning strategy, relevant
858 if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
859 Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
860 name: A name for the operation (optional).
861 Returns:
862 out_logits, out_labels: `Tensor` objects each with shape
863 `[batch_size, num_true + num_sampled]`, for passing to either
864 `nn.sigmoid_cross_entropy_with_logits` (NCE) or
865 `nn.softmax_cross_entropy_with_logits` (sampled softmax).
866 """
867
868 if not isinstance(weights, list):
869 weights = [weights]
870
871 with ops.op_scope(
872 weights + [biases, inputs, labels], name, "compute_sampled_logits"):
873 if labels.dtype != dtypes.int64:
874 labels = math_ops.cast(labels, dtypes.int64)
875 labels_flat = array_ops.reshape(labels, [-1])
876
877 # Sample the negative labels.
878 # sampled shape: [num_sampled] tensor
879 # true_expected_count shape = [batch_size, 1] tensor
880 # sampled_expected_count shape = [num_sampled] tensor
881 if sampled_values is None:
882 sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
883 true_classes=labels,
884 num_true=num_true,
885 num_sampled=num_sampled,
886 unique=True,
887 range_max=num_classes)
888 # NOTE: pylint cannot tell that 'sampled_values' is a sequence
889 # pylint: disable=unpacking-non-sequence
890 sampled, true_expected_count, sampled_expected_count = sampled_values
891 # pylint: enable=unpacking-non-sequence
892
893 # labels_flat is a [batch_size * num_true] tensor
894 # sampled is a [num_sampled] int tensor
895 all_ids = array_ops.concat(0, [labels_flat, sampled])
896
897 # weights shape is [num_classes, dim]
898 all_w = embedding_ops.embedding_lookup(
899 weights, all_ids, partition_strategy=partition_strategy)
900 all_b = embedding_ops.embedding_lookup(biases, all_ids)
901 # true_w shape is [batch_size * num_true, dim]
902 # true_b is a [batch_size * num_true] tensor
903 true_w = array_ops.slice(
904 all_w, [0, 0], array_ops.pack([array_ops.shape(labels_flat)[0], -1]))
905 true_b = array_ops.slice(all_b, [0], array_ops.shape(labels_flat))
906
907 # inputs shape is [batch_size, dim]
908 # true_w shape is [batch_size * num_true, dim]
909 # row_wise_dots is [batch_size, num_true, dim]
910 dim = array_ops.shape(true_w)[1:2]
911 new_true_w_shape = array_ops.concat(0, [[-1, num_true], dim])
912 row_wise_dots = math_ops.mul(
913 array_ops.expand_dims(inputs, 1),
914 array_ops.reshape(true_w, new_true_w_shape))
915 # We want the row-wise dot plus biases which yields a
916 # [batch_size, num_true] tensor of true_logits.
917 dots_as_matrix = array_ops.reshape(row_wise_dots,
918 array_ops.concat(0, [[-1], dim]))
919 true_logits = array_ops.reshape(_sum_rows(dots_as_matrix), [-1, num_true])
920 true_b = array_ops.reshape(true_b, [-1, num_true])
921 true_logits += true_b
922
923 # Lookup weights and biases for sampled labels.
924 # sampled_w shape is [num_sampled, dim]
925 # sampled_b is a [num_sampled] float tensor
926 sampled_w = array_ops.slice(
927 all_w, array_ops.pack([array_ops.shape(labels_flat)[0], 0]), [-1, -1])
928 sampled_b = array_ops.slice(all_b, array_ops.shape(labels_flat), [-1])
929
930 # inputs has shape [batch_size, dim]
931 # sampled_w has shape [num_sampled, dim]
932 # sampled_b has shape [num_sampled]
933 # Apply X*W'+B, which yields [batch_size, num_sampled]
934 sampled_logits = math_ops.matmul(inputs,
935 sampled_w,
936 transpose_b=True) + sampled_b
937
938 if remove_accidental_hits:
939 acc_hits = candidate_sampling_ops.compute_accidental_hits(
940 labels, sampled, num_true=num_true)
941 acc_indices, acc_ids, acc_weights = acc_hits
942
943 # This is how SparseToDense expects the indices.
944 acc_indices_2d = array_ops.reshape(acc_indices, [-1, 1])
945 acc_ids_2d_int32 = array_ops.reshape(math_ops.cast(
946 acc_ids, dtypes.int32), [-1, 1])
947 sparse_indices = array_ops.concat(
948 1, [acc_indices_2d, acc_ids_2d_int32], "sparse_indices")
949 # Create sampled_logits_shape = [batch_size, num_sampled]
950 sampled_logits_shape = array_ops.concat(
951 0,
952 [array_ops.shape(labels)[:1], array_ops.expand_dims(num_sampled, 0)])
953 if sampled_logits.dtype != acc_weights.dtype:
954 acc_weights = math_ops.cast(acc_weights, sampled_logits.dtype)
955 sampled_logits += sparse_ops.sparse_to_dense(
956 sparse_indices, sampled_logits_shape, acc_weights,
957 default_value=0.0, validate_indices=False)
958
959 if subtract_log_q:
960 # Subtract log of Q(l), prior probability that l appears in sampled.
961 true_logits -= math_ops.log(true_expected_count)
962 sampled_logits -= math_ops.log(sampled_expected_count)
963
964 # Construct output logits and labels. The true labels/logits start at col 0.
965 out_logits = array_ops.concat(1, [true_logits, sampled_logits])
966 # true_logits is a float tensor, ones_like(true_logits) is a float tensor
967 # of ones. We then divide by num_true to ensure the per-example labels sum
968 # to 1.0, i.e. form a proper probability distribution.
969 out_labels = array_ops.concat(
970 1, [array_ops.ones_like(true_logits) / num_true,
971 array_ops.zeros_like(sampled_logits)])
972
973 return out_logits, out_labels
974
975
976 def nce_loss(weights, biases, inputs, labels, num_sampled, num_classes,
977 num_true=1,
978 sampled_values=None,
979 remove_accidental_hits=False,
980 partition_strategy="mod",
981 name="nce_loss"):
982 """Computes and returns the noise-contrastive estimation training loss.
983
984 See [Noise-contrastive estimation: A new estimation principle for
985 unnormalized statistical models]
986 (http://www.jmlr.org/proceedings/papers/v9/gutmann10a/gutmann10a.pdf).
987 Also see our [Candidate Sampling Algorithms Reference]
988 (../../extras/candidate_sampling.pdf)
989
990 Note: In the case where `num_true` > 1, we assign to each target class
991 the target probability 1 / `num_true` so that the target probabilities
992 sum to 1 per-example.
993
994 Note: It would be useful to allow a variable number of target classes per
995 example. We hope to provide this functionality in a future release.
996 For now, if you have a variable number of target classes, you can pad them
997 out to a constant number by either repeating them or by padding
998 with an otherwise unused class.
999
1000 Args:
1001 weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
1002 objects whose concatenation along dimension 0 has shape
1003 [num_classes, dim]. The (possibly-partitioned) class embeddings.
1004 biases: A `Tensor` of shape `[num_classes]`. The class biases.
1005 inputs: A `Tensor` of shape `[batch_size, dim]`. The forward
1006 activations of the input network.
1007 labels: A `Tensor` of type `int64` and shape `[batch_size,
1008 num_true]`. The target classes.
1009 num_sampled: An `int`. The number of classes to randomly sample per batch.
1010 num_classes: An `int`. The number of possible classes.
1011 num_true: An `int`. The number of target classes per training example.
1012 sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
1013 `sampled_expected_count`) returned by a `*_candidate_sampler` function.
1014 (if None, we default to `log_uniform_candidate_sampler`)
1015 remove_accidental_hits: A `bool`. Whether to remove "accidental hits"
1016 where a sampled class equals one of the target classes. If set to
1017 `True`, this is a "Sampled Logistic" loss instead of NCE, and we are
1018 learning to generate log-odds instead of log probabilities. See
1019 our [Candidate Sampling Algorithms Reference]
1020 (../../extras/candidate_sampling.pdf).
1021 Default is False.
1022 partition_strategy: A string specifying the partitioning strategy, relevant
1023 if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
1024 Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
1025 name: A name for the operation (optional).
1026
1027 Returns:
1028 A `batch_size` 1-D tensor of per-example NCE losses.
1029 """
1030 logits, labels = _compute_sampled_logits(
1031 weights, biases, inputs, labels, num_sampled, num_classes,
1032 num_true=num_true,
1033 sampled_values=sampled_values,
1034 subtract_log_q=True,
1035 remove_accidental_hits=remove_accidental_hits,
1036 partition_strategy=partition_strategy,
1037 name=name)
1038 sampled_losses = sigmoid_cross_entropy_with_logits(logits,
1039 labels,
1040 name="sampled_losses")
1041 # sampled_losses is batch_size x {true_loss, sampled_losses...}
1042 # We sum out true and sampled losses.
1043 return _sum_rows(sampled_losses)
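For context, a typical use of nce_loss is the word2vec-style setup sketched below. This sketch assumes the graph-mode API of the TensorFlow version this file comes from (it will not run unmodified on TensorFlow 2.x), and all variable names and sizes are illustrative rather than taken from the source.

import tensorflow as tf  # the 0.x-era graph API that this file belongs to

vocab_size, embed_dim, num_sampled, batch_size = 10000, 128, 64, 32

# Hypothetical word2vec-style model; names and initializers are illustrative.
embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embed_dim], stddev=0.05))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

center_words = tf.placeholder(tf.int32, shape=[batch_size])
context_words = tf.placeholder(tf.int64, shape=[batch_size, 1])   # [batch_size, num_true]

inputs = tf.nn.embedding_lookup(embeddings, center_words)          # [batch_size, embed_dim]
loss = tf.reduce_mean(
    tf.nn.nce_loss(nce_weights, nce_biases, inputs, context_words,
                   num_sampled=num_sampled, num_classes=vocab_size))
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)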
1044
1045
1046 def sampled_softmax_loss(weights, biases, inputs, labels, num_sampled,
1047 num_classes, num_true=1,
1048 sampled_values=None,
1049 remove_accidental_hits=True,
1050 partition_strategy="mod",
1051 name="sampled_softmax_loss"):
1052 """Computes and returns the sampled softmax training loss.
1053
1054 This is a faster way to train a softmax classifier over a huge number of
1055 classes.
1056
1057 This operation is for training only. It is generally an underestimate of
1058 the full softmax loss.
1059
1060 At inference time, you can compute full softmax probabilities with the
1061 expression `tf.nn.softmax(tf.matmul(inputs, weights) + biases)`.
1062
1063 See our [Candidate Sampling Algorithms Reference]
1064 (../../extras/candidate_sampling.pdf)
1065
1066 Also see Section 3 of [Jean et al., 2014](http://arxiv.org/abs/1412.2007)
1067 ([pdf](http://arxiv.org/pdf/1412.2007.pdf)) for the math.
1068
1069 Args:
1070 weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
1071 objects whose concatenation along dimension 0 has shape
1072 [num_classes, dim]. The (possibly-sharded) class embeddings.
1073 biases: A `Tensor` of shape `[num_classes]`. The class biases.
1074 inputs: A `Tensor` of shape `[batch_size, dim]`. The forward
1075 activations of the input network.
1076 labels: A `Tensor` of type `int64` and shape `[batch_size,
1077 num_true]`. The target classes. Note that this format differs from
1078 the `labels` argument of `nn.softmax_cross_entropy_with_logits`.
1079 num_sampled: An `int`. The number of classes to randomly sample per batch.
1080 num_classes: An `int`. The number of possible classes.
1081 num_true: An `int`. The number of target classes per training example.
1082 sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
1083 `sampled_expected_count`) returned by a `*_candidate_sampler` function.
1084 (if None, we default to `log_uniform_candidate_sampler`)
1085 remove_accidental_hits: A `bool`. whether to remove "accidental hits"
1086 where a sampled class equals one of the target classes. Default is
1087 True.
1088 partition_strategy: A string specifying the partitioning strategy, relevant
1089 if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
1090 Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
1091 name: A name for the operation (optional).
1092
1093 Returns:
1094 A `batch_size` 1-D tensor of per-example sampled softmax losses.
"""
logits, labels = _compute_sampled_logits(
weights, biases, inputs, labels, num_sampled, num_classes,
num_true=num_true,
sampled_values=sampled_values,
subtract_log_q=True,
remove_accidental_hits=remove_accidental_hits,
partition_strategy=partition_strategy,
name=name)
sampled_losses = nn_ops.softmax_cross_entropy_with_logits(logits, labels)
# sampled_losses is a [batch_size] tensor.
return sampled_losses


# TODO(cwhipkey): sigmoid and tanh should not be exposed from tf.nn.
__all__ = make_all(__name__)
__all__.append("zero_fraction") # documented in training.py
# Modules whitelisted for reference through tf.nn.
# TODO(cwhipkey): migrate callers to use the submodule directly.
__all__.extend(["nn_ops", "rnn_cell", "seq2seq"])
# Symbols whitelisted for export without documentation.
# TODO(cwhipkey): review these and move to contrib or expose through
# documentation.
__all__.extend([
"all_candidate_sampler",
"batch_norm_with_global_normalization",
"batch_normalization",
"bidirectional_rnn",
"conv2d_backprop_filter",
"conv2d_backprop_input",
"depthwise_conv2d_native",
"dynamic_rnn",
"lrn",
"relu_layer",
"rnn",
"state_saving_rnn",
"xw_plus_b",
])