Looking back at the earlier examples, you will notice that building deep neural networks relies on the core module tensorflow.nn. Let's dig into its source code to see exactly what it provides.
1 # Copyright 2015 Google Inc. All Rights Reserved.
2 #
3 # Licensed under the Apache License, Version 2.0 (the "License");
4 # you may not use this file except in compliance with the License.
5 # You may obtain a copy of the License at
6 #
7 # http://www.apache.org/licenses/LICENSE-2.0
8 #
9 # Unless required by applicable law or agreed to in writing, software
10 # distributed under the License is distributed on an "AS IS" BASIS,
11 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 # See the License for the specific language governing permissions and
13 # limitations under the License.
14 # ==============================================================================
15
16 # pylint: disable=unused-import,g-bad-import-order
17 """## Activation Functions
18
19 The activation ops provide different types of nonlinearities for use in neural
20 networks. These include smooth nonlinearities (`sigmoid`, `tanh`, `elu`,
21 `softplus`, and `softsign`), continuous but not everywhere differentiable
22 functions (`relu`, `relu6`, and `relu_x`), and random regularization
23 (`dropout`).
24
25 All activation ops apply componentwise, and produce a tensor of the same
26 shape as the input tensor.
27
28 @@relu
29 @@relu6
30 @@elu
31 @@softplus
32 @@softsign
33 @@dropout
34 @@bias_add
35 @@sigmoid
36 @@tanh
37
38 ## Convolution
39
40 The convolution ops sweep a 2-D filter over a batch of images, applying the
41 filter to each window of each image of the appropriate size. The different
42 ops trade off between generic vs. specific filters:
43
44 * `conv2d`: Arbitrary filters that can mix channels together.
45 * `depthwise_conv2d`: Filters that operate on each channel independently.
46 * `separable_conv2d`: A depthwise spatial filter followed by a pointwise filter.
47
48 Note that although these ops are called "convolution", they are strictly
49 speaking "cross-correlation" since the filter is combined with an input window
50 without reversing the filter. For details, see [the properties of
51 cross-correlation](https://en.wikipedia.org/wiki/Cross-correlation#Properties).
52
53 The filter is applied to image patches of the same size as the filter and
54 strided according to the `strides` argument. `strides = [1, 1, 1, 1]` applies
55 the filter to a patch at every offset, `strides = [1, 2, 2, 1]` applies the
56 filter to every other image patch in each dimension, etc.
57
58 Ignoring channels for the moment, assume that the 4-D `input` has shape
59 `[batch, in_height, in_width, ...]` and the 4-D `filter` has shape
60 `[filter_height, filter_width, ...]`, then the spatial semantics of the
61 convolution ops are as follows: first, according to the padding scheme chosen
62 as `'SAME'` or `'VALID'`, the output size and the padding pixels are computed.
63 For the `'SAME'` padding, the output height and width are computed as:
64
65 out_height = ceil(float(in_height) / float(strides[1]))
66 out_width = ceil(float(in_width) / float(strides[2]))
67
68 and the padding on the top and left are computed as:
69
70 pad_along_height = ((out_height - 1) * strides[1] +
71 filter_height - in_height)
72 pad_along_width = ((out_width - 1) * strides[2] +
73 filter_width - in_width)
74 pad_top = pad_along_height / 2
75 pad_left = pad_along_width / 2
76
77 Note that the division by 2 means that there might be cases when the padding on
78 both sides (top vs bottom, right vs left) are off by one. In this case, the
79 bottom and right sides always get the one additional padded pixel. For example,
80 when `pad_along_height` is 5, we pad 2 pixels at the top and 3 pixels at the
81 bottom. Note that this is different from existing libraries such as cuDNN and
82 Caffe, which explicitly specify the number of padded pixels and always pad the
83 same number of pixels on both sides.
84
85 For the `'VALID'` padding, the output height and width are computed as:
86
87 out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
88 out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))
89
90 and the padding values are always zero. The output is then computed as
91
92 output[b, i, j, :] =
93 sum_{di, dj} input[b, strides[1] * i + di - pad_top,
94 strides[2] * j + dj - pad_left, ...] *
95 filter[di, dj, ...]
96
97 where any value outside the original input image region is considered zero (
98 i.e. we pad zero values around the border of the image).
99
100 Since `input` is 4-D, each `input[b, i, j, :]` is a vector. For `conv2d`, these
101 vectors are multiplied by the `filter[di, dj, :, :]` matrices to produce new
102 vectors. For `depthwise_conv_2d`, each scalar component `input[b, i, j, k]`
103 is multiplied by a vector `filter[di, dj, k]`, and all the vectors are
104 concatenated.
105
106 @@conv2d
107 @@depthwise_conv2d
108 @@separable_conv2d
109 @@conv2d_transpose
110
111 ## Pooling
112
113 The pooling ops sweep a rectangular window over the input tensor, computing a
114 reduction operation for each window (average, max, or max with argmax). Each
115 pooling op uses rectangular windows of size `ksize` separated by offset
116 `strides`. For example, if `strides` is all ones every window is used, if
117 `strides` is all twos every other window is used in each dimension, etc.
118
119 In detail, the output is
120
121 output[i] = reduce(value[strides * i:strides * i + ksize])
122
123 where the indices also take into consideration the padding values. Please refer
124 to the `Convolution` section for details about the padding calculation.
125
126 @@avg_pool
127 @@max_pool
128 @@max_pool_with_argmax
129
130 ## Normalization
131
132 Normalization is useful to prevent neurons from saturating when inputs may
133 have varying scale, and to aid generalization.
134
135 @@l2_normalize
136 @@local_response_normalization
137 @@sufficient_statistics
138 @@normalize_moments
139 @@moments
140
141 ## Losses
142
143 The loss ops measure error between two tensors, or between a tensor and zero.
144 These can be used for measuring accuracy of a network in a regression task
145 or for regularization purposes (weight decay).
146
147 @@l2_loss
148
149 ## Classification
150
151 TensorFlow provides several operations that help you perform classification.
152
153 @@sigmoid_cross_entropy_with_logits
154 @@softmax
155 @@log_softmax
156 @@softmax_cross_entropy_with_logits
157 @@sparse_softmax_cross_entropy_with_logits
158 @@weighted_cross_entropy_with_logits
159
160 ## Embeddings
161
162 TensorFlow provides library support for looking up values in embedding
163 tensors.
164
165 @@embedding_lookup
166 @@embedding_lookup_sparse
167
168 ## Evaluation
169
170 The evaluation ops are useful for measuring the performance of a network.
171 Since they are nondifferentiable, they are typically used at evaluation time.
172
173 @@top_k
174 @@in_top_k
175
176 ## Candidate Sampling
177
178 Do you want to train a multiclass or multilabel model with thousands
179 or millions of output classes (for example, a language model with a
180 large vocabulary)? Training with a full Softmax is slow in this case,
181 since all of the classes are evaluated for every training example.
182 Candidate Sampling training algorithms can speed up your step times by
183 only considering a small randomly-chosen subset of contrastive classes
184 (called candidates) for each batch of training examples.
185
186 See our [Candidate Sampling Algorithms Reference]
187 (../../extras/candidate_sampling.pdf)
188
189 ### Sampled Loss Functions
190
191 TensorFlow provides the following sampled loss functions for faster training.
192
193 @@nce_loss
194 @@sampled_softmax_loss
195
196 ### Candidate Samplers
197
198 TensorFlow provides the following samplers for randomly sampling candidate
199 classes when using one of the sampled loss functions above.
200
201 @@uniform_candidate_sampler
202 @@log_uniform_candidate_sampler
203 @@learned_unigram_candidate_sampler
204 @@fixed_unigram_candidate_sampler
205
206 ### Miscellaneous candidate sampling utilities
207
208 @@compute_accidental_hits
209
210 """
211 from __future__ import absolute_import
212 from __future__ import division
213 from __future__ import print_function
214
215 from six.moves import xrange # pylint: disable=redefined-builtin
216
217 from tensorflow.python.framework import dtypes
218 from tensorflow.python.framework import ops
219 from tensorflow.python.framework import tensor_shape
220 from tensorflow.python.ops import array_ops
221 from tensorflow.python.ops import candidate_sampling_ops
222 from tensorflow.python.ops import constant_op
223 from tensorflow.python.ops import control_flow_ops
224 from tensorflow.python.ops import embedding_ops
225 from tensorflow.python.ops import init_ops
226 from tensorflow.python.ops import math_ops
227 from tensorflow.python.ops import nn_grad
228 from tensorflow.python.ops import nn_ops
229 from tensorflow.python.ops import numerics
230 from tensorflow.python.ops import random_ops
231 from tensorflow.python.ops import rnn_cell
232 from tensorflow.python.ops import seq2seq
233 from tensorflow.python.ops import sparse_ops
234 from tensorflow.python.ops import variable_scope as vs
235 from tensorflow.python.ops.math_ops import sigmoid
236 from tensorflow.python.ops.math_ops import tanh
237 from tensorflow.python.util.all_util import make_all
238
239 # Bring more nn-associated functionality into this package.
240 # go/tf-wildcard-import
241 # pylint: disable=wildcard-import
242 from tensorflow.python.ops.nn_ops import *
243 from tensorflow.python.ops.candidate_sampling_ops import *
244 from tensorflow.python.ops.embedding_ops import *
245 from tensorflow.python.ops.rnn import *
246 # pylint: enable=wildcard-import
247
248
249 def sigmoid_cross_entropy_with_logits(logits, targets, name=None):
250 """Computes sigmoid cross entropy given `logits`.
251
252 Measures the probability error in discrete classification tasks in which each
253 class is independent and not mutually exclusive. For instance, one could
254 perform multilabel classification where a picture can contain both an elephant
255 and a dog at the same time.
256
257 For brevity, let `x = logits`, `z = targets`. The logistic loss is
258
259 z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
260 = z * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
261 = z * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
262 = z * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
263 = (1 - z) * x + log(1 + exp(-x))
264 = x - x * z + log(1 + exp(-x))
265
266 To ensure stability and avoid overflow, the implementation uses
267
268 max(x, 0) - x * z + log(1 + exp(-abs(x)))
269
270 `logits` and `targets` must have the same type and shape.
271
272 Args:
273 logits: A `Tensor` of type `float32` or `float64`.
274 targets: A `Tensor` of the same type and shape as `logits`.
275 name: A name for the operation (optional).
276
277 Returns:
278 A `Tensor` of the same shape as `logits` with the componentwise
279 logistic losses.
280
281 Raises:
282 ValueError: If `logits` and `targets` do not have the same shape.
283 """
284 with ops.op_scope([logits, targets], name, "logistic_loss") as name:
285 logits = ops.convert_to_tensor(logits, name="logits")
286 targets = ops.convert_to_tensor(targets, name="targets")
287 try:
288 targets.get_shape().merge_with(logits.get_shape())
289 except ValueError:
290 raise ValueError(
291 "logits and targets must have the same shape (%s vs %s)"
292 % (logits.get_shape(), targets.get_shape()))
293
294 # The logistic loss formula from above is
295 # x - x * z + log(1 + exp(-x))
296 # For x < 0, a more numerically stable formula is
297 # -x * z + log(1 + exp(x))
298 # To avoid branching, we use the combined version
299 # max(x, 0) - x * z + log(1 + exp(-abs(x)))
300 return math_ops.add(nn_ops.relu(logits) - logits * targets,
301 math_ops.log(1 + math_ops.exp(-math_ops.abs(logits))),
302 name=name)
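As a quick sanity check of the numerically stable formula used above, the following NumPy sketch (an illustration, not part of the TensorFlow code) compares it against the naive logistic loss; the two agree wherever the naive form does not overflow.

import numpy as np

def naive_logistic_loss(x, z):
    # z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)); overflows for large |x|.
    s = 1.0 / (1.0 + np.exp(-x))
    return -z * np.log(s) - (1 - z) * np.log(1 - s)

def stable_logistic_loss(x, z):
    # max(x, 0) - x * z + log(1 + exp(-abs(x))), the form used by the op above.
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
z = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(np.allclose(naive_logistic_loss(x, z), stable_logistic_loss(x, z)))  # True
print(stable_logistic_loss(np.array([1000.0]), np.array([0.0])))           # [1000.], no overflow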
303
304
305 def weighted_cross_entropy_with_logits(logits, targets, pos_weight,
306 name=None):
307 """Computes a weighted cross entropy.
308
309 This is like `sigmoid_cross_entropy_with_logits()` except that `pos_weight`
310 allows one to trade off recall and precision by up- or down-weighting the
311 cost of a positive error relative to a negative error.
312
313 The usual cross-entropy cost is defined as:
314
315 targets * -log(sigmoid(logits)) + (1 - targets) * -log(1 - sigmoid(logits))
316
317 The argument `pos_weight` is used as a multiplier for the positive targets:
318
319 targets * -log(sigmoid(logits)) * pos_weight +
320 (1 - targets) * -log(1 - sigmoid(logits))
321
322 For brevity, let `x = logits`, `z = targets`, `q = pos_weight`.
323 The loss is:
324
325 qz * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
326 = qz * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
327 = qz * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
328 = qz * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
329 = (1 - z) * x + (qz + 1 - z) * log(1 + exp(-x))
330 = (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(-x))
331
332 Setting `l = (1 + (q - 1) * z)`, to ensure stability and avoid overflow,
333 the implementation uses
334
335 (1 - z) * x + l * (log(1 + exp(-abs(x))) + max(-x, 0))
336
337 `logits` and `targets` must have the same type and shape.
338
339 Args:
340 logits: A `Tensor` of type `float32` or `float64`.
341 targets: A `Tensor` of the same type and shape as `logits`.
342 pos_weight: A coefficient to use on the positive examples.
343 name: A name for the operation (optional).
344
345 Returns:
346 A `Tensor` of the same shape as `logits` with the componentwise
347 weighted logistic losses.
348
349 Raises:
350 ValueError: If `logits` and `targets` do not have the same shape.
351 """
352 with ops.op_scope([logits, targets], name, "logistic_loss") as name:
353 logits = ops.convert_to_tensor(logits, name="logits")
354 targets = ops.convert_to_tensor(targets, name="targets")
355 try:
356 targets.get_shape().merge_with(logits.get_shape())
357 except ValueError:
358 raise ValueError(
359 "logits and targets must have the same shape (%s vs %s)"
360 % (logits.get_shape(), targets.get_shape()))
361
362 # The logistic loss formula from above is
363 # (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(-x))
364 # For x < 0, a more numerically stable formula is
365 # (1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(x)) - l * x
366 # To avoid branching, we use the combined version
367 # (1 - z) * x + l * (log(1 + exp(-abs(x))) + max(-x, 0))
368 log_weight = 1 + (pos_weight - 1) * targets
369 return math_ops.add(
370 (1 - targets) * logits,
371 log_weight * (math_ops.log(1 + math_ops.exp(-math_ops.abs(logits))) +
372 nn_ops.relu(-logits)),
373 name=name)
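The same kind of check works for the weighted form. The short NumPy sketch below (again an illustration, with arbitrary example values) shows that pos_weight only scales the loss of positive targets.

import numpy as np

def weighted_logistic_loss(x, z, q):
    # (1 - z) * x + (1 + (q - 1) * z) * (log(1 + exp(-abs(x))) + max(-x, 0))
    l = 1 + (q - 1) * z
    return (1 - z) * x + l * (np.log1p(np.exp(-np.abs(x))) + np.maximum(-x, 0))

x = np.array([2.0, 2.0])
z = np.array([1.0, 0.0])                     # one positive target, one negative target
print(weighted_logistic_loss(x, z, q=1.0))   # same as the unweighted logistic loss
print(weighted_logistic_loss(x, z, q=5.0))   # only the positive example's loss grows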
374
375
376 def relu_layer(x, weights, biases, name=None):
377 """Computes Relu(x * weight + biases).
378
379 Args:
380 x: a 2D tensor. Dimensions typically: batch, in_units
381 weights: a 2D tensor. Dimensions typically: in_units, out_units
382 biases: a 1D tensor. Dimensions: out_units
383 name: A name for the operation (optional). If not specified
384 "nn_relu_layer" is used.
385
386 Returns:
387 A 2-D Tensor computing relu(matmul(x, weights) + biases).
388 Dimensions typically: batch, out_units.
389 """
390 with ops.op_scope([x, weights, biases], name, "relu_layer") as name:
391 x = ops.convert_to_tensor(x, name="x")
392 weights = ops.convert_to_tensor(weights, name="weights")
393 biases = ops.convert_to_tensor(biases, name="biases")
394 xw_plus_b = nn_ops.bias_add(math_ops.matmul(x, weights), biases)
395 return nn_ops.relu(xw_plus_b, name=name)
396
397
398 def l2_normalize(x, dim, epsilon=1e-12, name=None):
399 """Normalizes along dimension `dim` using an L2 norm.
400
401 For a 1-D tensor with `dim = 0`, computes
402
403 output = x / sqrt(max(sum(x**2), epsilon))
404
405 For `x` with more dimensions, independently normalizes each 1-D slice along
406 dimension `dim`.
407
408 Args:
409 x: A `Tensor`.
410 dim: Dimension along which to normalize.
411 epsilon: A lower bound value for the norm. Will use `sqrt(epsilon)` as the
412 divisor if `norm < sqrt(epsilon)`.
413 name: A name for this operation (optional).
414
415 Returns:
416 A `Tensor` with the same shape as `x`.
417 """
418 with ops.op_scope([x], name, "l2_normalize") as name:
419 x = ops.convert_to_tensor(x, name="x")
420 square_sum = math_ops.reduce_sum(math_ops.square(x), [dim], keep_dims=True)
421 x_inv_norm = math_ops.rsqrt(math_ops.maximum(square_sum, epsilon))
422 return math_ops.mul(x, x_inv_norm, name=name)
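A NumPy equivalent of the computation above, shown purely for illustration, makes the epsilon floor easy to see:

import numpy as np

def l2_normalize_np(x, dim, epsilon=1e-12):
    # x * rsqrt(max(sum(x**2, dim), epsilon)), matching the op above.
    square_sum = np.sum(np.square(x), axis=dim, keepdims=True)
    return x / np.sqrt(np.maximum(square_sum, epsilon))

x = np.array([[3.0, 4.0], [0.0, 0.0]])
print(l2_normalize_np(x, dim=1))
# [[0.6 0.8]
#  [0.  0. ]]  -- the all-zero row stays at zero thanks to the epsilon floor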
423
424
425 def zero_fraction(value, name=None):
426 """Returns the fraction of zeros in `value`.
427
428 If `value` is empty, the result is `nan`.
429
430 This is useful in summaries to measure and report sparsity. For example,
431
432 z = tf.nn.relu(...)
433 summ = tf.scalar_summary('sparsity', tf.nn.zero_fraction(z))
434
435 Args:
436 value: A tensor of numeric type.
437 name: A name for the operation (optional).
438
439 Returns:
440 The fraction of zeros in `value`, with type `float32`.
441 """
442 with ops.op_scope([value], name, "zero_fraction"):
443 value = ops.convert_to_tensor(value, name="value")
444 zero = constant_op.constant(0, dtype=value.dtype, name="zero")
445 return math_ops.reduce_mean(math_ops.cast(math_ops.equal(value, zero),
446 dtypes.float32))
447
448
449 def depthwise_conv2d(input, filter, strides, padding, name=None):
450 """Depthwise 2-D convolution.
451
452 Given an input tensor of shape `[batch, in_height, in_width, in_channels]`
453 and a filter tensor of shape
454 `[filter_height, filter_width, in_channels, channel_multiplier]`
455 containing `in_channels` convolutional filters of depth 1, `depthwise_conv2d`
456 applies a different filter to each input channel (expanding from 1 channel
457 to `channel_multiplier` channels for each), then concatenates the results
458 together. The output has `in_channels * channel_multiplier` channels.
459
460 In detail,
461
462 output[b, i, j, k * channel_multiplier + q] =
463 sum_{di, dj} input[b, strides[1] * i + di, strides[2] * j + dj, k] *
464 filter[di, dj, k, q]
465
466 Must have `strides[0] = strides[3] = 1`. For the most common case of the
467 same horizontal and vertical strides, `strides = [1, stride, stride, 1]`.
468
469 Args:
470 input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
471 filter: 4-D with shape
472 `[filter_height, filter_width, in_channels, channel_multiplier]`.
473 strides: 1-D of size 4. The stride of the sliding window for each
474 dimension of `input`.
475 padding: A string, either `'VALID'` or `'SAME'`. The padding algorithm.
476 name: A name for this operation (optional).
477
478 Returns:
479 A 4-D `Tensor` of shape
480 `[batch, out_height, out_width, in_channels * channel_multiplier].`
481 """
482 with ops.op_scope([input, filter], name, "depthwise") as name:
483 input = ops.convert_to_tensor(input, name="tensor_in")
484 filter = ops.convert_to_tensor(filter, name="filter_in")
485 # A shape is required to statically compute the number of separable filters.
486 if filter.get_shape().ndims is not None:
487 assert len(filter.get_shape()) == 4
488 in_channels = filter.get_shape()[2]
489 # Sanity checks, if shape information is available for the inputs.
490 if input.get_shape().ndims is not None:
491 assert len(input.get_shape()) == 4
492 assert input.get_shape()[3] == in_channels, (
493 "Mismatched input depth %d and number of depthwise filters %d." % (
494 input.get_shape()[3].value, in_channels))
495 else:
496 assert input.get_shape().ndims is not None, (
497 "Either tensor must provide static shape information.")
498 assert input.get_shape().ndims == 4
499 in_channels = input.get_shape()[3]
500
501 if in_channels == 1:
502 return nn_ops.conv2d(input, filter, strides, padding, name=name)
503 else:
504 # Create one separate convolution per channel.
505 convs = []
506 for channel in xrange(in_channels):
507 with ops.name_scope("depth%d" % channel) as channel_scope:
508 t_in = array_ops.slice(input, [0, 0, 0, channel], [-1, -1, -1, 1],
509 name="slice_inputs")
510 f_in = array_ops.slice(filter, [0, 0, channel, 0], [-1, -1, 1, -1],
511 name="slice_params")
512 convs.append(nn_ops.conv2d(t_in, f_in,
513 strides, padding, name=channel_scope))
514 # Concatenate the per-channel convolutions along the channel dimension.
515 return array_ops.concat(3, convs, name=name)
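To make the indexing formula in the docstring concrete, here is a naive NumPy version of a stride-1, VALID depthwise convolution. It is an illustrative sketch only; the function name and the example shapes are made up here, and no attempt is made at efficiency.

import numpy as np

def depthwise_conv2d_naive(inp, filt):
    """Naive VALID, stride-1 depthwise convolution mirroring the formula above.

    inp:  [batch, in_height, in_width, in_channels]
    filt: [filter_height, filter_width, in_channels, channel_multiplier]
    """
    b, ih, iw, ic = inp.shape
    fh, fw, _, cm = filt.shape
    out = np.zeros((b, ih - fh + 1, iw - fw + 1, ic * cm))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = inp[:, i:i + fh, j:j + fw, :]                # [b, fh, fw, ic]
            # output[b, i, j, k*cm + q] = sum_{di,dj} patch[b, di, dj, k] * filt[di, dj, k, q]
            out[:, i, j, :] = np.einsum('bxyk,xykq->bkq', patch, filt).reshape(b, ic * cm)
    return out

inp = np.random.rand(2, 5, 5, 3)     # 3 input channels
filt = np.random.rand(3, 3, 3, 2)    # channel_multiplier = 2
print(depthwise_conv2d_naive(inp, filt).shape)   # (2, 3, 3, 6): 3 * 2 = 6 output channels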
516
517
518 def separable_conv2d(input, depthwise_filter, pointwise_filter, strides,
519 padding,
520 name=None):
521 """2-D convolution with separable filters.
522
523 Performs a depthwise convolution that acts separately on channels followed by
524 a pointwise convolution that mixes channels. Note that this is separability
525 between dimensions `[1, 2]` and `3`, not spatial separability between
526 dimensions `1` and `2`.
527
528 In detail,
529
530 output[b, i, j, k] = sum_{di, dj, q, r}
531 input[b, strides[1] * i + di, strides[2] * j + dj, q] *
532 depthwise_filter[di, dj, q, r] *
533 pointwise_filter[0, 0, q * channel_multiplier + r, k]
534
535 `strides` controls the strides for the depthwise convolution only, since
536 the pointwise convolution has implicit strides of `[1, 1, 1, 1]`. Must have
537 `strides[0] = strides[3] = 1`. For the most common case of the same
538 horizontal and vertical strides, `strides = [1, stride, stride, 1]`.
539
540 Args:
541 input: 4-D `Tensor` with shape `[batch, in_height, in_width, in_channels]`.
542 depthwise_filter: 4-D `Tensor` with shape
543 `[filter_height, filter_width, in_channels, channel_multiplier]`.
544 Contains `in_channels` convolutional filters of depth 1.
545 pointwise_filter: 4-D `Tensor` with shape
546 `[1, 1, channel_multiplier * in_channels, out_channels]`. Pointwise
547 filter to mix channels after `depthwise_filter` has convolved spatially.
548 strides: 1-D of size 4. The strides for the depthwise convolution for
549 each dimension of `input`.
550 padding: A string, either `'VALID'` or `'SAME'`. The padding algorithm.
551 name: A name for this operation (optional).
552
553 Returns:
554 A 4-D `Tensor` of shape `[batch, out_height, out_width, out_channels]`.
555 """
556 with ops.op_scope([input, depthwise_filter, pointwise_filter],
557 name, "separable_conv2d") as name:
558 input = ops.convert_to_tensor(input, name="tensor_in")
559 depthwise_filter = ops.convert_to_tensor(depthwise_filter,
560 name="depthwise_filter")
561 pointwise_filter = ops.convert_to_tensor(pointwise_filter,
562 name="pointwise_filter")
563
564 if pointwise_filter.get_shape().ndims is not None:
565 assert len(pointwise_filter.get_shape()) == 4
566 assert pointwise_filter.get_shape()[0] == 1
567 assert pointwise_filter.get_shape()[1] == 1
568 if depthwise_filter.get_shape().ndims and input.get_shape().ndims:
569 channel_multiplier = depthwise_filter.get_shape()[3]
570 in_channels = input.get_shape()[3]
571 out_channels = pointwise_filter.get_shape()[3]
572 # This would mean the separable convolution is over-parametrized.
573 assert channel_multiplier * in_channels < out_channels
574 # The layout of the ops in the graph are expected to be as follows:
575 # separable_conv2d // Conv2D op corresponding to the pointwise conv.
576 # separable_conv2d/depthwise // Concat op for the depthwise outputs.
577 # separable_conv2d/depthwise/depth0 // Conv2D op for depth 0
578 # separable_conv2d/depthwise/depth1 // Conv2D op for depth 1
579 # separable_conv2d/depthwise/depth2 // Conv2D op for depth 2
580 depthwise = depthwise_conv2d(input, depthwise_filter, strides,
581 padding, name="depthwise")
582 return nn_ops.conv2d(depthwise, pointwise_filter, [1, 1, 1, 1],
583 padding="VALID", name=name)
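A rough parameter count shows why this factorization is attractive. The numbers below are arbitrary example sizes, not anything from the TensorFlow source:

# Rough parameter-count comparison (illustrative arithmetic only).
filter_h = filter_w = 3
in_channels, channel_multiplier, out_channels = 32, 1, 64

full_conv_params = filter_h * filter_w * in_channels * out_channels         # 18432
depthwise_params = filter_h * filter_w * in_channels * channel_multiplier   # 288
pointwise_params = 1 * 1 * in_channels * channel_multiplier * out_channels  # 2048
separable_params = depthwise_params + pointwise_params                      # 2336

print(full_conv_params, separable_params)  # the separable factorization is roughly 8x smaller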
584
585
586 def sufficient_statistics(x, axes, shift=True, keep_dims=False, name=None):
587 """Calculate the sufficient statistics for the mean and variance of `x`.
588
589 These sufficient statistics are computed using the one pass algorithm on
590 an input that's optionally shifted using the value of the 1st element in `x`.
591 See:
592 https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data
593
594 Args:
595 x: A `Tensor`.
596 axes: Array of ints. Axes along which to compute mean and variance.
597 shift: If true, shift the data to provide more numerically stable results.
598 keep_dims: produce statistics with the same dimensionality as the input.
599 name: Name used to scope the operations that compute the sufficient stats.
600
601 Returns:
602 Four `Tensor` objects of the same type as `x`:
603 * the count (number of elements to average over).
604 * the (possibly shifted) sum of the elements in the array.
605 * the (possibly shifted) sum of squares of the elements in the array.
606 * the shift by which the mean must be corrected or None if `shift` is False.
607 """
608 with ops.op_scope([x, axes], name, "sufficient_statistics"):
609 x = ops.convert_to_tensor(x, name="x")
610 x_shape = x.get_shape()
611 if x_shape.is_fully_defined():
612 counts = 1
613 m_shape = []
614 for d in xrange(x_shape.ndims):
615 dim = x_shape[d].value
616 if d in set(axes):
617 counts *= dim
618 dim = 1
619 m_shape.append(dim)
620 counts = constant_op.constant(counts, dtype=x.dtype)
621 else: # shape needs to be inferred at runtime.
622 x_shape = array_ops.shape(x)
623 select_axes = sparse_ops.sparse_to_dense(axes, array_ops.shape(x_shape),
624 True, False)
625 m_shape = math_ops.select(select_axes, array_ops.ones_like(x_shape),
626 x_shape)
627 counts = math_ops.cast(
628 math_ops.reduce_prod(x_shape / m_shape),
629 x.dtype,
630 name="count")
631 if shift:
632 shift_value = array_ops.slice(x, array_ops.zeros_like(m_shape), m_shape)
633 m_ss = math_ops.sub(x, shift_value)
634 v_ss = math_ops.squared_difference(x, shift_value)
635 if keep_dims:
636 shift_value = array_ops.identity(shift_value, name="shift")
637 else:
638 shift_value = array_ops.squeeze(shift_value,
639 squeeze_dims=axes,
640 name="shift")
641 else: # not shift.
642 m_ss = x
643 v_ss = math_ops.square(x)
644 shift_value = None
645 m_ss = math_ops.reduce_sum(m_ss, axes, keep_dims=keep_dims, name="mean_ss")
646 v_ss = math_ops.reduce_sum(v_ss, axes, keep_dims=keep_dims, name="var_ss")
647 return counts, m_ss, v_ss, shift_value
648
649
650 def normalize_moments(counts, mean_ss, variance_ss, shift, name=None):
651 """Calculate the mean and variance of based on the sufficient statistics.
652
653 Args:
654 counts: A `Tensor` containing the total count of the data (one value).
655 mean_ss: A `Tensor` containing the mean sufficient statistics: the (possibly
656 shifted) sum of the elements to average over.
657 variance_ss: A `Tensor` containing the variance sufficient statistics: the
658 (possibly shifted) squared sum of the data to compute the variance over.
659 shift: A `Tensor` containing the value by which the data is shifted for
660 numerical stability, or `None` if no shift was performed.
661 name: Name used to scope the operations that compute the moments.
662
663 Returns:
664 Two `Tensor` objects: `mean` and `variance`.
665 """
666 with ops.op_scope([counts, mean_ss, variance_ss, shift], name, "normalize"):
667 divisor = math_ops.inv(counts, name="divisor")
668 if shift is not None:
669 shifted_mean = math_ops.mul(mean_ss, divisor, name="shifted_mean")
670 mean = math_ops.add(shifted_mean, shift, name="mean")
671 else: # no shift.
672 shifted_mean = math_ops.mul(mean_ss, divisor, name="mean")
673 mean = shifted_mean
674 variance = math_ops.sub(
675 math_ops.mul(variance_ss, divisor),
676 math_ops.square(shifted_mean),
677 name="variance")
678 return (mean, variance)
679
680
681 def moments(x, axes, name=None, keep_dims=False):
682 """Calculate the mean and variance of `x`.
683
684 The mean and variance are calculated by aggregating the contents of `x`
685 across `axes`. If `x` is 1-D and `axes = [0]` this is just the mean
686 and variance of a vector.
687
688 When using these moments for batch normalization (see
689 `tf.nn.batch_normalization`):
690 * for so-called "global normalization", used with convolutional filters with
691 shape `[batch, height, width, depth]`, pass `axes=[0, 1, 2]`.
692 * for simple batch normalization pass `axes=[0]` (batch only).
693
694 Args:
695 x: A `Tensor`.
696 axes: array of ints. Axes along which to compute mean and
697 variance.
698 keep_dims: produce moments with the same dimensionality as the input.
699 name: Name used to scope the operations that compute the moments.
700
701 Returns:
702 Two `Tensor` objects: `mean` and `variance`.
703 """
704 with ops.op_scope([x, axes], name, "moments"):
705 counts, m_ss, v_ss, shift = sufficient_statistics(x,
706 axes,
707 keep_dims=keep_dims,
708 name=name)
709 return normalize_moments(counts, m_ss, v_ss, shift, name=name)
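The shifted one-pass computation used by sufficient_statistics and normalize_moments can be checked against NumPy directly. The sketch below uses made-up data and mirrors the arithmetic of the two functions above rather than calling them:

import numpy as np

x = np.random.rand(4, 5).astype(np.float64)
axes = (0,)                    # compute moments over the batch axis
shift = x[:1, :]               # shift by the first element along the reduced axes

counts = x.shape[0]
mean_ss = np.sum(x - shift, axis=axes)             # (possibly shifted) sum
var_ss = np.sum(np.square(x - shift), axis=axes)   # (possibly shifted) sum of squares

shifted_mean = mean_ss / counts
mean = shifted_mean + shift.squeeze(0)             # undo the shift
variance = var_ss / counts - np.square(shifted_mean)

print(np.allclose(mean, np.mean(x, axis=0)))       # True
print(np.allclose(variance, np.var(x, axis=0)))    # True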
710
711
712 def batch_normalization(x,
713 mean,
714 variance,
715 offset,
716 scale,
717 variance_epsilon,
718 name=None):
719 """Batch normalization.
720
721 As described in http://arxiv.org/abs/1502.03167.
722 Normalizes a tensor by `mean` and `variance`, and applies (optionally) a
723 `scale` \\\\(\gamma\\\\) to it, as well as an `offset` \\\\(\\beta\\\\):
724
725 \\\\(\\frac{\gamma(x-\mu)}{\sigma}+\\beta\\\\)
726
727 `mean`, `variance`, `offset` and `scale` are all expected to be of one of two
728 shapes:
729 * In all generality, they can have the same number of dimensions as the
730 input `x`, with identical sizes as `x` for the dimensions that are not
731 normalized over (the 'depth' dimension(s)), and dimension 1 for the
732 others which are being normalized over.
733 `mean` and `variance` in this case would typically be the outputs of
734 `tf.nn.moments(..., keep_dims=True)` during training, or running averages
735 thereof during inference.
736 * In the common case where the 'depth' dimension is the last dimension in
737 the input tensor `x`, they may be one dimensional tensors of the same
738 size as the 'depth' dimension.
739 This is the case for example for the common `[batch, depth]` layout of
740 fully-connected layers, and `[batch, height, width, depth]` for
741 convolutions.
742 `mean` and `variance` in this case would typically be the outputs of
743 `tf.nn.moments(..., keep_dims=False)` during training, or running averages
744 thereof during inference.
745
746 Args:
747 x: Input `Tensor` of arbitrary dimensionality.
748 mean: A mean `Tensor`.
749 variance: A variance `Tensor`.
750 offset: An offset `Tensor`, often denoted \\\\(\\beta\\\\) in equations, or
751 None. If present, will be added to the normalized tensor.
752 scale: A scale `Tensor`, often denoted \\\\(\gamma\\\\) in equations, or
753 `None`. If present, the scale is applied to the normalized tensor.
754 variance_epsilon: A small float number to avoid dividing by 0.
755 name: A name for this operation (optional).
756
757 Returns:
758 the normalized, scaled, offset tensor.
759 """
760 with ops.op_scope([x, mean, variance, scale, offset], name, "batchnorm"):
761 inv = math_ops.rsqrt(variance + variance_epsilon)
762 if scale is not None:
763 inv *= scale
764 return x * inv + (
765 offset - mean * inv if offset is not None else -mean * inv)
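The arithmetic above is just the textbook batch-normalization formula rearranged around a single rsqrt. A small NumPy illustration, with arbitrary example shapes and constants:

import numpy as np

x = np.random.rand(8, 4)                        # [batch, depth]
mean, variance = x.mean(axis=0), x.var(axis=0)  # roughly what tf.nn.moments(x, [0]) computes
offset, scale, eps = 0.1, 2.0, 1e-3

# Same arithmetic as the function above: inv = rsqrt(var + eps), optionally scaled.
inv = 1.0 / np.sqrt(variance + eps)
inv *= scale
y = x * inv + (offset - mean * inv)

# Equivalent textbook form: gamma * (x - mean) / sqrt(var + eps) + beta.
y_ref = scale * (x - mean) / np.sqrt(variance + eps) + offset
print(np.allclose(y, y_ref))   # True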
766
767
768 def batch_norm_with_global_normalization(t,
769 m,
770 v,
771 beta,
772 gamma,
773 variance_epsilon,
774 scale_after_normalization,
775 name=None):
776 """Batch normalization.
777
778 This op is deprecated. See `tf.nn.batch_normalization`.
779
780 Args:
781 t: A 4D input Tensor.
782 m: A 1D mean Tensor with size matching the last dimension of t.
783 This is the first output from tf.nn.moments,
784 or a saved moving average thereof.
785 v: A 1D variance Tensor with size matching the last dimension of t.
786 This is the second output from tf.nn.moments,
787 or a saved moving average thereof.
788 beta: A 1D beta Tensor with size matching the last dimension of t.
789 An offset to be added to the normalized tensor.
790 gamma: A 1D gamma Tensor with size matching the last dimension of t.
791 If "scale_after_normalization" is true, this tensor will be multiplied
792 with the normalized tensor.
793 variance_epsilon: A small float number to avoid dividing by 0.
794 scale_after_normalization: A bool indicating whether the resulted tensor
795 needs to be multiplied with gamma.
796 name: A name for this operation (optional).
797
798 Returns:
799 A batch-normalized `t`.
800 """
801 return batch_normalization(t, m, v, beta, gamma if scale_after_normalization
802 else None, variance_epsilon, name)
803
804
805 def _sum_rows(x):
806 """Returns a vector summing up each row of the matrix x."""
807 # _sum_rows(x) is equivalent to math_ops.reduce_sum(x, 1) when x is
808 # a matrix. The gradient of _sum_rows(x) is more efficient than
809 # reduce_sum(x, 1)'s gradient in today's implementation. Therefore,
810 # we use _sum_rows(x) in the nce_loss() computation since the loss
811 # is mostly used for training.
812 cols = array_ops.shape(x)[1]
813 ones_shape = array_ops.pack([cols, 1])
814 ones = array_ops.ones(ones_shape, x.dtype)
815 return array_ops.reshape(math_ops.matmul(x, ones), [-1])
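The matmul-with-ones trick in _sum_rows is easy to verify with NumPy (illustrative only):

import numpy as np

x = np.arange(6.0).reshape(2, 3)
ones = np.ones((x.shape[1], 1))
print((x @ ones).reshape(-1))   # [ 3. 12.] -- same as summing each row
print(x.sum(axis=1))            # [ 3. 12.]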
816
817
818 def _compute_sampled_logits(weights, biases, inputs, labels, num_sampled,
819 num_classes, num_true=1,
820 sampled_values=None,
821 subtract_log_q=True,
822 remove_accidental_hits=False,
823 partition_strategy="mod",
824 name=None):
825 """Helper function for nce_loss and sampled_softmax_loss functions.
826
827 Computes sampled output training logits and labels suitable for implementing
828 e.g. noise-contrastive estimation (see nce_loss) or sampled softmax (see
829 sampled_softmax_loss).
830
831 Note: In the case where num_true > 1, we assign to each target class
832 the target probability 1 / num_true so that the target probabilities
833 sum to 1 per-example.
834
835 Args:
836 weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
837 objects whose concatenation along dimension 0 has shape
838 `[num_classes, dim]`. The (possibly-partitioned) class embeddings.
839 biases: A `Tensor` of shape `[num_classes]`. The class biases.
840 inputs: A `Tensor` of shape `[batch_size, dim]`. The forward
841 activations of the input network.
842 labels: A `Tensor` of type `int64` and shape `[batch_size,
843 num_true]`. The target classes. Note that this format differs from
844 the `labels` argument of `nn.softmax_cross_entropy_with_logits`.
845 num_sampled: An `int`. The number of classes to randomly sample per batch.
846 num_classes: An `int`. The number of possible classes.
847 num_true: An `int`. The number of target classes per training example.
848 sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
849 `sampled_expected_count`) returned by a `*_candidate_sampler` function.
850 (if None, we default to `log_uniform_candidate_sampler`)
851 subtract_log_q: A `bool`. whether to subtract the log expected count of
852 the labels in the sample to get the logits of the true labels.
853 Default is True. Turn off for Negative Sampling.
854 remove_accidental_hits: A `bool`. whether to remove "accidental hits"
855 where a sampled class equals one of the target classes. Default is
856 False.
857 partition_strategy: A string specifying the partitioning strategy, relevant
858 if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
859 Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
860 name: A name for the operation (optional).
861 Returns:
862 out_logits, out_labels: `Tensor` objects each with shape
863 `[batch_size, num_true + num_sampled]`, for passing to either
864 `nn.sigmoid_cross_entropy_with_logits` (NCE) or
865 `nn.softmax_cross_entropy_with_logits` (sampled softmax).
866 """
867
868 if not isinstance(weights, list):
869 weights = [weights]
870
871 with ops.op_scope(
872 weights + [biases, inputs, labels], name, "compute_sampled_logits"):
873 if labels.dtype != dtypes.int64:
874 labels = math_ops.cast(labels, dtypes.int64)
875 labels_flat = array_ops.reshape(labels, [-1])
876
877 # Sample the negative labels.
878 # sampled shape: [num_sampled] tensor
879 # true_expected_count shape = [batch_size, 1] tensor
880 # sampled_expected_count shape = [num_sampled] tensor
881 if sampled_values is None:
882 sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
883 true_classes=labels,
884 num_true=num_true,
885 num_sampled=num_sampled,
886 unique=True,
887 range_max=num_classes)
888 # NOTE: pylint cannot tell that 'sampled_values' is a sequence
889 # pylint: disable=unpacking-non-sequence
890 sampled, true_expected_count, sampled_expected_count = sampled_values
891 # pylint: enable=unpacking-non-sequence
892
893 # labels_flat is a [batch_size * num_true] tensor
894 # sampled is a [num_sampled] int tensor
895 all_ids = array_ops.concat(0, [labels_flat, sampled])
896
897 # weights shape is [num_classes, dim]
898 all_w = embedding_ops.embedding_lookup(
899 weights, all_ids, partition_strategy=partition_strategy)
900 all_b = embedding_ops.embedding_lookup(biases, all_ids)
901 # true_w shape is [batch_size * num_true, dim]
902 # true_b is a [batch_size * num_true] tensor
903 true_w = array_ops.slice(
904 all_w, [0, 0], array_ops.pack([array_ops.shape(labels_flat)[0], -1]))
905 true_b = array_ops.slice(all_b, [0], array_ops.shape(labels_flat))
906
907 # inputs shape is [batch_size, dim]
908 # true_w shape is [batch_size * num_true, dim]
909 # row_wise_dots is [batch_size, num_true, dim]
910 dim = array_ops.shape(true_w)[1:2]
911 new_true_w_shape = array_ops.concat(0, [[-1, num_true], dim])
912 row_wise_dots = math_ops.mul(
913 array_ops.expand_dims(inputs, 1),
914 array_ops.reshape(true_w, new_true_w_shape))
915 # We want the row-wise dot plus biases which yields a
916 # [batch_size, num_true] tensor of true_logits.
917 dots_as_matrix = array_ops.reshape(row_wise_dots,
918 array_ops.concat(0, [[-1], dim]))
919 true_logits = array_ops.reshape(_sum_rows(dots_as_matrix), [-1, num_true])
920 true_b = array_ops.reshape(true_b, [-1, num_true])
921 true_logits += true_b
922
923 # Lookup weights and biases for sampled labels.
924 # sampled_w shape is [num_sampled, dim]
925 # sampled_b is a [num_sampled] float tensor
926 sampled_w = array_ops.slice(
927 all_w, array_ops.pack([array_ops.shape(labels_flat)[0], 0]), [-1, -1])
928 sampled_b = array_ops.slice(all_b, array_ops.shape(labels_flat), [-1])
929
930 # inputs has shape [batch_size, dim]
931 # sampled_w has shape [num_sampled, dim]
932 # sampled_b has shape [num_sampled]
933 # Apply X*W'+B, which yields [batch_size, num_sampled]
934 sampled_logits = math_ops.matmul(inputs,
935 sampled_w,
936 transpose_b=True) + sampled_b
937
938 if remove_accidental_hits:
939 acc_hits = candidate_sampling_ops.compute_accidental_hits(
940 labels, sampled, num_true=num_true)
941 acc_indices, acc_ids, acc_weights = acc_hits
942
943 # This is how SparseToDense expects the indices.
944 acc_indices_2d = array_ops.reshape(acc_indices, [-1, 1])
945 acc_ids_2d_int32 = array_ops.reshape(math_ops.cast(
946 acc_ids, dtypes.int32), [-1, 1])
947 sparse_indices = array_ops.concat(
948 1, [acc_indices_2d, acc_ids_2d_int32], "sparse_indices")
949 # Create sampled_logits_shape = [batch_size, num_sampled]
950 sampled_logits_shape = array_ops.concat(
951 0,
952 [array_ops.shape(labels)[:1], array_ops.expand_dims(num_sampled, 0)])
953 if sampled_logits.dtype != acc_weights.dtype:
954 acc_weights = math_ops.cast(acc_weights, sampled_logits.dtype)
955 sampled_logits += sparse_ops.sparse_to_dense(
956 sparse_indices, sampled_logits_shape, acc_weights,
957 default_value=0.0, validate_indices=False)
958
959 if subtract_log_q:
960 # Subtract log of Q(l), prior probability that l appears in sampled.
961 true_logits -= math_ops.log(true_expected_count)
962 sampled_logits -= math_ops.log(sampled_expected_count)
963
964 # Construct output logits and labels. The true labels/logits start at col 0.
965 out_logits = array_ops.concat(1, [true_logits, sampled_logits])
966 # true_logits is a float tensor, ones_like(true_logits) is a float tensor
967 # of ones. We then divide by num_true to ensure the per-example labels sum
968 # to 1.0, i.e. form a proper probability distribution.
969 out_labels = array_ops.concat(
970 1, [array_ops.ones_like(true_logits) / num_true,
971 array_ops.zeros_like(sampled_logits)])
972
973 return out_logits, out_labels
974
975
976 def nce_loss(weights, biases, inputs, labels, num_sampled, num_classes,
977 num_true=1,
978 sampled_values=None,
979 remove_accidental_hits=False,
980 partition_strategy="mod",
981 name="nce_loss"):
982 """Computes and returns the noise-contrastive estimation training loss.
983
984 See [Noise-contrastive estimation: A new estimation principle for
985 unnormalized statistical models]
986 (http://www.jmlr.org/proceedings/papers/v9/gutmann10a/gutmann10a.pdf).
987 Also see our [Candidate Sampling Algorithms Reference]
988 (../../extras/candidate_sampling.pdf)
989
990 Note: In the case where `num_true` > 1, we assign to each target class
991 the target probability 1 / `num_true` so that the target probabilities
992 sum to 1 per-example.
993
994 Note: It would be useful to allow a variable number of target classes per
995 example. We hope to provide this functionality in a future release.
996 For now, if you have a variable number of target classes, you can pad them
997 out to a constant number by either repeating them or by padding
998 with an otherwise unused class.
999
1000 Args:
1001 weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
1002 objects whose concatenation along dimension 0 has shape
1003 [num_classes, dim]. The (possibly-partitioned) class embeddings.
1004 biases: A `Tensor` of shape `[num_classes]`. The class biases.
1005 inputs: A `Tensor` of shape `[batch_size, dim]`. The forward
1006 activations of the input network.
1007 labels: A `Tensor` of type `int64` and shape `[batch_size,
1008 num_true]`. The target classes.
1009 num_sampled: An `int`. The number of classes to randomly sample per batch.
1010 num_classes: An `int`. The number of possible classes.
1011 num_true: An `int`. The number of target classes per training example.
1012 sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
1013 `sampled_expected_count`) returned by a `*_candidate_sampler` function.
1014 (if None, we default to `log_uniform_candidate_sampler`)
1015 remove_accidental_hits: A `bool`. Whether to remove "accidental hits"
1016 where a sampled class equals one of the target classes. If set to
1017 `True`, this is a "Sampled Logistic" loss instead of NCE, and we are
1018 learning to generate log-odds instead of log probabilities. See
1019 our [Candidate Sampling Algorithms Reference]
1020 (../../extras/candidate_sampling.pdf).
1021 Default is False.
1022 partition_strategy: A string specifying the partitioning strategy, relevant
1023 if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
1024 Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
1025 name: A name for the operation (optional).
1026
1027 Returns:
1028 A `batch_size` 1-D tensor of per-example NCE losses.
1029 """
1030 logits, labels = _compute_sampled_logits(
1031 weights, biases, inputs, labels, num_sampled, num_classes,
1032 num_true=num_true,
1033 sampled_values=sampled_values,
1034 subtract_log_q=True,
1035 remove_accidental_hits=remove_accidental_hits,
1036 partition_strategy=partition_strategy,
1037 name=name)
1038 sampled_losses = sigmoid_cross_entropy_with_logits(logits,
1039 labels,
1040 name="sampled_losses")
1041 # sampled_losses is batch_size x {true_loss, sampled_losses...}
1042 # We sum out true and sampled losses.
1043 return _sum_rows(sampled_losses)
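For context, a typical use of nce_loss is the word2vec-style setup sketched below. This sketch assumes the graph-mode API of the TensorFlow version this file comes from (it will not run unmodified on TensorFlow 2.x), and all variable names and sizes are illustrative rather than taken from the source.

import tensorflow as tf  # the 0.x-era graph API that this file belongs to

vocab_size, embed_dim, num_sampled, batch_size = 10000, 128, 64, 32

# Hypothetical word2vec-style model; names and initializers are illustrative.
embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embed_dim], stddev=0.05))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

center_words = tf.placeholder(tf.int32, shape=[batch_size])
context_words = tf.placeholder(tf.int64, shape=[batch_size, 1])   # [batch_size, num_true]

inputs = tf.nn.embedding_lookup(embeddings, center_words)          # [batch_size, embed_dim]
loss = tf.reduce_mean(
    tf.nn.nce_loss(nce_weights, nce_biases, inputs, context_words,
                   num_sampled=num_sampled, num_classes=vocab_size))
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)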
1044
1045
1046 def sampled_softmax_loss(weights, biases, inputs, labels, num_sampled,
1047 num_classes, num_true=1,
1048 sampled_values=None,
1049 remove_accidental_hits=True,
1050 partition_strategy="mod",
1051 name="sampled_softmax_loss"):
1052 """Computes and returns the sampled softmax training loss.
1053
1054 This is a faster way to train a softmax classifier over a huge number of
1055 classes.
1056
1057 This operation is for training only. It is generally an underestimate of
1058 the full softmax loss.
1059
1060 At inference time, you can compute full softmax probabilities with the
1061 expression `tf.nn.softmax(tf.matmul(inputs, weights) + biases)`.
1062
1063 See our [Candidate Sampling Algorithms Reference]
1064 (../../extras/candidate_sampling.pdf)
1065
1066 Also see Section 3 of [Jean et al., 2014](http://arxiv.org/abs/1412.2007)
1067 ([pdf](http://arxiv.org/pdf/1412.2007.pdf)) for the math.
1068
1069 Args:
1070 weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
1071 objects whose concatenation along dimension 0 has shape
1072 [num_classes, dim]. The (possibly-sharded) class embeddings.
1073 biases: A `Tensor` of shape `[num_classes]`. The class biases.
1074 inputs: A `Tensor` of shape `[batch_size, dim]`. The forward
1075 activations of the input network.
1076 labels: A `Tensor` of type `int64` and shape `[batch_size,
1077 num_true]`. The target classes. Note that this format differs from
1078 the `labels` argument of `nn.softmax_cross_entropy_with_logits`.
1079 num_sampled: An `int`. The number of classes to randomly sample per batch.
1080 num_classes: An `int`. The number of possible classes.
1081 num_true: An `int`. The number of target classes per training example.
1082 sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
1083 `sampled_expected_count`) returned by a `*_candidate_sampler` function.
1084 (if None, we default to `log_uniform_candidate_sampler`)
1085 remove_accidental_hits: A `bool`. whether to remove "accidental hits"
1086 where a sampled class equals one of the target classes. Default is
1087 True.
1088 partition_strategy: A string specifying the partitioning strategy, relevant
1089 if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
1090 Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
1091 name: A name for the operation (optional).
1092
1093 Returns:
1094 A `batch_size` 1-D tensor of per-example sampled softmax losses.
"""
logits, labels = _compute_sampled_logits(
weights, biases, inputs, labels, num_sampled, num_classes,
num_true=num_true,
sampled_values=sampled_values,
subtract_log_q=True,
remove_accidental_hits=remove_accidental_hits,
partition_strategy=partition_strategy,
name=name)
sampled_losses = nn_ops.softmax_cross_entropy_with_logits(logits, labels)
# sampled_losses is a [batch_size] tensor.
return sampled_losses


# TODO(cwhipkey): sigmoid and tanh should not be exposed from tf.nn.
__all__ = make_all(__name__)
__all__.append("zero_fraction") # documented in training.py
# Modules whitelisted for reference through tf.nn.
# TODO(cwhipkey): migrate callers to use the submodule directly.
__all__.extend(["nn_ops", "rnn_cell", "seq2seq"])
# Symbols whitelisted for export without documentation.
# TODO(cwhipkey): review these and move to contrib or expose through
# documentation.
__all__.extend([
"all_candidate_sampler",
"batch_norm_with_global_normalization",
"batch_normalization",
"bidirectional_rnn",
"conv2d_backprop_filter",
"conv2d_backprop_input",
"depthwise_conv2d_native",
"dynamic_rnn",
"lrn",
"relu_layer",
"rnn",
"state_saving_rnn",
"xw_plus_b",
])