Variable-length tensors in Theano

Posted by 走远了吗 on 2019-12-11 01:04:52

Question


This question refers to best practices in Theano. Here is what I am trying to do:

I am building a neural network for an SMT system. In this context, I conceptually represent sentences as variable-length lists of words, and words as fixed-length lists of integers. Ideally, I would like to represent my corpus as a 3D tensor (first dimension = sentences in corpus, second dimension = words in sentence, third dimension = integer features in words). The difficulty is that sentences have variable length and, to my knowledge, tensors in Theano have the strict requirement that all lengths in one dimension must be the same.

Solutions I have thought of include:

  1. Use padding with dummy words so that sentences become equally sized. But this means that whenever I iterate over a sentence, I need to include special code to discard the padding (see the sketch after this list).
  2. Represent the corpus as a vector of matrices. However, this makes it hard to work with certain functions. For instance, if I want to add up the representations of all the words in a sentence, I can't simply use *corpus.sum(axis=1)*. I would have to loop over sentences, do *sentence.sum(axis=0)*, and then gather the results into another tensor.
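For concreteness, here is a minimal sketch of the padding approach from option 1: building the padded 3D array in NumPy along with a mask that marks which word positions are real. The corpus contents, shapes, and feature size are made up purely for illustration:

```python
import numpy as np

# Toy corpus: 2 sentences of different lengths, each word = 3 integer features
corpus = [
    [[1, 4, 7], [2, 5, 8]],              # sentence with 2 words
    [[3, 6, 9], [1, 1, 1], [2, 2, 2]],   # sentence with 3 words
]

n_sentences = len(corpus)
max_len = max(len(s) for s in corpus)
n_features = len(corpus[0][0])

# Padded 3D array (sentences x words x features) plus a mask of real positions
padded = np.zeros((n_sentences, max_len, n_features), dtype='int64')
mask = np.zeros((n_sentences, max_len), dtype='float32')
for i, sentence in enumerate(corpus):
    padded[i, :len(sentence)] = sentence
    mask[i, :len(sentence)] = 1.0
```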

My question is: which of these alternatives are preferred, or is there a better one?


Answer 1:


The first option is probably the best in most cases. It's what I do, though it does mean passing around a separate vector of sentence lengths and masking certain results to eliminate the padding region when needed.

In general, if you want to perform a consistent operation to all sentences then you'll usually get much better speed applying that operation to a single 3D tensor than sequentially to a series of matrices. This is especially true for operations running on a GPU.
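For example, the per-sentence sum mentioned in the question stays a single symbolic expression on the padded 3D tensor if the padding is masked out. This is only a sketch, and the variable names and shapes are illustrative:

```python
import theano
import theano.tensor as T

corpus = T.tensor3('corpus')   # (sentences, words, features)
mask = T.matrix('mask')        # (sentences, words): 1 for real words, 0 for padding

# Zero out padded word positions, then sum over the word dimension in one shot
sentence_sums = (corpus * mask.dimshuffle(0, 1, 'x')).sum(axis=1)

f = theano.function([corpus, mask], sentence_sums)
```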

If you're using scan operations the speed differences become even more pronounced. You'll be better off scanning over a 3D tensor and, in your step function, operating on a per-word-position matrix that covers all (or a minibatch of) the sentences. Inside that step function, you may need to know which rows of the matrix are real data and which are padding. As an aside, I find that making the first dimension of a 3D tensor the temporal/sequence position dimension helps when using scan, which always scans over the first dimension.
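Here is a rough sketch of that layout: scanning over a time-major (words × sentences × features) tensor, with the mask applied in the step function so that padded positions leave the state unchanged. The tiny tanh update and the parameter shapes are placeholders, not a real SMT model:

```python
import numpy as np
import theano
import theano.tensor as T

# Time-major layout: (words, sentences, features), so scan iterates over word positions
X = T.tensor3('X')
M = T.matrix('M')   # (words, sentences) mask
W = theano.shared(np.random.randn(3, 4).astype(theano.config.floatX), name='W')

def step(x_t, m_t, h_prev):
    # x_t: (sentences, features) for one word position; m_t: (sentences,) mask slice
    h_t = T.tanh(T.dot(x_t, W))
    # Keep the previous state wherever this position is padding
    return m_t[:, None] * h_t + (1.0 - m_t[:, None]) * h_prev

h0 = T.zeros((X.shape[1], 4))
hs, _ = theano.scan(step, sequences=[X, M], outputs_info=h0)
f = theano.function([X, M], hs)
```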

Often, using zero as your padding value will mean the padding has no impact on your operations.

The other option, looping over the sentences, would mean mixing Theano and Python code, which can make some computations difficult or impossible. For example, getting the gradient of a cost function with respect to some parameters over all (or a batch) of your sentences may not be possible if the data is stored in lots of separate matrices.
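By contrast, if everything lives in one padded tensor, the gradient over all (or a batch of) sentences is still a single T.grad call. A sketch with a throwaway cost, just to show the shape of the computation (the projection, learning rate, and cost are placeholders):

```python
import numpy as np
import theano
import theano.tensor as T

corpus = T.tensor3('corpus')   # (sentences, words, features)
mask = T.matrix('mask')        # (sentences, words)
W = theano.shared(np.ones((3, 5), dtype=theano.config.floatX), name='W')

# Placeholder cost: project each word, mask out padding, reduce over the whole batch
projected = T.dot(corpus, W)                                      # (sentences, words, 5)
cost = ((projected * mask.dimshuffle(0, 1, 'x')) ** 2).sum() / corpus.shape[0]

grad_W = T.grad(cost, W)
train = theano.function([corpus, mask], cost, updates=[(W, W - 0.01 * grad_W)])
```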



Source: https://stackoverflow.com/questions/24205187/variable-length-tensors-in-theano
