I\'m working on creating a 2 layer neural network with back-propagation. The NN is supposed to get its data from a 20001x17 vector that holds following information in each row:<
You can think of y2 as an output probability distribution for each input being one of the 26 alphabet characters, for example if one column of y2 says:
.2
.5
.15
.15
then its 50% probability that this character is B (if we assume only 4 possible outputs).
==REMARK==
The output layer of the NN consists of 26 outputs. Every time the NN is fed an input like the one described above it's supposed to output a 1x26 vector containing zeros in all but the one cell that corresponds to the letter that the input values were meant to represent. for example the output [1 0 0 ... 0] would be letter A, whereas [0 0 0 ... 1] would be the letter Z.
It is preferable to avoid using target values of 0,1 to encode the output of the network.
The reason for avoiding target values of 0 and 1 is that 'logsig' sigmoid transfer function cannot produce these output values given finite weights. If you attempt to train the network to fit target values of exactly 0 and 1, gradient descent will force the weights to grow without bound.
So instead of 0 and 1 values, try using values of 0.04 and 0.9 for example, so that [0.9,0.04,...,0.04] is the target output vector for the letter A.
Reference:
Thomas M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997, p114-115
hardlin fcn in output layer.
trainlm or trainrp for training the network.mapminmax for pre-processing data set.I don't know if this constitutes an actual answer or not: but here are some remarks.
See David Mackay's textbook for a nice intro to neural nets which will make clear the probabilistic connection. Take a look at this paper from Geoff Hinton's group which describes the task of predicting the next character given the context for details on the correct representation and activation/activity functions (although beware their method is non-trivial and uses a recurrent net with a different training method).
This is normal. Your output layer is using a log-sigmoid transfer function, and that will always give you some intermediate output between 0 and 1.
What you would usually do would be to look for the output with the largest value -- in other words, the most likely character.
This would mean that, for every column in y2, you're looking for the index of the row that contains the largest value in that row. You can compute this as follows:
[dummy, I]=max(y2);
I is then a vector containing the indexes of the largest value in each row.