I don't know if this constitutes an actual answer, but here are some remarks.
- I don't understand your coding scheme. How is an 'A' represented by that set of numbers? It looks like you're falling into a fairly common trap of using arbitrary numbers to code categorical values. Don't do this: if, for example, 'a' is 1, 'b' is 2 and 'c' is 3, then your coding has implicitly stated that 'a' is more like 'b' than 'c' (because the network has real-valued inputs, the ordinal properties of the coding matter). The proper way to do this is to represent each letter as 26 binary-valued inputs, of which exactly one is ever active (a one-hot encoding; see the first sketch after this list).
- Your outputs are correct: the activations at the output layer will never be exactly 0 or 1, but real numbers. You could take the max as your activity function, but this is problematic because the max is not differentiable, so you can't use back-prop. What you should do instead is couple the outputs with the softmax function, so that they sum to one (see the second sketch after this list). You can then treat the outputs as conditional probabilities given the inputs, if you so desire. While the network is not explicitly probabilistic, with the correct activity and activation functions it will be identical in structure to a log-linear model (possibly with latent variables corresponding to the hidden layer), and people do this all the time.
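To make the first point concrete, here is a minimal sketch of a one-hot encoding (assuming lowercase ASCII letters; the function name is just illustrative):

```python
import numpy as np

def one_hot(letter):
    """Encode a lowercase letter as a 26-dimensional one-hot vector."""
    vec = np.zeros(26)
    vec[ord(letter) - ord('a')] = 1.0  # exactly one unit is active
    return vec

one_hot('c')  # array([0., 0., 1., 0., ..., 0.])
```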
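And for the second point, the softmax is just a normalised exponential over the raw output activations (a minimal NumPy sketch; `logits` stands in for whatever your output layer produces):

```python
import numpy as np

def softmax(logits):
    """Map raw output activations to positive values that sum to one."""
    shifted = logits - np.max(logits)  # shift by the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs is roughly [0.659, 0.242, 0.099] and sums to 1,
# so each entry can be read as P(letter | input).
```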
See David MacKay's textbook for a nice intro to neural nets that will make the probabilistic connection clear. Also take a look at this paper from Geoff Hinton's group, which describes the task of predicting the next character given the context, for details on the correct representation and activation/activity functions (although be aware that their method is non-trivial and uses a recurrent net with a different training method).