I have a synthetic dataset that contains multi-word answers, and I simply use a GRU to generate them. The structure of the dataset is as follows:
1 John