Gradient accumulation in an RNN
I ran into some GPU memory issues when running a large RNN, but I want to keep my effective batch size reasonable, so I wanted to try gradient accumulation. In a network that predicts the output in one go, that seems straightforward, but in an RNN you do multiple forward passes, one for each input time step. Because of that, I fear that my implementation does not work as intended. I started from user albanD's excellent examples here, but I think they need to be modified for an RNN. The reason
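For context, here is a minimal sketch of the standard accumulation pattern applied to an RNN classifier (the model, sizes, and `accum_steps` value are all made up for illustration). The point is that the RNN's loop over time steps happens inside a single forward/backward call, so gradients from several micro-batches still accumulate in `.grad` before one optimizer step:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a single-layer GRU classifier.
input_size, hidden_size, num_classes = 16, 32, 4
accum_steps = 4            # micro-batches to accumulate before each optimizer step
micro_batch, seq_len = 8, 20

class RNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.rnn(x)           # runs all time steps internally
        return self.head(out[:, -1])   # classify from the last hidden state

model = RNNClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(micro_batch, seq_len, input_size)       # dummy inputs
    y = torch.randint(num_classes, (micro_batch,))          # dummy labels
    loss = criterion(model(x), y) / accum_steps  # scale so the accumulated sum
    loss.backward()                              # matches one full-batch gradient
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Whether this pattern carries over unchanged when the forward pass is itself written as an explicit Python loop over time steps is exactly what I am unsure about.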