:) I am trying to implement self-play with PPO. Suppose we have a game with two agents. We control one player on each side and receive information such as the observation and reward after each