State dependent action set in reinforcement learning


Question


How do people deal with problems where the legal actions differ from state to state? In my case I have about 10 actions in total and the legal action sets do not overlap, meaning that in certain types of states the same 3 actions are always legal, and those actions are never legal in the other types of states.

I'm also interested in seeing whether the solutions would be different if the legal action sets overlapped.

For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q-value to choose when constructing the target value (i.e. instead of choosing the max, I choose the max among the legal actions...).
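A rough sketch of what I mean, assuming I have a boolean legal-action mask for each state (the `legal_mask` and `q_network` names below are just placeholders, not any real library):

    import numpy as np

    def masked_max_q(q_values, legal_mask):
        # q_values:   shape (num_actions,), Q-network output for the next state s'
        # legal_mask: boolean, shape (num_actions,), True where the action is legal in s'
        # Illegal entries are set to -inf so they can never win the max.
        return np.where(legal_mask, q_values, -np.inf).max()

    # Target for one transition (s, a, r, s'):
    # target = r + gamma * masked_max_q(q_network(s_next), legal_mask_of(s_next))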

For policy-gradient type methods I'm less sure what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?


Answer 1:


Currently this problem does not seem to have one universal, straightforward answer. Maybe that is because it is not that much of an issue?

Your suggestion of choosing the best Q-value among the legal actions only is indeed one of the proposed ways to handle this. For policy-gradient methods you can achieve a similar result by masking out the illegal actions and scaling up the probabilities of the remaining actions accordingly.
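One common way to do the masking and rescaling in a single step is to push the illegal logits to minus infinity before the softmax. A minimal NumPy sketch (the function and argument names are made up for illustration):

    import numpy as np

    def masked_action_probs(logits, legal_mask):
        # logits:     shape (num_actions,), raw network outputs
        # legal_mask: boolean, shape (num_actions,), True for legal actions
        masked = np.where(legal_mask, logits, -np.inf)   # illegal -> -inf
        masked = masked - masked.max()                   # numerical stability
        exps = np.exp(masked)                            # exp(-inf) == 0
        return exps / exps.sum()                         # legal probs sum to 1

Because the illegal actions end up with probability zero, the log-probabilities used in the policy-gradient loss only ever involve legal actions. In a differentiable framework you would typically use a large negative constant such as -1e9 instead of a literal -inf to keep the gradients well-behaved.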

Another approach would be giving a negative reward for choosing an illegal action - or ignoring the choice, not making any change in the environment, and returning the same reward as before. In one of my personal experiments (a Q-learning method) I chose the latter and the agent learned what it had to learn, but it used the illegal actions as a 'no action' action from time to time. It wasn't really a problem for me, but negative rewards would probably eliminate this behaviour.
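A rough sketch of that 'ignore the illegal choice' variant as an environment wrapper; the legal_actions() helper and the (obs, reward, done, info) step signature are assumptions about your environment, not a real API:

    class IllegalActionWrapper:
        """Treats an illegal action as a no-op, optionally adding a penalty."""

        def __init__(self, env, penalty=0.0):
            self.env = env
            self.penalty = penalty        # e.g. -1.0 to discourage illegal picks
            self._last_obs = None
            self._last_reward = 0.0

        def reset(self):
            self._last_obs = self.env.reset()
            self._last_reward = 0.0
            return self._last_obs

        def step(self, action):
            if action not in self.env.legal_actions():    # assumed helper
                # Environment unchanged: repeat the previous observation and
                # reward, optionally adding a negative penalty.
                return self._last_obs, self._last_reward + self.penalty, False, {"illegal": True}
            obs, reward, done, info = self.env.step(action)
            self._last_obs, self._last_reward = obs, reward
            return obs, reward, done, info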

As you can see, these solutions don't change or differ when the legal action sets overlap.

Answering what you've asked in the comments - I don't believe you can train the agent under the described conditions without it learning the legal/illegal action rules. That would require, for example, something like a separate network for each set of legal actions, and doesn't sound like the best idea (especially if there are lots of possible legal action sets).

But is learning these rules actually hard?

You have to answer some questions yourself: is the condition that makes an action illegal hard to express/articulate? That is, of course, environment-specific, but I would say it is not that hard to express most of the time, and agents simply learn these rules during training. If it is hard, does your environment provide enough information about the state?




Answer 2:


Not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible, then you simply reflect this in the reward function (a big negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are not desirable in those specific states.

In exploration mode, the agent might still choose to take illegal actions. In exploitation mode, however, it should avoid them.




Answer 3:


I recently built a DDQN agent for Connect Four and had to address this. Whenever a column was chosen that was already full of tokens, I set the reward equal to the reward for losing the game. This was -100 in my case and it worked well.

In Connect Four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equal to losing rather than some smaller negative number.

So if you set the reward for an illegal move to something greater (less negative) than the reward for losing, you'll have to consider what the implications of allowing illegal moves during exploration are in your domain.



Source: https://stackoverflow.com/questions/50012295/state-dependent-action-set-in-reinforcement-learning
