How does Monte Carlo Search Tree work?

问题

Trying to learn MCST using YouTube videos and papers like this one.

http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/grand-challenge.pdf

However I am not having much of a luck understanding the details beyond the high level theoretical explanations. Here are some quotes from the paper above and questions I have.

Selection Phase: MCTS iteratively selects the highest scoring child node of the current state. If the current state is the root node, where did these children come from in the first place? Wouldn't you have a tree with just a single root node to begin with? With just a single root node, do you get into Expansion and Simulation phase right away?
If MCTS selects the highest scoring child node in Selection phase, you never explore other children or possibly even a brand new child whilst going down the levels of the tree?
How does the Expansion phase happen for a node? In the diagram above, why did it not choose leaf node but decided to add a sibling to the leaf node?
During the Simulation phase, stochastic policy is used to select legal moves for both players until the game terminates. Is this stochastic policy a hard-coded behavior and you are basically rolling a dice in the simulation to choose one of the possible moves taking turns between each player until the end?
The way I understand this is you start at a single root node and by repeating the above phases you construct the tree to a certain depth. Then you choose the child with the best score at the second level as your next move. The size of the tree you are willing to construct is basically your hard AI responsiveness requirement right? Since while the tree is being constructed the game will stall and compute this tree.

回答1:

Selection Phase: MCTS iteratively selects the highest scoring child node of the current state. If the current state is the root node, where did these children come from in the first place? Wouldn't you have a tree with just a single root node to begin with? With just a single root node, do you get into Expansion and Simulation phase right away?

The selection step is typically implemented not to actually choose among nodes which really exist in the tree (having been created through the Expansion step). It is typically ipmlemented to choose among all possible successor states of the game state matching your current node.

So, at the very beginning, when you have just a root node, you'll want your Selection step to still be able to select one out of all the possible successor game states (even if they don't have matching nodes in the tree yet). Typically you'll want a very high score (infinite, or some very large constant) for game states which have never been visited yet (which don't have nodes in the tree yet). This way, your Selection Step will always randomly select among any states that don't have a matching node yet, and only really use the exploration vs. exploitation trade-off in cases where all possible game states already have a matching node in the tree.

If MCTS selects the highest scoring child node in Selection phase, you never explore other children or possibly even a brand new child whilst going down the levels of the tree?

The ''score'' used by the Selection step should typically not just be the average of all outcomes of simulations going through that node. It should typically be a score consisting of two parts; an "exploration" part, which is high for nodes that have been visited relatively infrequently, and an "exploitation" part, which is high for nodes which appear to be good moves so far (where many simulations going through that node ended in a win for the player who's allowed to choose a move to make). This is described in Section 3.4 of the paper you linked. The W(s, a) / N(s, a) is the exploitation part (simply average score), and the B(s, a) is the exploration part.

How does the Expansion phase happen for a node? In the diagram above, why did it not choose leaf node but decided to add a sibling to the leaf node?

The Expansion step is typically implemented to simply add a node corresponding to the final game state selected by the Selection Step (following what I answered to your first question, the Selection Step will always end in selecting one game state that has never been selected before).

During the Simulation phase, stochastic policy is used to select legal moves for both players until the game terminates. Is this stochastic policy a hard-coded behavior and you are basically rolling a dice in the simulation to choose one of the possible moves taking turns between each player until the end?

The most straightforward (and probably most common) implementation is indeed to play completely at random. It is also possible to do this differently though. You could for example use heuristics to create a bias towards certain actions. Typically, completely random play is faster, allowing you to run more simulations in the same amount of processing time. However, it typically also means every individual simulation is less informative, meaning you actually need to run more simulations for MCTS to play well.

The way I understand this is you start at a single root node and by repeating the above phases you construct the tree to a certain depth. Then you choose the child with the best score at the second level as your next move. The size of the tree you are willing to construct is basically your hard AI responsiveness requirement right? Since while the tree is being constructed the game will stall and compute this tree.

MCTS does not uniformly explore all parts of the tree to the same depth. It has a tendency to explore parts which appear to be interesting (strong moves) deeper than parts which appear to be uninteresting (weak moves). So, typically you wouldn't really use a depth limit. Instead, you would use a time limit (for example, keep running iterations until you've spent 1 second, or 5 seconds, or 1 minute, or whatever amount of processing time you allow), or an iteration count limit (for example, allow it to run 10K or 50K or any number of simulations you like).

回答2:

Basically, Monte Carlo is : try randomly many times(*) and then keep the move that led to the best outcome most of the times.

(*) : the number of times and the depth depends on the speed of the decision you want to acheive.

So the root node is always the current game state with immediate children being your possible moves. If you can do 2 moves (yes/no, left/right,...) then you have 2 sub-nodes.

If you cannot do any moves (it may happen depending on the game) then you do not have any decision to make, then Montec Carlo is useless for this move.

If you have X possible moves (chess game) then each possible move is a direct child node.

Then, (in a 2 player game), evey level is alternating "your moves", "opponent moves" and so on.

How to traverse the tree should be random (uniform).

Your move 1 (random move of sub-level 1)

His move 4 (random move of sub-level 2)

Your move 3 (random move of sub-level 3) -> win yay

Pick a reference maximum depth and evaluate how many times you win or lose (or have a sot of evaluation function if the game is not finished after X depth).

You repeat the operation Y times (being quite large) and you select the immediate child node (aka: your move) that leads to you winning most of the times.

This is to evaluate which move you should do now. After this, the opponent moves and it is your turn again. So you have to re-create a tree with the root node being the new current situation and redo the Monte Carlo technique to guess what is your best possible move. And so on.

来源：https://stackoverflow.com/questions/44230911/how-does-monte-carlo-search-tree-work

标签

montecarlo

monte-carlo-tree-search