What is the difference between value iteration and policy iteration?
Question: In reinforcement learning, what is the difference between policy iteration and value iteration?

As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy.

My doubt is this: if you select a random policy π in policy iteration, how is it guaranteed to end up as the optimal policy, even if we try several random policies?

Answer 1: Let's look at them side by side.
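To make the comparison concrete, here is a minimal Python sketch of both algorithms on a tiny MDP. Everything in it is an assumption made for illustration and does not come from the question: the two-state, two-action MDP in `P`, the discount `GAMMA = 0.9`, the tolerance, and the helper names `q_value`, `greedy_policy`, and `policy_evaluation`.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: P[state][action] = list of (probability, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
STATES = list(P)
ACTIONS = [0, 1]


def q_value(V, s, a):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def greedy_policy(V):
    """Policy that picks, in every state, the action with the highest q_value."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}


def value_iteration(tol=1e-10):
    """Repeat the Bellman optimality backup (max over actions) until V stops changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            # The policy is only read off at the end, greedily from V.
            return greedy_policy(V), V


def policy_evaluation(policy, tol=1e-10):
    """Repeat the Bellman expectation backup for a fixed policy (no max)."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: q_value(V, s, policy[s]) for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            return V


def policy_iteration():
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in STATES}  # arbitrary starting policy, chosen only once
    while True:
        V = policy_evaluation(policy)
        improved = greedy_policy(V)
        if improved == policy:
            return policy, V
        policy = improved


if __name__ == "__main__":
    print("value iteration: ", value_iteration())
    print("policy iteration:", policy_iteration())
```

In this sketch the only arbitrary choice in policy iteration is the starting policy. Every later policy is obtained by acting greedily with respect to the value function of the previous one, which by the policy improvement theorem never makes the policy worse. In a finite MDP there are finitely many deterministic policies, so the loop stops at an optimal one. Value iteration instead folds the max over actions into every backup and only extracts a policy at the end.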