Planning and Learning with Tabular Methods

We develop a unified view of reinforcement learning methods that require a model of the environment, such as dynamic programming and heuristic search, and methods that can be used without a model, such as Monte Carlo and temporal-difference methods. These are respectively called model-based and model-free reinforcement learning methods. Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning. Although there are real differences between these two kinds of methods, there are also great similarities.

  • All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.
  • They compute value functions by updates or backup operations applied to simulated experience.


  • By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.
  • A distribution model produces a description of all possibilities and their probabilities; a sample model produces just one of the possibilities, sampled according to those probabilities.
  • Given a state and an action, a distribution model can generate all possible state transitions, whereas a sample model gives only one possible transition.
  • Given a state and a policy, a distribution model yields every possible episode along with its probability of occurring, whereas a sample model yields only a single episode.

In short, a distribution model contains more information than a sample model, but in practice a sample model is often easier to obtain. Simply put, a distribution model contains the transition probabilities for all states, while a sample model offers only a narrow glimpse of the whole. DP uses a distribution model, while MC uses a sample model. A model is a representation of the environment (not necessarily real or fully correct) that can be used to produce simulated experience.
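The contrast between the two model types can be sketched in code. This is a minimal illustration for a hypothetical two-state MDP; the states, action, and probabilities below are made up for the example, not taken from any specific problem.

```python
import random

# Distribution model: maps (state, action) -> {(next_state, reward): probability}.
# It exposes ALL possible transitions and their probabilities.
distribution_model = {
    ("s0", "a"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s1", "a"): {("s1", 0.0): 1.0},
}

def sample_model(state, action, rng=random):
    """Sample model: returns just ONE (next_state, reward) outcome,
    drawn according to the underlying probabilities."""
    outcomes = distribution_model[(state, action)]
    transitions = list(outcomes.keys())
    probs = list(outcomes.values())
    return rng.choices(transitions, weights=probs, k=1)[0]

# The distribution model describes every possibility with its probability...
print(distribution_model[("s0", "a")])
# ...while the sample model yields a single simulated transition.
print(sample_model("s0", "a"))
```

Note that here the sample model is built on top of the distribution model only for convenience of illustration; in practice one often has a simulator that can produce samples without ever knowing the full transition probabilities.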


The computational process of generating or improving a policy from a model is called planning:

Note that the planning discussed here is state-space planning, which has two characteristics:

  • It improves the policy by computing a value function.
  • It computes the value function from simulated experience.

At the core of both planning (e.g., DP) and learning (e.g., MC, TD) methods is the estimation of value functions by backup update operations. The difference is that planning uses simulated experience generated by a model, whereas learning methods use real experience generated by the actual environment. Because both fit the state-space planning structure above, many ideas and algorithms can be borrowed in either direction; in practice, the value-function update rule from a learning method is often substituted into a planning method. For example, combining Q-learning with planning yields the random-sample one-step tabular Q-planning method:
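The algorithm can be sketched as follows: repeatedly pick a state-action pair at random, ask the sample model for one simulated transition, and apply the one-step tabular Q-learning update to that simulated experience. The function below is a minimal sketch; the `sample_model(s, a) -> (reward, next_state)` interface and the toy environment used in the usage note are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_steps=10000,
               alpha=0.1, gamma=0.95, rng=random):
    """Random-sample one-step tabular Q-planning (a sketch).

    `sample_model(s, a)` is assumed to return a tuple (reward, next_state)."""
    Q = defaultdict(float)  # tabular action-value estimates, default 0
    for _ in range(n_steps):
        # 1. Select a state and an action at random.
        s = rng.choice(states)
        a = rng.choice(actions)
        # 2. Query the sample model for one simulated transition.
        r, s_next = sample_model(s, a)
        # 3. Apply the one-step tabular Q-learning update to the
        #    simulated experience: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

For instance, with a hypothetical deterministic model where moving right from `s0` reaches an absorbing state `s1` with reward 1, the learned values approach Q(s1, right) ≈ 0 and Q(s0, right) ≈ 1, i.e., the optimal values with respect to the model.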

One-step tabular Q-learning eventually converges to an optimal policy for the real environment, whereas random-sample one-step tabular Q-planning converges to an optimal policy for the model.