Planning and Learning with Tabular Methods

We develop a unified view of reinforcement learning methods that require a model of the environment, such as dynamic programming and heuristic search, and methods that can be used without a model, such as Monte Carlo and temporal-difference methods. These are respectively called model-based and model-free reinforcement learning methods. Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning. Although there are real differences between these two kinds of methods, there are also great similarities.

All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.
They compute value functions by updates or backup operations applied to simulated experience.

Model

By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.
A distribution model produces a description of all possibilities and their probabilities. A sample model produces just one of the possibilities, sampled according to those probabilities.
Given a state and an action, a distribution model can generate every possible transition, whereas a sample model gives only one possible transition.
Given a state and a policy, a distribution model can generate every possible episode along with its probability, whereas a sample model gives only one episode.

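In short, a distribution model contains more information than a sample model, but in practice a sample model is often easier to obtain. Put simply, a distribution model holds the transition probabilities for every state, while a sample model offers only a glimpse: one sampled outcome at a time. DP uses a distribution model, whereas for MC a sample model is enough. A model is a representation of the environment (not necessarily true or fully accurate) that can be used to produce simulated experience.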
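
To make the distinction concrete, here is a minimal Python sketch. The two-state environment, its state and action names, and the helper names `distribution_model` and `sample_model` are all hypothetical, chosen only for illustration. The sample model here is derived from the distribution model, which reflects the point that the latter carries more information; in practice a sample model is often just a simulator with no explicit probabilities.

```python
import random

# Hypothetical two-state toy environment, invented purely for illustration.
# Distribution model: (state, action) -> list of (probability, next_state, reward).
DISTRIBUTION_MODEL = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 1.0)],
}


def distribution_model(state, action):
    """Return every possible (probability, next_state, reward) outcome."""
    return DISTRIBUTION_MODEL[(state, action)]


def sample_model(state, action, rng):
    """Return one (next_state, reward) outcome, drawn according to the probabilities."""
    outcomes = DISTRIBUTION_MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)
all_outcomes = distribution_model("s0", "go")  # full list, with probabilities
one_outcome = sample_model("s0", "go", rng)    # a single sampled transition
```

Calling `sample_model` repeatedly would, over many draws, approximate the probabilities that `distribution_model` states explicitly.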

Planning

Planning is the computational process that takes a model as input and produces or improves a policy.

Note that the planning discussed here is always state-space planning, which has two defining features:

The policy is improved by computing value functions.
The value functions are computed from simulated experience.
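
As a rough illustration of these two features, the sketch below plans with a distribution model in DP style: it repeatedly backs up values from model-generated outcomes, then reads off a greedy policy. The toy model, the discount factor 0.9, and the names `backup` and `plan` are assumptions made for this example. The procedure is value iteration, used here as one familiar instance of state-space planning rather than the only one.

```python
# Hypothetical toy distribution model, invented for illustration:
# (state, action) -> list of (probability, next_state, reward).
MODEL = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 1.0)],
}
STATES = ["s0", "s1"]
ACTIONS = ["go", "stay"]
GAMMA = 0.9  # discount factor (assumed)


def backup(values, state, action):
    """Expected one-step return of taking `action` in `state` under the model."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in MODEL[(state, action)])


def plan(sweeps=200):
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        # Feature 2: value estimates are computed by backups over model outcomes.
        for s in STATES:
            values[s] = max(backup(values, s, a) for a in ACTIONS)
    # Feature 1: the policy is improved by acting greedily on those values.
    policy = {s: max(ACTIONS, key=lambda a: backup(values, s, a)) for s in STATES}
    return values, policy


values, policy = plan()
```

The resulting `policy` is optimal for `MODEL`; whether it is any good in the real environment depends on how accurate the model is.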

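At the core of both planning methods (such as DP) and learning methods (such as MC and TD) is the estimation of value functions by backing-up update operations. The difference is that planning uses simulated experience generated by a model, whereas learning uses real experience generated by the environment. Both nevertheless fit the state-space planning structure above, which means many ideas and algorithms transfer between them; in practice, the value-function update rule from a learning method is often substituted for the update rule in a planning method. For example, combining Q-learning with planning gives the random-sample one-step tabular Q-planning method, which repeatedly picks a state-action pair at random, queries a sample model for the next state and reward, and applies the one-step tabular Q-learning update to that simulated transition.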

One-step tabular Q-learning eventually converges to an optimal policy for the real environment, whereas random-sample one-step tabular Q-planning converges to an optimal policy for the model.
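
The random-sample one-step tabular Q-planning loop can be sketched as follows. This is a minimal sketch under assumptions: the toy model, the step size `alpha`, the discount `gamma`, the number of updates, and the function names are all hypothetical. With a constant step size the estimates keep fluctuating slightly around their limits instead of settling exactly.

```python
import random

# Hypothetical toy model, invented for illustration:
# (state, action) -> list of (probability, next_state, reward).
MODEL = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 1.0)],
}
STATES = ["s0", "s1"]
ACTIONS = ["go", "stay"]


def sample_model(state, action, rng):
    """Draw one (next_state, reward) outcome from the model."""
    outcomes = MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def q_planning(num_updates=20000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(num_updates):
        # 1. Select a state and an action at random.
        s = rng.choice(STATES)
        a = rng.choice(ACTIONS)
        # 2. Ask the sample model for a simulated next state and reward.
        s2, r = sample_model(s, a, rng)
        # 3. Apply the one-step tabular Q-learning update to (s, a, r, s2).
        target = r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q


q = q_planning()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
```

Because every transition comes from `sample_model` and never from the real environment, `greedy` is a policy for the model, which matches the convergence statement above.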