We develop a unified view of reinforcement learning methods that require a model of the environment, such as dynamic programming and heuristic search, and methods that can be used without a model, such as Monte Carlo and temporal-difference methods. These are respectively called model-based and model-free reinforcement learning methods. Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning. Although there are real differences between these two kinds of methods, there are also great similarities.

• All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.
• They compute value functions by updates or backup operations applied to simulated experience.

## Model

• By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.
• A distribution model produces a description of all possibilities and their probabilities. A sample model produces just one of the possibilities, sampled according to those probabilities.
• Given a state and an action, a distribution model can generate all possible state transitions, while a sample model gives only one possible transition.
• Given a state and a policy, a distribution model can produce every possible episode together with its probability of occurring, while a sample model can produce only a single episode.
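The contrast above can be sketched on a toy MDP. The transition table and function names below are hypothetical illustrations, not from the text:

```python
import random

# Hypothetical toy dynamics: P[(state, action)] -> list of
# (next_state, reward, probability) triples.
P = {
    ("s0", "a"): [("s1", 1.0, 0.7), ("s2", 0.0, 0.3)],
}

def distribution_model(state, action):
    """Return ALL possible (next_state, reward, prob) outcomes."""
    return P[(state, action)]

def sample_model(state, action, rng=random):
    """Return ONE (next_state, reward), drawn with the correct probability."""
    outcomes = P[(state, action)]
    r = rng.random()
    cum = 0.0
    for next_state, reward, prob in outcomes:
        cum += prob
        if r < cum:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][1]

print(distribution_model("s0", "a"))  # full description of possibilities
print(sample_model("s0", "a"))        # a single sampled possibility
```

Given a policy instead of a single action, repeating `sample_model` step by step yields one episode, whereas unrolling `distribution_model` enumerates every episode with its probability.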

## Planning

• Improve the policy by computing value functions.
• Compute value functions from simulated experience.

At their core, both planning methods (e.g., DP) and learning methods (e.g., MC and TD) estimate value functions with backing-up update operations. The difference is that planning uses simulated experience generated by a model, whereas learning methods use real experience generated by the environment. Because both fit the state-space planning structure above, many ideas and algorithms transfer between them; in practice, the value-function update rule from a learning method often replaces the update rule in a planning method. For example, combining Q-learning with planning yields the random-sample one-step tabular Q-planning method:
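A minimal sketch of that algorithm, assuming a sample model with signature `model(s, a) -> (next_state, reward)`; the toy model and all names here are illustrative:

```python
import random
from collections import defaultdict

def q_planning(model, states, actions, steps=10000, alpha=0.1, gamma=0.9, rng=random):
    """Random-sample one-step tabular Q-planning."""
    Q = defaultdict(float)
    for _ in range(steps):
        # 1. Select a state-action pair at random.
        s = rng.choice(states)
        a = rng.choice(actions)
        # 2. Ask the sample model for a next state and reward.
        s2, r = model(s, a)
        # 3. Apply one-step tabular Q-learning to the simulated transition.
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

# Hypothetical deterministic sample model: "right" from state i goes to i+1;
# leaving the last state yields reward 1 and returns to state 0.
def toy_model(s, a):
    if a == "right":
        return (0, 1.0) if s == 3 else (s + 1, 0.0)
    return (max(s - 1, 0), 0.0)

Q = q_planning(toy_model, states=[0, 1, 2, 3], actions=["left", "right"])
print(Q[(3, "right")] > Q[(3, "left")])  # planning learns to prefer the rewarding action
```

Note that only the source of experience changed: the update in step 3 is exactly the one-step tabular Q-learning update, applied to model-generated rather than real transitions.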

One-step tabular Q-learning eventually converges to an optimal policy for the real environment, while random-sample one-step tabular Q-planning converges to an optimal policy for the model.