Welcome to Course 4, Programming Assignment 2! We have learned about reinforcement learning algorithms for prediction and control in previous courses and extended those algorithms to large state spaces using function approximation. One example of this was in Assignment 2 of Course 3, where we implemented semi-gradient TD for prediction and used a neural network as the function approximator. In this notebook, we will build a reinforcement learning agent for control, again using a neural network for function approximation. This combination of neural network function approximators and reinforcement learning algorithms, often referred to as Deep RL, is an active area of research and has led to many impressive results (e.g., AlphaGo: https://deepmind.com/research/case-studies/alphago-the-story-so-far).
In this assignment, you will:
1  # Do not modify this cell! 
This section includes the function approximator that we use in our agent, a neural network. In Course 3 Assignment 2, we used a neural network as the function approximator for a policy evaluation problem. In this assignment, we will use a neural network for approximating the action-value function in a control problem. The main difference between approximating a state-value function and an action-value function using a neural network is that in the former the output layer only includes one unit, whereas in the latter the output layer includes as many units as the number of actions.
In the cell below, you will specify the architecture of the action-value neural network. More specifically, you will specify self.layer_sizes in the __init__() function.
We have already provided the get_action_values() and get_TD_update() methods. The former computes the action-value function by doing a forward pass, and the latter computes the gradient of the action-value function with respect to the weights, times the TD error. These get_action_values() and get_TD_update() methods are similar to the get_value() and get_gradient() methods that you implemented in Course 3 Assignment 2. The main difference is that in this notebook they are designed to be applied to batches of states instead of one state at a time. You will later use these functions for implementing the agent.
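To make the shapes concrete, here is a hedged sketch of the kind of forward pass such a network could perform; the ReLU activation and the weight layout (a list of {"W", "b"} dicts) are assumptions for illustration, not the assignment's provided code.

```python
import numpy as np

# Hypothetical forward pass for a one-hidden-layer action-value network.
# Layer layout and activation are assumptions, not the provided class.
def get_action_values(weights, states):
    """states: (batch_size, state_dim) -> action-values (batch_size, num_actions)."""
    psi = states @ weights[0]["W"] + weights[0]["b"]   # hidden pre-activation
    x = np.maximum(psi, 0)                             # ReLU hidden layer
    return x @ weights[1]["W"] + weights[1]["b"]       # one output unit per action

# Example with layer_sizes [5, 20, 3]: 5-d states, 20 hidden units, 3 actions
rng = np.random.RandomState(0)
weights = [
    {"W": rng.randn(5, 20), "b": np.zeros((1, 20))},
    {"W": rng.randn(20, 3), "b": np.zeros((1, 3))},
]
q = get_action_values(weights, rng.randn(4, 5))
print(q.shape)  # (4, 3): one action-value per action, for each state in the batch
```

Note how the batch dimension passes straight through: applying the network to a batch of states is just a matrix product, which is what lets the replay updates later in the notebook operate on whole minibatches.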
1  # Work Required: Yes. Fill in the code for layer_sizes in __init__ (~1 Line). 
Run the cell below to test your implementation of the __init__() function for ActionValueNetwork:
1  # Do not modify this cell! 
layer_sizes: [5, 20, 3]
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Expected output:
layer_sizes: [ 5 20 3]
In this assignment, you will use the Adam algorithm for updating the weights of your action-value network. As you may remember from Course 3 Assignment 2, the Adam algorithm is a more advanced variant of stochastic gradient descent (SGD). The Adam algorithm improves the SGD update with two concepts: adaptive vector step sizes and momentum. It keeps running estimates of the mean and second moment of the updates, denoted by $\mathbf{m}$ and $\mathbf{v}$ respectively:

$$\mathbf{m}_t = \beta_m \mathbf{m}_{t-1} + (1 - \beta_m) g_t$$
$$\mathbf{v}_t = \beta_v \mathbf{v}_{t-1} + (1 - \beta_v) g_t^2$$
Here, $\beta_m$ and $\beta_v$ are fixed parameters controlling the linear combinations above and $g_t$ is the update at time $t$ (generally the gradients, but here the TD error times the gradients).
Given that $\mathbf{m}$ and $\mathbf{v}$ are initialized to zero, they are biased toward zero. To get unbiased estimates of the mean and second moment, Adam defines $\mathbf{\hat{m}}$ and $\mathbf{\hat{v}}$ as:

$$\mathbf{\hat{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_m^t}, \qquad \mathbf{\hat{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_v^t}$$
The weights are then updated as follows:

$$\mathbf{w}_t = \mathbf{w}_{t-1} + \frac{\alpha}{\sqrt{\mathbf{\hat{v}}_t} + \epsilon}\, \mathbf{\hat{m}}_t$$

Here, $\alpha$ is the step-size parameter and $\epsilon$ is a small constant that keeps the denominator from being zero.
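Putting the equations together, here is a minimal Adam sketch for a single weight array. The class layout and default parameter values are assumptions (the assignment tracks m and v per network layer); the update is added rather than subtracted because $g$ here is the TD error times the gradient, an ascent direction.

```python
import numpy as np

# Minimal Adam sketch for one weight array; layout is an assumption.
class Adam:
    def __init__(self, shape, alpha=0.1, beta_m=0.9, beta_v=0.999, eps=1e-8):
        self.alpha, self.beta_m, self.beta_v, self.eps = alpha, beta_m, beta_v, eps
        self.m = np.zeros(shape)   # running mean of updates
        self.v = np.zeros(shape)   # running second moment of updates
        self.t = 0                 # step counter for bias correction

    def update_weights(self, w, g):
        self.t += 1
        self.m = self.beta_m * self.m + (1 - self.beta_m) * g
        self.v = self.beta_v * self.v + (1 - self.beta_v) * g ** 2
        m_hat = self.m / (1 - self.beta_m ** self.t)   # bias-corrected mean
        v_hat = self.v / (1 - self.beta_v ** self.t)   # bias-corrected 2nd moment
        return w + self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)

opt = Adam((2, 2))
w = opt.update_weights(np.zeros((2, 2)), np.ones((2, 2)))
print(w[0, 0])  # ~0.1: the first bias-corrected step moves each weight by about alpha
```

Notice that on the very first step the bias corrections make $\mathbf{\hat{m}} = g$ and $\mathbf{\hat{v}} = g^2$ exactly, which is why the estimates are unbiased despite being initialized to zero.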
In the cell below, you will implement the __init__() and update_weights() methods for the Adam algorithm. In __init__(), you will initialize self.m and self.v. In update_weights(), you will compute new weights given the input weights and an update $g$ (here td_errors_times_gradients) according to the equations above.
1  ### Work Required: Yes. Fill in code in __init__ and update_weights (~9-11 Lines). 
Run the following code to test your implementation of the __init__() function:
1  # Do not modify this cell! 
m[0]["W"] shape: (5, 2)
m[0]["b"] shape: (1, 2)
m[1]["W"] shape: (2, 3)
m[1]["b"] shape: (1, 3)
v[0]["W"] shape: (5, 2)
v[0]["b"] shape: (1, 2)
v[1]["W"] shape: (2, 3)
v[1]["b"] shape: (1, 3)
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Expected output:
m[0]["W"] shape: (5, 2)
m[0]["b"] shape: (1, 2)
m[1]["W"] shape: (2, 3)
m[1]["b"] shape: (1, 3)
v[0]["W"] shape: (5, 2)
v[0]["b"] shape: (1, 2)
v[1]["W"] shape: (2, 3)
v[1]["b"] shape: (1, 3)
Run the following code to test your implementation of the update_weights() function:
1  # Do not modify this cell! 
updated_weights[0]["W"]
[[1.03112528 2.08618453]
 [0.15531623 0.02412129]
 [0.76656476 0.65405898]
 [0.92569612 0.24916335]
 [0.92180119 0.72137957]]
updated_weights[0]["b"]
[[0.44392532 0.69588495]]
updated_weights[1]["W"]
[[ 0.13962892 0.48820826 0.41311548]
 [ 0.3958054 0.20738072 0.47172585]]
updated_weights[1]["b"]
[[0.48917533 0.61934122 1.48771198]]
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Expected output:
updated_weights[0]["W"]
[[1.03112528 2.08618453]
 [0.15531623 0.02412129]
 [0.76656476 0.65405898]
 [0.92569612 0.24916335]
 [0.92180119 0.72137957]]
updated_weights[0]["b"]
[[0.44392532 0.69588495]]
updated_weights[1]["W"]
[[ 0.13962892 0.48820826 0.41311548]
 [ 0.3958054 0.20738072 0.47172585]]
updated_weights[1]["b"]
[[0.48917533 0.61934122 1.48771198]]
In Course 3, you implemented agents that update value functions once for each sample. We can use a more efficient approach for updating value functions. You have seen an example of an efficient approach in Course 2 when implementing Dyna. The idea behind Dyna is to learn a model using sampled experience, obtain simulated experience from the model, and improve the value function using the simulated experience.
Experience replay is a simple method that can get some of the advantages of Dyna by saving a buffer of experience and using the data stored in the buffer as a model. This view of prior data as a model works because the data represents actual transitions from the underlying MDP. Furthermore, as a side note, this kind of model that is not learned and is simply a collection of experience can be called non-parametric, as it can be ever-growing, as opposed to a parametric model where the transitions are learned to be represented with a fixed set of parameters or weights.
We have provided the implementation of the experience replay buffer in the cell below. ReplayBuffer includes two main functions: append() and sample(). append() adds an experience transition to the buffer as an array that includes the state, action, reward, terminal flag (indicating termination of the episode), and next_state. sample() gets a batch of experiences from the buffer with size minibatch_size.
You will use the append() and sample() functions when implementing the agent.
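As a hedged sketch of what such a buffer could look like (the provided class may differ in details, e.g., exact constructor arguments; sampling with replacement is an assumption):

```python
import numpy as np
from collections import deque

# Illustrative replay buffer; the assignment's provided class may differ.
class ReplayBuffer:
    def __init__(self, size, minibatch_size, seed=0):
        self.buffer = deque(maxlen=size)     # oldest transitions fall off the front
        self.minibatch_size = minibatch_size
        self.rand_generator = np.random.RandomState(seed)

    def append(self, state, action, reward, terminal, next_state):
        self.buffer.append([state, action, reward, terminal, next_state])

    def sample(self):
        # Sample minibatch_size transitions (with replacement) from the buffer
        idxs = self.rand_generator.choice(len(self.buffer),
                                          size=self.minibatch_size)
        return [self.buffer[i] for i in idxs]

buf = ReplayBuffer(size=100, minibatch_size=4)
for i in range(10):
    buf.append([float(i)], i % 3, -1.0, False, [float(i + 1)])
print(len(buf.sample()))  # 4: one minibatch of stored transitions
```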
1  # Do not modify this cell! 
In this assignment, you will use a softmax policy. One advantage of a softmax policy is that it explores according to the action-values, meaning that an action with a moderate value has a higher chance of getting selected than an action with a lower value. Contrast this with an $\epsilon$-greedy policy, which does not consider the individual action-values when choosing an exploratory action in a state and instead chooses randomly when doing so.
The probability of selecting each action according to the softmax policy is shown below:

$$\pi(a|s) \doteq \frac{e^{Q(s,a)/\tau}}{\sum_{b} e^{Q(s,b)/\tau}}$$
where $\tau$ is the temperature parameter, which controls how much the agent focuses on the highest-valued actions. The smaller the temperature, the more the agent selects the greedy action. Conversely, when the temperature is high, the agent selects among actions more uniformly at random.
Given that a softmax policy exponentiates action-values, if those values are large, exponentiating them could get very large. To implement the softmax policy in a numerically stable way, we often subtract the maximum action-value from the action-values. If we do so, the probability of selecting each action looks as follows:

$$\pi(a|s) \doteq \frac{e^{(Q(s,a) - \max_{c} Q(s,c))/\tau}}{\sum_{b} e^{(Q(s,b) - \max_{c} Q(s,c))/\tau}}$$
In the cell below, you will implement the softmax() function. In order to do so, you could break the above computation into smaller steps:
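As an illustration of those steps (not necessarily the assignment's exact solution), a numerically stable batch softmax can be sketched as follows; the shapes and keepdims choices are assumptions consistent with the test output below.

```python
import numpy as np

# Numerically stable softmax over a batch of action-values (illustrative).
def softmax(action_values, tau=1.0):
    # action_values: (batch_size, num_actions)
    preferences = action_values / tau
    max_pref = np.max(preferences, axis=1, keepdims=True)   # per-state maximum
    exp_pref = np.exp(preferences - max_pref)               # safe to exponentiate
    return exp_pref / np.sum(exp_pref, axis=1, keepdims=True)

# Even huge action-values are handled without overflow:
probs = softmax(np.array([[1000.0, 1001.0], [3.0, 1.0]]))
print(np.round(probs, 4))  # rows: [0.2689 0.7311] and [0.8808 0.1192]
```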
1  def softmax(action_values, tau=1.0): 
Run the cell below to test your implementation of the softmax() function:
1  # Do not modify this cell! 
action_probs
[[0.25849645 0.01689625 0.05374514 0.67086216]
 [0.84699852 0.00286345 0.13520063 0.01493741]]
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Expected output:
action_probs
[[0.25849645 0.01689625 0.05374514 0.67086216]
 [0.84699852 0.00286345 0.13520063 0.01493741]]
In this section, you will combine components from the previous sections to write up an RLGlue agent. The main component that you will implement is the action-value network update with experience sampled from the experience replay buffer.
At time $t$, we have an actionvalue function represented as a neural network, say $Q_t$. We want to update our actionvalue function and get a new one we can use at the next timestep. We will get this $Q_{t+1}$ using multiple replay steps that each result in an intermediate actionvalue function $Q_{t+1}^{i}$ where $i$ indexes which replay step we are at.
In each replay step, we sample a batch of experiences from the replay buffer and compute a minibatch Expected SARSA update. Across these N replay steps, we will use the current "un-updated" action-value network at time $t$, $Q_t$, for computing the action-values of the next states. This contrasts with using the most recent action-values from the last replay step, $Q_{t+1}^{i}$. We make this choice to have targets that are stable across replay steps. Here is the pseudocode for performing the updates:
As you can see in the pseudocode, after sampling a batch of experiences, we do many computations. The basic idea, however, is that we are looking to compute a form of TD error. In order to do so, we can take the following steps:
For the third step above, you can start by computing $\pi(b \mid s') Q_t(s', b)$ followed by a summation to get $\hat{v}_\pi(s') = \left(\sum_{b} \pi(b \mid s') Q_t(s', b)\right)$. $\hat{v}_\pi(s')$ is an estimate of the value of the next state. Note that for terminal next states, $\hat{v}_\pi(s') = 0$. Finally, we add the rewards to the discount times $\hat{v}_\pi(s')$.
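The target computation just described can be sketched with NumPy as below; the array names, the uniform placeholder policy, and the random action-values are illustrative assumptions.

```python
import numpy as np

# Sketch of computing r + gamma * v_hat_pi(s') for a batch (illustrative).
rng = np.random.RandomState(0)
batch_size, num_actions, discount = 4, 3, 0.99
q_next = rng.randn(batch_size, num_actions)          # Q_t(s', b) from the fixed network
probs_next = np.full((batch_size, num_actions), 1.0 / num_actions)  # pi(b | s')
rewards = rng.randn(batch_size)
terminals = np.array([0, 0, 1, 0])                   # 1 marks a terminal s'

v_next = np.sum(probs_next * q_next, axis=1)         # v_hat_pi(s') per transition
v_next = v_next * (1 - terminals)                    # terminal next states have value 0
targets = rewards + discount * v_next                # one TD target per transition
print(targets.shape)  # (4,)
```

Subtracting the network's current estimates $Q(s, a)$ for the taken actions from these targets then yields the 1D array of TD errors.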
You will implement these steps in the get_td_error() function below, which, given a batch of experiences (including states, next_states, actions, rewards, terminals), the fixed action-value network (current_q), and the action-value network being updated (network), computes the TD error in the form of a 1D array of size batch_size.
1  ### Work Required: Yes. Fill in code in get_td_error (~9 Lines). 
Run the following code to test your implementation of the get_td_error() function:
1  # Do not modify this cell! 
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Now that you have implemented the get_td_error() function, you can use it to implement the optimize_network() function. In this function, you will:
1) compute the TD errors using get_td_error(),
2) use the get_TD_update() function of network to calculate the gradients times TD errors, and
3) pass the result to the optimizer to update the network's weights.

1  ### Work Required: Yes. Fill in code in optimize_network (~2 Lines). 
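One detail worth illustrating: the TD errors are per-transition scalars, but the network update operates on its outputs, so only the taken action's output unit should receive each TD error. A sketch of that wiring follows; the commented-out calls are placeholders, since their exact signatures come from the assignment's API.

```python
import numpy as np

# Route per-transition TD errors to the taken action's output unit.
batch_size, num_actions = 4, 3
td_error = np.array([0.5, -1.0, 0.25, 2.0])    # as returned by get_td_error()
actions = np.array([0, 2, 1, 0])               # actions taken in the batch

delta_mat = np.zeros((batch_size, num_actions))
delta_mat[np.arange(batch_size), actions] = td_error   # scatter by (row, action)

# Placeholder calls into the assignment's API (signatures assumed):
# td_update = network.get_TD_update(states, delta_mat)
# new_weights = optimizer.update_weights(network.get_weights(), td_update)
print(delta_mat[0])  # [0.5 0.  0. ]
```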
Run the following code to test your implementation of the optimize_network() function:
1  # Do not modify this cell! 
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Now that you have implemented the optimize_network() function, you can implement the agent. In the cell below, you will fill in the agent_step() and agent_end() functions. You should:
1) select an action using the softmax policy (in agent_step()),
2) append the most recent transition to the replay buffer, and
3) update the network weights with the optimize_network() function that you implemented above.

1  ### Work Required: Yes. Fill in code in agent_step and agent_end (~7 Lines). 
Run the following code to test your implementation of the agent_step() function:
1  # Do not modify this cell! 
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Run the following code to test your implementation of the agent_end() function:
1  # Do not modify this cell! 
Passed the asserts! (Note: These are however limited in scope, additional testing is encouraged.)
Now that you implemented the agent, we can use it to run an experiment on the Lunar Lander problem. We will plot the learning curve of the agent to visualize learning progress. To plot the learning curve, we use the sum of rewards in an episode as the performance measure. We have provided for you the experiment/plot code in the cell below which you can go ahead and run. Note that running the cell below has taken approximately 10 minutes in prior testing.
1  def run_experiment(environment, agent, environment_parameters, agent_parameters, experiment_parameters): 
Run the cell below to see the comparison between the agent that you implemented and a random agent for one run of 300 episodes. Note that the plot_result() function smooths the learning curve by applying a sliding window to the performance measure.
1  plot_result(["expected_sarsa_agent", "random_agent"]) 
In the following cell you can visualize the performance of the agent with a correct implementation. As you can see, the agent initially crashes quite quickly (Episode 0). Then, the agent learns to avoid crashing by expending fuel and staying far above the ground. Finally, however, it learns to land smoothly within the landing zone demarcated by the two flags (Episode 275).
In the learning curve above, you can see that the sum of rewards per episode has quite high variance at the beginning. However, the performance seems to be improving. The experiment that you ran was for 300 episodes and 1 run. To understand how the agent performs in the long run, we provide below the learning curve for the agent trained for 3000 episodes, with performance averaged over 30 runs.
You can see that the agent learns a reasonably good policy within 3000 episodes, attaining a sum of rewards greater than 200. Note that because of the high variance in the agent's performance, we also smoothed the learning curve.
You have successfully implemented Course 4 Programming Assignment 2.
You have implemented an Expected Sarsa agent with a neural network and the Adam optimizer and used it for solving the Lunar Lander problem! You implemented different components of the agent including:
You tested the agent for a single parameter setting. In the next assignment, you will perform a parameter study on the step-size parameter to gain insight into the effect of the step size on the performance of your agent.
Note: Apart from using the Submit button in the notebook, you have to submit an additional zip file containing the 'npy' files that were generated from running the experiment cells. In order to do so:
1) Use File > Open to open the directory view of this assignment. Select the checkbox next to results.zip and click on Download.
2) Alternatively, you can download the results folder and run zip -jr results.zip results/ (the '-j' flag is required by the grader!).
These files account for 25% of the marks, so don't forget to submit them!
Welcome to your Course 3 Programming Assignment 4. In this assignment, you will implement Average Reward Softmax Actor-Critic in the Pendulum Swing-Up problem that you have seen earlier in the lectures. Through this assignment you will get hands-on experience in implementing actor-critic methods on a continuing task.
In this assignment, you will:
1. Implement a softmax actor-critic agent on a continuing task using the average reward formulation.
2. Understand how to parameterize the policy as a function to learn, in a discrete action environment.
3. Understand how to (approximately) sample the gradient of this objective to update the actor.
4. Understand how to update the critic using the differential TD error.
In this assignment, we will be using a Pendulum environment, adapted from Santamaría et al. (1998). This is also the same environment that we used in the lecture. The diagram below illustrates the environment.
The environment consists of a single pendulum that can swing 360 degrees. The pendulum is actuated by applying a torque at its pivot point. The goal is to get the pendulum to balance upright from its resting position (hanging down at the bottom with no velocity) and to maintain it upright as long as possible. The pendulum can move freely, subject only to gravity and the action applied by the agent.
The state is 2-dimensional and consists of the current angle $\beta \in [-\pi, \pi]$ (angle from the vertical upright position) and the current angular velocity $\dot{\beta} \in (-2\pi, 2\pi)$. The angular velocity is constrained in order to avoid damaging the pendulum system. If the angular velocity reaches this limit during simulation, the pendulum is reset to the resting position.
The action is the angular acceleration, with discrete values $a \in \{-1, 0, 1\}$ applied to the pendulum.
For more details on environment dynamics you can refer to the original paper.
The goal is to swing up the pendulum and maintain its upright angle. Hence, the reward is the negative absolute angle from the vertical position: $R_{t} = -|\beta_{t}|$
Furthermore, since the goal is to reach and maintain a vertical position, there are no terminations or episodes. Thus this problem can be formulated as a continuing task.
Similar to the Mountain Car task, the action in this pendulum environment is not strong enough to move the pendulum directly to the desired position. The agent must learn to first move the pendulum away from its desired position to gain enough momentum to successfully swing up the pendulum. Even after reaching the upright position, the agent must learn to continually balance the pendulum in this unstable position.
You will use the following packages in this assignment.
Please do not import other libraries — this will break the autograder.
1  # Do not modify this cell! 
In this section, we are going to build a tile coding class for our agent that will make it easier to make calls to our tile coder.
Tile coding is introduced in Section 9.5.4 of the textbook as a way to create features that can both provide good generalization and discrimination. We have already used it in our last programming assignment as well.
Similar to the last programming assignment, we are going to make a function specific to tile coding for our Pendulum Swing-Up environment. We will also use the Tiles3 library.
To get the tile coder working we need to:
1) create an index hash table using tc.IHT(),
2) scale the inputs for the tile coder based on the number of tiles and the range of values each input could take, and
3) call tc.tileswrap to get the active tiles back.
However, we need to make one small change to this tile coder.
Note that in this environment the state space contains the angle, which lies in $[-\pi, \pi]$. If we tile-code this state space in the usual way, the agent may think the value of states corresponding to an angle of $-\pi$ is very different from that of an angle of $\pi$, when in fact they are the same! To remedy this and allow generalization between angle $= -\pi$ and angle $= \pi$, we need to use the wrap tile coder.
The usage of the wrap tile coder is almost identical to the original tile coder, except that we also need to provide the wrapwidth argument for the dimension we want to wrap over (hence only for the angle, and None for the angular velocity). More details on the wrap tile coder are provided in the Tiles3 library.
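The part the tile coder needs from us is the input scaling, which can be sketched as below; the helper name and the num_tiles value are assumptions for illustration, and in the actual coder the scaled inputs would then be passed to tc.tileswrap along with the wrapwidth for the angle dimension.

```python
import numpy as np

# Illustrative input scaling for the wrap tile coder (names assumed).
def scale_inputs(angle, ang_vel, num_tiles=8):
    # angle in [-pi, pi] -> [0, num_tiles]
    angle_scaled = (angle + np.pi) / (2 * np.pi) * num_tiles
    # angular velocity in (-2*pi, 2*pi) -> [0, num_tiles]
    ang_vel_scaled = (ang_vel + 2 * np.pi) / (4 * np.pi) * num_tiles
    return angle_scaled, ang_vel_scaled

# With wrapwidth = num_tiles for the angle dimension (and None for the
# velocity), a scaled angle of 0 (angle = -pi) and of num_tiles
# (angle = pi) wrap onto the same tiles:
a_lo, _ = scale_inputs(-np.pi, 0.0)
a_hi, _ = scale_inputs(np.pi, 0.0)
print(a_lo, a_hi)  # 0.0 8.0
```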
1  #  
Run the following code to verify PendulumTileCoder
1  #  
Now that we have implemented PendulumTileCoder, let's create the agent that interacts with the environment. We will implement the same average reward Actor-Critic algorithm presented in the videos.
This agent has two components: an Actor and a Critic. The Actor learns a parameterized policy while the Critic learns a state-value function. The environment has discrete actions; your Actor implementation will use a softmax policy with exponentiated action-preferences. The Actor learns with the sample-based estimate of the gradient of the average reward objective. The Critic learns using the average reward version of the semi-gradient TD(0) algorithm.
In this section, you will be implementing agent_policy, agent_start, agent_step, and agent_end.
Let’s first define a couple of useful helper functions.
In this part you will implement compute_softmax_prob. This function computes the softmax probability for all actions, given actor weights actor_w and active tiles tiles. It will later be used in agent_policy to sample an appropriate action.
First, recall how the softmax policy is represented from state-action preferences: $\large \pi(a|s, \mathbf{\theta}) \doteq \frac{e^{h(s,a,\mathbf{\theta})}}{\sum_{b}e^{h(s,b,\mathbf{\theta})}}$.
The state-action preference is defined as $h(s,a,\mathbf{\theta}) \doteq \mathbf{\theta}^T \mathbf{x}_h(s,a)$.
Given active tiles tiles for state s, the state-action preference $\mathbf{\theta}^T \mathbf{x}_h(s,a)$ can be computed as actor_w[a][tiles].sum().
We will also use the exp-normalize trick in order to avoid possible numerical overflow.
Consider the following:
$\large \pi(a|s, \mathbf{\theta}) \doteq \frac{e^{h(s,a,\mathbf{\theta})}}{\sum_{b}e^{h(s,b,\mathbf{\theta})}} = \frac{e^{h(s,a,\mathbf{\theta}) - c}\, e^{c}}{\sum_{b}e^{h(s,b,\mathbf{\theta}) - c}\, e^{c}} = \frac{e^{h(s,a,\mathbf{\theta}) - c}}{\sum_{b}e^{h(s,b,\mathbf{\theta}) - c}}$
The softmax policy $\pi(\cdot|s, \mathbf{\theta})$ is shift-invariant: the policy remains the same when we subtract a constant $c \in \mathbb{R}$ from the state-action preferences.
Normally we use $c = \max_b h(s,b, \mathbf{\theta})$, to prevent any overflow due to exponentiating large numbers.
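Putting this together, one way to sketch compute_softmax_prob is shown below; the shape of actor_w (num_actions x iht_size) is an assumption consistent with the actor_w[a][tiles].sum() expression above.

```python
import numpy as np

# Illustrative compute_softmax_prob using the exp-normalize trick.
def compute_softmax_prob(actor_w, tiles):
    # state-action preferences h(s, a, theta) = theta^T x_h(s, a)
    preferences = np.array([actor_w[a][tiles].sum()
                            for a in range(actor_w.shape[0])])
    c = np.max(preferences)                # exp-normalize constant
    numerator = np.exp(preferences - c)    # now safe to exponentiate
    return numerator / numerator.sum()

# Reproduce state-action preferences [-1, 1, 2] with four active tiles:
actor_w = np.zeros((3, 16))
tiles = np.array([0, 1, 2, 3])
actor_w[0][tiles] = -0.25   # h(s, 0) = -1
actor_w[1][tiles] = 0.25    # h(s, 1) = 1
actor_w[2][tiles] = 0.5     # h(s, 2) = 2
print(np.round(compute_softmax_prob(actor_w, tiles), 4))  # [0.0351 0.2595 0.7054]
```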
1  #  
Run the following code to verify compute_softmax_prob.
We will test the method by building a softmax policy from state-action preferences [-1, 1, 2]. The sampling probabilities should then roughly match $[\frac{e^{-1}}{e^{-1}+e^{1}+e^{2}}, \frac{e^{1}}{e^{-1}+e^{1}+e^{2}}, \frac{e^{2}}{e^{-1}+e^{1}+e^{2}}] \approx [0.0351, 0.2595, 0.7054]$.
1  #  
softmax probability: [0.03511903 0.25949646 0.70538451]
Let's first define the methods that initialize the agent. agent_init() initializes all the variables that the agent will need.
Now that we have implemented the helper functions, let's create an agent. In this part, you will implement agent_start() and agent_step(). We do not need to implement agent_end() because there is no termination in our continuing task.
compute_softmax_prob() is used in agent_policy(), which in turn will be used in agent_start() and agent_step(). We have implemented agent_policy() for you.
When performing updates to the Actor and Critic, recall their respective updates from the Actor-Critic algorithm video.
We approximate $q_\pi$ in the Actor update using the one-step bootstrapped return ($R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, \mathbf{w})$) minus the current state-value ($\hat{v}(S_{t}, \mathbf{w})$), which is equivalent to the TD error $\delta$.
$\delta_t = R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_{t}, \mathbf{w}) \hspace{6em} (1)$
Average Reward update rule: $\bar{R} \leftarrow \bar{R} + \alpha^{\bar{R}}\delta \hspace{4.3em} (2)$
Critic weight update rule: $\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\delta\nabla \hat{v}(s,\mathbf{w}) \hspace{2.5em} (3)$
Actor weight update rule: $\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha^{\mathbf{\theta}}\delta\nabla \ln \pi(A|S,\mathbf{\theta}) \hspace{1.4em} (4)$
However, since we are using linear function approximation and parameterizing a softmax policy, the above update rule can be further simplified using:
$\nabla \hat{v}(s,\mathbf{w}) = \mathbf{x}(s) \hspace{14.2em} (5)$
$\nabla \ln \pi(A|S,\mathbf{\theta}) = \mathbf{x}_h(s,a) - \sum_b \pi(b|s, \mathbf{\theta})\mathbf{x}_h(s,b) \hspace{3.3em} (6)$
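Equations (1)-(6) can be sketched with tile-coded features as below; the variable names (actor_w, critic_w, the step sizes) and the calling convention are assumptions for illustration, not the assignment's exact interface.

```python
import numpy as np

# Hedged sketch of one Actor-Critic update with tile-coded features.
def actor_critic_update(actor_w, critic_w, avg_reward, tiles, action,
                        reward, next_tiles, softmax_prob,
                        alpha_r, alpha_w, alpha_theta):
    # (1) differential TD error using the current critic
    delta = (reward - avg_reward
             + critic_w[next_tiles].sum() - critic_w[tiles].sum())
    # (2) average reward estimate
    avg_reward += alpha_r * delta
    # (3),(5) critic: the gradient is 1 on active tiles, 0 elsewhere
    critic_w[tiles] += alpha_w * delta
    # (4),(6) actor: grad ln pi = x_h(s,a) - sum_b pi(b|s) x_h(s,b)
    for a in range(actor_w.shape[0]):
        indicator = 1.0 if a == action else 0.0
        actor_w[a][tiles] += alpha_theta * delta * (indicator - softmax_prob[a])
    return avg_reward, delta

# Tiny worked example: zero weights, reward -1, uniform policy over 3 actions
actor_w = np.zeros((3, 8))
critic_w = np.zeros(8)
probs = np.ones(3) / 3
avg_r, delta = actor_critic_update(actor_w, critic_w, 0.0,
                                   np.array([0, 1]), 1, -1.0,
                                   np.array([2, 3]), probs,
                                   0.1, 0.1, 0.1)
print(delta, avg_r)  # delta = -1.0, updated average reward estimate = -0.1
```

Note how (6) plays out: the taken action's active tiles receive $\delta(1 - \pi(a|s))$ while every other action's active tiles receive $-\delta\,\pi(b|s)$.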
1  #  
Run the following code to verify agent_start().
Although there is randomness due to self.rand_generator.choice() in agent_policy(), we control the seed, so your output should match the expected output.
1  #  
agent active_tiles: [0 1 2 3 4 5 6 7]
agent selected action: 2
Run the following code to verify agent_step()
1  #  
agent next_action: 1
agent avg reward: 0.03139092653589793
agent first 10 values of actor weights[0]: [0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0. 0. ]
agent first 10 values of actor weights[1]: [0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0.01307955 0. 0. ]
agent first 10 values of actor weights[2]: [0.02615911 0.02615911 0.02615911 0.02615911 0.02615911 0.02615911 0.02615911 0.02615911 0. 0. ]
agent first 10 values of critic weights: [0.39238658 0.39238658 0.39238658 0.39238658 0.39238658 0.39238658 0.39238658 0.39238658 0. 0. ]
Now that we’ve implemented all the components of environment and agent, let’s run an experiment!
We want to see whether our agent is successful at learning the optimal policy of balancing the pendulum upright. We will plot total return over time, as well as the exponential average of the reward over time. We also do multiple runs in order to be confident about our results.
The experiment/plot code is provided in the cell below.
1  #  
We will first test our implementation using 32 tilings, each of size 8x8. We saw in the earlier assignment on tile coding that many tilings promote fine discrimination, while broad tiles allow more generalization.
We conducted a wide sweep of metaparameters in order to find the best metaparameters for our Pendulum Swing-Up task. We swept over the following ranges; the best value of each is boldfaced below:
actor step size: $\{\frac{2^{-6}}{32}, \frac{2^{-5}}{32}, \frac{2^{-4}}{32}, \frac{2^{-3}}{32}, \mathbf{\frac{2^{-2}}{32}}, \frac{2^{-1}}{32}, \frac{2^{0}}{32}, \frac{2^{1}}{32}\}$
critic step size: $\{\frac{2^{-4}}{32}, \frac{2^{-3}}{32}, \frac{2^{-2}}{32}, \frac{2^{-1}}{32}, \frac{2^{0}}{32}, \mathbf{\frac{2^{1}}{32}}, \frac{3}{32}, \frac{2^{2}}{32}\}$
avg reward step size: $\{2^{-11}, 2^{-10}, 2^{-9}, 2^{-8}, 2^{-7}, \mathbf{2^{-6}}, 2^{-5}, 2^{-4}, 2^{-3}, 2^{-2}\}$
We will do 50 runs using the above best metaparameter setting to verify your agent.
Note that running the experiment cell below will take _approximately 5 min_.
1  #  
Run the following code to verify your experimental result.
1  #  
To evaluate performance, we plotted both the return and exponentially weighted average reward over time.
In the first plot, the return is negative because the reward is negative in every state except when the pendulum is in the upright position. As the policy improves over time, the agent accumulates less negative reward, and thus the return decreases more slowly. Towards the end the slope is almost flat, indicating the policy has stabilized to a good one. When using this plot, however, it can be difficult to tell whether the agent has learned an optimal policy. The near-optimal policy in this Pendulum Swing-Up environment is to maintain the pendulum in the upright position indefinitely, getting near-0 reward at each time step. We would have to examine the slope of the curve, but it can be hard to compare the slopes of different curves.
The second plot using exponential average reward gives a better visualization. We can see that towards the end the value is near 0, indicating it is getting near 0 reward at each time step. Here, the exponentially weighted average reward shouldn’t be confused with the agent’s internal estimate of the average reward. To be more specific, we used an exponentially weighted average of the actual reward without initial bias (Refer to Exercise 2.7 from the textbook (p.35) to read more about removing the initial bias). If we used sample averages instead, later rewards would have decreasing impact on the average and would not be able to represent the agent’s performance with respect to its current policy effectively.
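The bias-removal scheme from Exercise 2.7 can be sketched in a few lines; the step size alpha below is an assumption for illustration, not the value used for the plots above.

```python
# Exponentially weighted average without initial bias (Exercise 2.7 sketch).
def unbiased_exp_avg(rewards, alpha=0.01):
    avg, o = 0.0, 0.0
    out = []
    for r in rewards:
        o = o + alpha * (1 - o)     # o converges to 1, removing the bias toward 0
        beta = alpha / o            # effective step size, starts at exactly 1
        avg = avg + beta * (r - avg)
        out.append(avg)
    return out

# The first output equals the first reward exactly (no bias toward 0):
print(unbiased_exp_avg([5.0, 5.0, 5.0]))  # [5.0, 5.0, 5.0]
```

Because the first effective step size is 1, the average starts at the first observed reward rather than being dragged toward the zero initialization, while later rewards still receive the fixed recency weighting that makes the curve track the current policy.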
It is easier to see whether the agent has learned a good policy in the second plot than the first plot. If the learned policy is optimal, the exponential average reward would be close to 0.
Furthermore, how did we pick the best metaparameters from the sweeps? A common method is to pick the metaparameters that result in the largest Area Under the Curve (AUC). However, this is not always what we want. We want to find a set of metaparameters that learns a good final policy. When using AUC as the criterion, we may pick metaparameters that allow the agent to learn fast but converge to a worse policy. In our case, we selected the metaparameter setting that obtained the highest exponentially weighted average reward over the last 5000 time steps.
In addition to finding the best metaparameters, it is equally important to plot parameter sensitivity curves to understand how our algorithm behaves.
In our simulated Pendulum problem, we can extensively test our agent with different metaparameter configurations, but it would be quite expensive to do so in real life. Parameter sensitivity curves can provide insight into how our algorithm might behave in general. They can help us identify a good range for each metaparameter as well as how sensitive the performance is to each metaparameter.
Here are the sensitivity curves for the three step sizes we swept over:
On the y-axis we use the performance measure, which is the exponentially weighted average reward over the 5000 time steps, averaged over 50 different runs. On the x-axis is the metaparameter we are testing. For each value of the given metaparameter, the remaining metaparameters are chosen to obtain the best performance.
The curves are quite rounded, indicating the agent performs well over this wide range of values and is not too sensitive to these metaparameters. Furthermore, looking at the y-axis values, we can observe that performance is less sensitive to the average reward step size than to the actor and critic step sizes.
But how do we know that we have sufficiently covered a wide range of metaparameters? It is important that the best value is not on the edge but in the middle of the metaparameter sweep range in these sensitivity curves. Otherwise this may indicate that there could be better metaparameter values that we did not sweep over.
You have implemented your own Average Reward Actor-Critic with Softmax Policy agent in the Pendulum Swing-Up environment. You implemented the environment based on information about the state/action space and transition dynamics. Furthermore, you have learned how to implement an agent in a continuing task using the average reward formulation. We parameterized the policy using a softmax over action-preferences for discrete action spaces, and used Actor-Critic to learn the policy.
To summarize, you have learned how to:
1. Implement a softmax actor-critic agent on a continuing task using the average reward formulation.
2. Understand how to parameterize the policy as a function to learn, in a discrete action environment.
3. Understand how to (approximately) sample the gradient of this objective to update the actor.
4. Understand how to update the critic using the differential TD error.
Welcome to Assignment 3. In this notebook you will learn how to:
As with the rest of the notebooks, do not import additional libraries or adjust grading cells, as this will break the grader.
MAKE SURE TO RUN ALL OF THE CELLS SO THE GRADER GETS THE OUTPUT IT NEEDS
1  # Import Necessary Libraries 
In the above cell, we import the libraries we need for this assignment. You may have noticed that we import mountaincar_env. This is the Mountain Car task introduced in Section 10.1 of the textbook. The task is for an underpowered car to make it to the top of a hill:
The car is underpowered, so the agent needs to learn to rock back and forth to get enough momentum to reach the goal. At each time step the agent receives from the environment its current velocity (a float between -0.07 and 0.07) and its current position (a float between -1.2 and 0.5). Because our state is continuous, there are potentially infinitely many states that our agent could be in. We need a function approximation method to help the agent deal with this. In this notebook we will use tile coding. We provide a tile coding implementation for you to use, imported above with tiles3.
To begin we are going to build a tile coding class for our Sarsa agent that will make it easier to make calls to our tile coder.
Tile coding is introduced in Section 9.5.4 of the textbook as a way to create features that can both provide good generalization and discrimination. It consists of multiple overlapping tilings, where each tiling is a partitioning of the space into tiles.
To help keep our agent code clean we are going to make a function specific for tile coding for our Mountain Car environment. To help we are going to use the Tiles3 library. This is a Python 3 implementation of the tile coder. To start take a look at the documentation: Tiles3 documentation
To get the tile coder working we need to implement a few pieces:
1  #  
1  #  
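As a simplified, self-contained illustration of the idea (this is not the Tiles3 library, which hashes tile coordinates into an index hash table; the fixed diagonal tiling offsets here are a naive assumption):

```python
import numpy as np

# Simplified grid tile coder for Mountain Car's two state variables.
# Illustrative only: the assignment uses the Tiles3 library instead.
POS_MIN, POS_MAX = -1.2, 0.5
VEL_MIN, VEL_MAX = -0.07, 0.07

def get_tiles(position, velocity, num_tilings=8, tiles_per_dim=8):
    """Return one active tile index per tiling for a (position, velocity) pair."""
    # Scale each state variable to the range [0, tiles_per_dim]
    pos_scaled = (position - POS_MIN) / (POS_MAX - POS_MIN) * tiles_per_dim
    vel_scaled = (velocity - VEL_MIN) / (VEL_MAX - VEL_MIN) * tiles_per_dim
    active_tiles = []
    for t in range(num_tilings):
        # Each tiling is shifted by a fraction of a tile width
        offset = t / num_tilings
        p = min(int(pos_scaled + offset), tiles_per_dim - 1)
        v = min(int(vel_scaled + offset), tiles_per_dim - 1)
        # Flatten (tiling, row, column) into a single index
        active_tiles.append(t * tiles_per_dim ** 2 + p * tiles_per_dim + v)
    return active_tiles
```

Nearby states activate overlapping tile sets, which is what gives tile coding its generalization.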
We are now going to use the functions that we just created to implement the Sarsa algorithm. Recall from class that Sarsa stands for State, Action, Reward, State, Action.
For this case we have given you an argmax function similar to the one you wrote back in Course 1 Assignment 1. Recall, this is different from the argmax function used by numpy, which returns the first index of a maximum value. We want our argmax function to break ties arbitrarily, which is what the imported argmax function does. The given argmax function takes in an array of values and returns an int, the chosen action:
argmax(action_values)
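A tie-breaking argmax along these lines might look like (a sketch; the notebook imports its own version):

```python
import numpy as np

def argmax(action_values, rand_generator=None):
    """Return an index of the maximum value, breaking ties uniformly at
    random (np.argmax, by contrast, always returns the first maximal index)."""
    if rand_generator is None:
        rand_generator = np.random
    action_values = np.asarray(action_values)
    # Indices of all entries tied for the maximum
    ties = np.flatnonzero(action_values == action_values.max())
    return int(rand_generator.choice(ties))
```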
There are multiple ways that we can deal with actions for the tile coder. Here we are going to use one simple method: make the size of the weight vector equal to (num_actions, iht_size). This will give us one weight vector for each action and one weight for each tile.
Use the above function to help fill in select_action, agent_start, agent_step, and agent_end.
Hints:
1) The tile coder returns a list of active indices (e.g. [1, 12, 22]). You can index a numpy array using an array of indices; this will return an array of the values at each of those indices. So in order to get the value of a state we can index our weight vector using the action and the array of tiles that the tile coder returns:
1 

1  #  
action distribution: [ 29. 35. 936.]
1  #  
RUN: 0
RUN: 5
Run time: 13.416615009307861
The learning curve of your agent should look similar to ours, though it will not look exactly the same. If there are some spiky points, that is okay. Due to stochasticity, a few episodes may have taken much longer, causing some spikes in the plot. The trend of the line should be similar, though, generally decreasing to about 200 steps per run.
This result was using 8 tilings with 8x8 tiles on each. Let’s see if we can do better, and what different tilings look like. We will also test 2 tilings of 16x16 and 32 tilings of 4x4. These three choices produce the same number of features (512), but distributed quite differently.
1  #  
RUN: 0
RUN: 5
RUN: 10
RUN: 15
stepsize: 0.25
Run Time: 71.2762451171875
RUN: 0
RUN: 5
RUN: 10
RUN: 15
stepsize: 0.015625
Run Time: 38.23225665092468
RUN: 0
RUN: 5
RUN: 10
RUN: 15
stepsize: 0.0625
Run Time: 42.84800481796265
<matplotlib.legend.Legend at 0x7fbda7e76910>
Here we can see that using 32 tilings with 4x4 tiles does a little better than 8 tilings with 8x8 tiles. Both seem to do much better than using 2 tilings with 16x16 tiles.
Congratulations! You have learned how to implement a control agent using function approximation. In this notebook you learned how to:
Welcome to Course 3 Programming Assignment 2. In the previous assignment, you implemented semi-gradient TD with State Aggregation for solving a policy evaluation task. In this assignment, you will implement semi-gradient TD with a simple neural network and use it for the same policy evaluation problem.
You will implement an agent to evaluate a fixed policy on the 500-State Random Walk. As you may remember from the previous assignment, the 500-State Random Walk includes 500 states. Each episode begins with the agent at the center and terminates when the agent goes far left beyond state 1 or far right beyond state 500. At each time step, the agent selects to move either left or right with equal probability. The environment determines how far the agent moves in the selected direction.
In this assignment, you will:
We import the following libraries that are required for this assignment:
1  # Do not modify this cell! 
In this section, you will implement an Agent that learns with semi-gradient TD with a neural network. You will use a neural network with one hidden layer. The input of the neural network is the one-hot encoding of the state number. We use the one-hot encoding of the state number instead of the state number itself because we do not want to build in the prior knowledge that integer inputs close to each other have similar values. The hidden layer contains 100 rectified linear units (ReLUs), which pass their input through if it is greater than zero and return 0 otherwise. ReLU gates are commonly used in neural networks due to their nice properties, such as the sparsity of the activation and non-vanishing gradients. The output of the neural network is the estimated state value. It is a linear function of the hidden units, as is commonly the case when estimating the value of a continuous target using neural networks.
The neural network looks like this:
For a given input $s$, the value of $s$ is computed by:
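Reconstructed from the architecture described above (one-hot input $s$, one ReLU hidden layer, linear output), the forward pass is:

$$\psi = s W^{[0]} + b^{[0]}, \qquad x = \max(0, \psi), \qquad v = x W^{[1]} + b^{[1]}$$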
where $W^{[0]}$, $b^{[0]}$, $W^{[1]}$, $b^{[1]}$ are the parameters of the network and will be learned when training the agent.
Before implementing the agent, you first implement some helper functions which you will later use in agent’s main methods.
get_value()
First, you will implement the get_value() method, which feeds an input $s$ into the neural network and returns the output of the network, $v$, according to the equations above. To implement get_value(), take into account the following notes:

- get_value() gets the one-hot encoded state number, denoted by s, as input.
- get_value() receives the weights of the neural network as input, denoted by weights and structured as an array of dictionaries. Each dictionary corresponds to the weights from one layer of the neural network to the next, and includes $W$ and $b$. The shapes of the elements in weights are as follows:
- The input of the neural network is a sparse vector. To make computation faster, we take advantage of input sparsity. To do so, we provide a helper method, my_matmul(). Make sure that you use my_matmul() for all matrix multiplications, except for element-wise multiplications, in this notebook.
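Putting the notes above together, here is a sketch of what get_value() computes, using plain np.dot in place of the provided my_matmul() (which is a sparse-aware equivalent):

```python
import numpy as np

def get_value(s, weights):
    """Forward pass sketch: one ReLU hidden layer, linear output.
    s: one-hot state vector of shape (1, num_states).
    weights: list of dicts with keys "W" and "b" per layer.
    (np.dot stands in for the notebook's sparse-aware my_matmul.)"""
    psi = np.dot(s, weights[0]["W"]) + weights[0]["b"]  # hidden pre-activation
    x = np.maximum(0, psi)                              # ReLU
    v = np.dot(x, weights[1]["W"]) + weights[1]["b"]    # scalar state value
    return v
```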
1  def my_matmul(x1, x2): 
1  #  
Run the following code to test your implementation of the get_value() function:
1  #  
Estimated value: [[0.21915705]]
Expected output:
Estimated value: [[0.21915705]]
get_gradient()
You will also implement the get_gradient() method, which computes the gradient of the value function for a given input, using backpropagation. You will later use this function to update the value function.
As you know, we compute the value of a state $s$ according to:
To update the weights of the neural network ($W^{[0]}$, $b^{[0]}$, $W^{[1]}$, $b^{[1]}$), we compute the gradient of $v$ with respect to the weights according to:
where $\odot$ denotes elementwise matrix multiplication and $I_{x>0}$ is the gradient of the ReLU activation function which is an indicator whose $i$th element is 1 if $x[i]>0$ and 0 otherwise.
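Combining these pieces, here is a backpropagation sketch for this one-hidden-layer network, hand-derived from the forward pass (not necessarily identical to the notebook's solution):

```python
import numpy as np

def get_gradient(s, weights):
    """Gradient of v with respect to each weight, via backpropagation.
    s: (1, num_states) one-hot input; weights: list of {"W", "b"} dicts."""
    psi = np.dot(s, weights[0]["W"]) + weights[0]["b"]  # hidden pre-activation
    x = np.maximum(0, psi)                              # hidden activations
    relu_grad = (psi > 0).astype(float)                 # I_{x>0}: 1 where ReLU passes
    grads = [dict(), dict()]
    grads[1]["W"] = x.T                                 # dv/dW1
    grads[1]["b"] = np.ones((1, 1))                     # dv/db1
    back = weights[1]["W"].T * relu_grad                # W1^T (elementwise) I_{x>0}
    grads[0]["W"] = np.dot(s.T, back)                   # dv/dW0
    grads[0]["b"] = back                                # dv/db0
    return grads
```

Because s is one-hot, grads[0]["W"] has a single nonzero row, matching the expected output shown below.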
1  #  
Run the following code to test your implementation of the get_gradient() function:
1  #  
Expected output:
grads[0]["W"]
[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0.76103773 0.12167502]
 [0. 0.]]
grads[0]["b"]
[[0.76103773 0.12167502]]
grads[1]["W"]
[[0.69198983]
 [0.82403662]]
grads[1]["b"]
[[1.]]
In this section, you will implement the stochastic gradient descent (SGD) method for state-value prediction. Here is the basic SGD update for state-value prediction with TD:

At each time step, we update the weights in the direction $g_t = \delta_t \nabla \hat{v}(S_t,\mathbf{w_t})$ using a fixed step-size $\alpha$. $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w_{t}}) - \hat{v}(S_t,\mathbf{w_t})$ is the TD error. $\nabla \hat{v}(S_t,\mathbf{w_{t}})$ is the gradient of the value function with respect to the weights.
The following cell includes the SGD class. You will complete the update_weights() method of SGD, assuming that the weights and the update g are provided.
As you know, in this assignment, we structured the weights as an array of dictionaries. Note that the update $g_t$, in the case of TD, is $\delta_t \nabla \hat{v}(S_t,\mathbf{w_t})$. As a result, $g_t$ has the same structure as $\nabla \hat{v}(S_t,\mathbf{w_t})$, which is also an array of dictionaries.
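A sketch of such an update over the array-of-dictionaries structure (written as a standalone function here; the notebook's version is a method of the SGD class):

```python
import numpy as np

def update_weights(weights, g, step_size):
    """One SGD step, w <- w + step_size * g, matching the array-of-dicts layout."""
    for layer_w, layer_g in zip(weights, g):
        for param in layer_w:  # param is "W" or "b"
            layer_w[param] = layer_w[param] + step_size * layer_g[param]
    return weights
```

Note the update is added (not subtracted), since $g_t = \delta_t \nabla \hat{v}$ already points in the direction of improvement.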
1  #  
Run the following code to test your implementation of the update_weights() function:
1  #  
Expected output:
updated_weights[0]["W"]
[[ 1.17899492 0.53656321]
 [ 0.58008221 1.47666572]
 [ 1.01909411 1.10248056]
 [ 0.72490408 0.06828853]
 [0.20609725 0.69034095]]
updated_weights[0]["b"]
[[0.18484533 0.92844539]]
updated_weights[1]["W"]
[[0.70488257]
 [0.58150878]]
updated_weights[1]["b"]
[[0.88467086]]
In this assignment, instead of using SGD for updating the weights, we use a more advanced algorithm called Adam. The Adam algorithm improves the SGD update with two concepts: adaptive vector step-sizes and momentum. It keeps estimates of the mean and second moment of the updates, denoted by $\mathbf{m}$ and $\mathbf{v}$ respectively:
Given that $\mathbf{m}$ and $\mathbf{v}$ are initialized to zero, they are biased toward zero. To get unbiased estimates of the mean and second moment, Adam defines $\mathbf{\hat{m}}$ and $\mathbf{\hat{v}}$ as:
The weights are then updated as follows:
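For reference, the standard Adam updates (with decay rates $\beta_m$, $\beta_v$, step-size $\alpha$, and small constant $\epsilon$) are:

$$\mathbf{m}_t = \beta_m \mathbf{m}_{t-1} + (1-\beta_m)\, g_t, \qquad \mathbf{v}_t = \beta_v \mathbf{v}_{t-1} + (1-\beta_v)\, g_t^2$$

$$\mathbf{\hat{m}}_t = \frac{\mathbf{m}_t}{1-\beta_m^t}, \qquad \mathbf{\hat{v}}_t = \frac{\mathbf{v}_t}{1-\beta_v^t}$$

$$\mathbf{w}_t = \mathbf{w}_{t-1} + \frac{\alpha}{\sqrt{\mathbf{\hat{v}}_t}+\epsilon}\, \mathbf{\hat{m}}_t$$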
When implementing the agent you will use the Adam algorithm instead of SGD because it is more efficient. We have already provided you the implementation of the Adam algorithm in the cell below. You will use it when implementing your agent.
1  #  
In this section, you will implement agent_init(), agent_start(), agent_step(), and agent_end().

In agent_init(), you will:
This initialization heuristic is commonly used with ReLU gates and helps keep the output of a neuron from getting too big or too small. To initialize the network’s parameters, use self.rand_generator.normal(), which draws random samples from a normal distribution. The parameters of self.rand_generator.normal() are the mean of the distribution, the standard deviation of the distribution, and the output shape as a tuple of integers.
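As an illustrative sketch (the exact scale factor is an assumption here, not necessarily the notebook's), a He-style initialization using rand_generator.normal() might look like:

```python
import numpy as np

# Illustrative He-style initialization: parameters drawn from a normal
# distribution whose variance shrinks with the number of inputs to the layer.
# The sqrt(2 / ins) scale is an assumption, not the notebook's exact recipe.
rand_generator = np.random.RandomState(0)
layer_size = [5, 2, 1]  # input, hidden, output units

weights = []
for i in range(len(layer_size) - 1):
    ins, outs = layer_size[i], layer_size[i + 1]
    weights.append({
        "W": rand_generator.normal(0, np.sqrt(2 / ins), (ins, outs)),
        "b": rand_generator.normal(0, np.sqrt(2 / ins), (1, outs)),
    })
```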
In agent_start(), you will:

In agent_step() and agent_end(), you will:
- Encode states using the one_hot() method that we provided below. You feed the one-hot encoded state number to the neural network using the get_value() method that you implemented above. Note that the one_hot() method returns the one-hot encoding of a state as a numpy array of shape (1, num_states).
- Compute gradients using the get_gradient() function that you implemented.
- Select actions using the agent_policy() method. (only in agent_step())

1  #  
1  #  
Run the following code to test your implementation of the agent_init() function:
1  #  
layer_size: [5 2 1]
Expected output:
layer_size: [5 2 1]
weights[0]["W"] shape: (5, 2)
weights[0]["b"] shape: (1, 2)
weights[1]["W"] shape: (2, 1)
weights[1]["b"] shape: (1, 1)

weights[0]["W"]
[[ 1.11568467 0.25308164]
 [ 0.61900825 1.4172653 ]
 [ 1.18114738 0.6180848 ]
 [ 0.60088868 0.0957267 ]
 [0.06528133 0.25968529]]
weights[0]["b"]
[[0.09110115 0.91976332]]
weights[1]["W"]
[[0.76103773]
 [0.12167502]]
weights[1]["b"]
[[0.44386323]]
Run the following code to test your implementation of the agent_start() function:
1  #  
Expected output:
Agent state: 250
Agent selected action: 1
Run the following code to test your implementation of the agent_step() function:
1  #  
Expected output:
updated_weights[0]["W"]
[[ 1.10893459 0.30763738]
 [ 0.63690565 1.14778865]
 [ 1.23397791 0.48152743]
 [ 0.72792093 0.15829832]
 [ 0.15021996 0.39822163]]
updated_weights[0]["b"]
[[0.29798822 0.96254535]]
updated_weights[1]["W"]
[[0.76628754]
 [0.11486511]]
updated_weights[1]["b"]
[[0.58530057]]

Agent last state: 1
Agent last action: 1
Run the following code to test your implementation of the agent_end() function:
1  #  
Expected output:
updated_weights[0]["W"]
[[ 1.10893459 0.30763738]
 [ 0.63690565 1.14778865]
 [ 1.17531054 0.51043162]
 [ 0.75062903 0.13736817]
 [ 0.15021996 0.39822163]]
updated_weights[0]["b"]
[[0.30846523 0.95937346]]
updated_weights[1]["W"]
[[0.68861703]
 [0.15986364]]
updated_weights[1]["b"]
[[0.586074]]
Now that you implemented the agent, we can run the experiment. Similar to Course 3 Programming Assignment 1, we will plot the learned state value function and the learning curve of the TD agent. To plot the learning curve, we use Root Mean Squared Value Error (RMSVE).
We have already provided you the experiment/plot code, so you can go ahead and run the two cells below.
Note that running the cell below will take approximately 12 minutes.
1  #  
Setting - Neural Network with 100 hidden units
'/home/jovyan/work/release/TDNN/results.zip'
You plotted the learning curve for 1000 episodes. As you can see the RMSVE is still decreasing. Here we provide the precomputed result for 5000 episodes and 20 runs so that you can see the performance of semigradient TD with a neural network after being trained for a long time.
Does semigradient TD with a neural network find a good approximation within 5000 episodes?
As you may remember from the previous assignment, semigradient TD with 10state aggregation converged within 100 episodes. Why is TD with a neural network slower?
Would it be faster if we decrease the number of hidden units? Or what about if we increase the number of hidden units?
In this section, we compare the performance of semi-gradient TD with a neural network and semi-gradient TD with tile coding. Tile coding is a kind of coarse coding that uses multiple overlapping partitions of the state space to produce features. For tile coding, we used 50 tilings, each with 6 tiles. We set the step-size for semi-gradient TD with tile coding to $\frac{0.1}{\text{tilings}}$. See the figure below for the comparison between semi-gradient TD with tile coding and semi-gradient TD with a neural network and the Adam algorithm. This result is for 5000 episodes and 20 runs:
How are the results?
Semi-gradient TD with tile coding is much faster than semi-gradient TD with a neural network. Why?
Which method has a lower RMSVE at the end of 5000 episodes?
You have successfully implemented Course 3 Programming Assignment 2.

You have implemented semi-gradient TD with a neural network and the Adam algorithm on the 500-state Random Walk.

You also compared semi-gradient TD with a neural network and semi-gradient TD with tile coding.

From the experiments and lectures, you should be more familiar with some of the strengths and weaknesses of using neural networks as the function approximator for an RL agent. On one hand, neural networks are powerful function approximators capable of representing a wide class of functions, and capable of producing features without relying exclusively on hand-crafted mechanisms. On the other hand, compared to a linear function approximator with tile coding, neural networks can be less sample efficient. When implementing your own reinforcement learning agents, you may consider these strengths and weaknesses to choose the proper function approximator for your problems.
Welcome to your Course 3 Programming Assignment 1. In this assignment, you will implement semi-gradient TD(0) with State Aggregation in an environment with a large state space. This assignment will focus on the policy evaluation task (the prediction problem), where the goal is to accurately estimate state values under a given (fixed) policy.
In this assignment, you will:
Note: You can create new cells for debugging purposes, but please do not duplicate any Read-only cells. This may break the grader.
In this assignment, we will implement and use a smaller 500 state version of the problem we covered in lecture (see “State Aggregation with Monte Carlo”, and Example 9.1 in the textbook). The diagram below illustrates the problem.
There are 500 states numbered from 1 to 500, left to right, and all episodes begin with the agent located at the center, in state 250. For simplicity, we will consider state 0 and state 501 as the left and right terminal states respectively.
The episode terminates when the agent reaches the terminal state (state 0) on the left, or the terminal state (state 501) on the right. Termination on the left (state 0) gives the agent a reward of -1, and termination on the right (state 501) gives the agent a reward of +1.
The agent can take one of two actions: go left or go right. If the agent chooses the left action, it transitions uniformly at random into one of the 100 neighboring states to its left. If the agent chooses the right action, it transitions uniformly at random into one of the 100 neighboring states to its right.
States near the edge may have fewer than 100 neighboring states on that side. In this case, all transitions that would have taken the agent past the edge result in termination. If the agent takes the left action from state 50, then it has a 0.5 chance of terminating on the left. If it takes the right action from state 499, then it has a 0.99 chance of terminating on the right.
For this assignment, we will consider the problem of policy evaluation: estimating the state-value function for a fixed policy. You will evaluate a uniform random policy in the 500-State Random Walk environment. This policy takes the right action with 0.5 probability and the left with 0.5 probability, regardless of which state it is in.
This environment has a relatively large number of states. Generalization can significantly speed learning as we will show in this assignment. Often in realistic environments, states are highdimensional and continuous. For these problems, function approximation is not just useful, it is also necessary.
You will use the following packages in this assignment.
Please do not import other libraries  this will break the autograder.
1  import numpy as np 
In this section we have provided you with the implementation of the 500-State Random Walk Environment. It is useful to know how the environment is implemented. We will also use this environment in the next programming assignment.
Once the agent chooses which direction to move, the environment determines how far the agent is moved in that direction. Assume the agent passes either 0 (indicating left) or 1 (indicating right) to the environment.
Methods needed to implement the environment are: env_init, env_start, and env_step.

- env_init: This method sets up the environment at the very beginning of the experiment. Relevant parameters are passed through the env_info dictionary.
- env_start: This is the first method called when the experiment starts, returning the start state.
- env_step: This method takes in the action and returns the reward, next_state, and is_terminal.

1  #  
Now let’s create the Agent that interacts with the Environment.
You will create an Agent that learns with semigradient TD(0) with state aggregation.
For state aggregation, if the resolution (num_groups) is 10, then the 500 states are partitioned into 10 groups of 50 states each (i.e., states 1-50 are one group, states 51-100 are another, and so on).
Hence, 50 states would share the same feature and value estimate, and there would be 10 distinct features. The feature vector for each state is a one-hot vector of length 10, with a single one indicating the group for that state.
Before we implement the agent, we need to define a couple of useful helper functions.
Please note that all random method calls should go through the random number generator. Also, do not use random method calls unless specified. In the agent, only agent_policy() requires random method calls.
In this part we have implemented agent_policy() for you. This method is used in agent_start() and agent_step() to select an appropriate action.

Normally, the agent would act differently depending on the state, but in this environment the agent chooses randomly to move either left or right with equal probability.
Agent returns 0 for left, and 1 for right.
1  #  
In this part you will implement get_state_feature(), which takes in a state and returns the aggregated feature (one-hot vector) of that state.

The feature vector size is determined by num_groups. Use state and num_states_in_group to determine which element in the feature vector is active.

get_state_feature() is necessary whenever the agent receives a state and needs to convert it to a feature for learning. The features will thus be used in agent_step() and agent_end() when the agent updates its state values.
1  #  
Run the following code to verify your get_state_feature() function.
1  #  
1st group: [1. 0. 0. 0. 0.]
2nd group: [0. 1. 0. 0. 0.]
3rd group: [0. 0. 1. 0. 0.]
4th group: [0. 0. 0. 1. 0.]
5th group: [0. 0. 0. 0. 1.]
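A feature function consistent with the grouping described above can be sketched as follows (an illustrative sketch, assuming states are numbered from 1 and the groups divide evenly):

```python
import numpy as np

def get_state_feature(state, num_groups, num_states):
    """One-hot aggregated feature: states 1..num_states are split into
    num_groups equally sized groups (sketch, assuming an even split)."""
    num_states_in_group = num_states // num_groups
    feature = np.zeros(num_groups)
    feature[(state - 1) // num_states_in_group] = 1.0  # states are 1-indexed
    return feature
```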
Now that we have implemented all the helper functions, let’s create an agent. In this part, you will implement agent_init(), agent_start(), agent_step(), and agent_end(). You will have to use the agent_policy() that we implemented above. We will implement agent_message() later, when returning the learned state values.

To save computation time, we precompute features for all states beforehand in agent_init(). The precomputed features are saved in the self.all_state_features numpy array. Hence, you do not need to call get_state_feature() every time in agent_step() and agent_end().
The shape of the self.all_state_features numpy array is (num_states, feature_size), with features of states from State 1-500. Note that index 0 stores the features for State 1 (features for State 0 do not exist). Use self.all_state_features to access the feature vector for each state.
When saving state values in the agent, recall how the state values are represented with linear function approximation.
State Value Representation: $\hat{v}(s,\mathbf{w}) = \mathbf{w}\cdot\mathbf{x^T}$ where $\mathbf{w}$ is a weight vector and $\mathbf{x}$ is the feature vector of the state.
When performing TD(0) updates with Linear Function Approximation, recall how we perform semigradient TD(0) updates using supervised learning.
semi-gradient TD(0) Weight Update Rule: $\mathbf{w_{t+1}} = \mathbf{w_{t}} + \alpha [R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})] \nabla \hat{v}(S_t,\mathbf{w})$
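With one-hot features, this update rule can be sketched directly in numpy (variable names here are illustrative, not the notebook's):

```python
import numpy as np

def td_update(w, x, x_next, reward, discount, step_size, terminal=False):
    """Semi-gradient TD(0) step with linear function approximation:
    v(s) = w . x, so the gradient of v with respect to w is just x."""
    v = np.dot(w, x)
    v_next = 0.0 if terminal else np.dot(w, x_next)  # terminal states have value 0
    delta = reward + discount * v_next - v           # TD error
    return w + step_size * delta * x                 # gradient is the feature vector
```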
1  #  
Run the following code to verify agent_init()
1  #  
num_states: 500
num_groups: 10
step_size: 0.1
discount_factor: 1.0
weights shape: (10,)
weights init. value: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Run the following code to verify agent_start(). Although there is randomness due to rand_generator.choice() in agent_policy(), we control the seed so your output should match the expected output. Make sure rand_generator.choice() is called only once per agent_policy() call.
1  #  
Agent state: 250
Agent selected action: 1
Run the following code to verify agent_step()
1  #  
Updated weights: [0.26 0.5 1. 0.5 1.5 0.5 1.5 0. 0.5 1. ]
last state: 120
last action: 1
Run the following code to verify agent_end()
1  #  
Updated weights: [0.35 0.5 1. 0.5 1.5 0.5 1.5 0. 0.5 1. ]
Expected output: (Note only the 1st element was changed, and the result is different from agent_step()
)
Initial weights: [1.5 0.5 1. 0.5 1.5 0.5 1.5 0. 0.5 1. ]
Updated weights: [0.35 0.5 1. 0.5 1.5 0.5 1.5 0. 0.5 1. ]
You are almost done! Now let’s implement a code block in agent_message() that returns the learned state values.

The method agent_message() will return the learned state_value array when message == 'get state value'.

Hint: Think about how state values are represented with linear function approximation. The state_value array will be a 1D array with length equal to the number of states.
1  %%add_to TDAgent 
Run the following code to verify get_state_val()
1  #  
State value shape: (20,)
Initial State value for all states: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Expected Output:
State value shape: (20,)
Initial State value for all states: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Now that we’ve implemented all the components of environment and agent, let’s run an experiment! We will plot two things: (1) the learned state value function and compare it against the true state values, and (2) a learning curve depicting the error in the learned value estimates over episodes. For the learning curve, what should we plot to see if the agent is learning well?
Recall that the Prediction Objective in function approximation is the Mean Squared Value Error $\overline{VE}(\mathbf{w}) \doteq \sum\limits_{s \in \mathcal{S}}\mu(s)[v_\pi(s)-\hat{v}(s,\mathbf{w})]^2$
We will use the square root of this measure, the root $\overline{VE}$ to give a rough measure of how much the learned values differ from the true values.
calc_RMSVE() computes the Root Mean Squared Value Error given the learned state values $\hat{v}(s, \mathbf{w})$.

We provide you with the true state values $v_\pi(s)$ and the state distribution $\mu(s)$.
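The RMSVE computation itself is short; here is a sketch (with hypothetical argument names):

```python
import numpy as np

def calc_rmsve(true_values, learned_values, state_distribution):
    """Root Mean Squared Value Error, weighted by the state distribution mu(s)."""
    squared_errors = (true_values - learned_values) ** 2
    return np.sqrt(np.sum(state_distribution * squared_errors))
```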
1  #  
We have provided you the experiment/plot code in the cell below.
1  #  
We will first test our implementation using state aggregation with resolution of 10, with three different step sizes: {0.01, 0.05, 0.1}.
Note that running the experiment cell below will take _approximately 5 min_.
1  #  
Setting - num. agg. states: 10, step_size: 0.01
100%|██████████| 50/50 [01:32<00:00, 1.85s/it]
Setting - num. agg. states: 10, step_size: 0.05
100%|██████████| 50/50 [01:33<00:00, 1.87s/it]
Setting - num. agg. states: 10, step_size: 0.1
100%|██████████| 50/50 [01:31<00:00, 1.83s/it]
Is the learned state value plot with stepsize=0.01 similar to Figure 9.2 (p.208) in Sutton and Barto?
(Note that our environment has fewer states (500), we ran 2000 episodes, and we averaged the performance over 50 runs.)
Look at the plot of the learning curve. Does RMSVE decrease over time?
Would it be possible to reduce RMSVE to 0?
You should see the RMSVE decrease over time, but the error seems to plateau. It is impossible to reduce the RMSVE to 0 because of function approximation (and because we do not decay the step-size parameter to zero). With function approximation, the agent has limited resources and has to trade off the accuracy of one state for another.
Run the following code to verify your experimental result.
1  #  
Your experiment results are correct!
In this section, we will run some more experiments to see how different parameter settings affect the results!
In particular, we will test several values of num_groups and step_size. Parameter sweeps, although necessary, can take a lot of time. So now that you have verified your experiment result, here we show you the results of the parameter sweeps that you would see when running the sweeps yourself.

We tested several different values of num_groups: {10, 100, 500}, and step_size: {0.01, 0.05, 0.1}. As before, we performed 2000 episodes per run, and averaged the results over 50 runs for each setting.
Run the cell below to display the sweep results.
1  #  
Let’s think about the results of our parameter study.
Which state aggregation resolution do you think is the best after running 2000 episodes? Which state aggregation resolution do you think would be the best if we could train for only 200 episodes? What if we could train for a million episodes?
Should we use tabular representation (state aggregation of resolution 500) whenever possible? Why might we want to use function approximation?
From the plots, using 100 state aggregation with stepsize 0.05 reaches the best performance: the lowest RMSVE after 2000 episodes. If the agent can only be trained for 200 episodes, then 10 state aggregation with stepsize 0.05 reaches the lowest error. Increasing the resolution of state aggregation makes the function approximation closer to a tabular representation, which would be able to learn exactly correct state values for all states. But learning will be slower.
The best stepsize is different for different state aggregation resolutions. A larger stepsize allows the agent to learn faster, but might not perform as well asymptotically. A smaller stepsize causes it to learn more slowly, but may perform well asymptotically.
You have implemented semigradient TD(0) with State Aggregation in a 500state Random Walk. We used an environment with a large but discrete state space, where it was possible to compute the true state values. This allowed us to compare the values learned by your agent to the true state values. The same state aggregation function approximation can also be applied to continuous state space environments, where comparison to the true values is not usually possible.
You also successfully applied supervised learning approaches to approximate value functions with semigradient TD(0).
Finally, we plotted the learned state values and compared with true state values. We also compared learning curves of different state aggregation resolutions and learning rates.
From the results, you can see why it is often desirable to use function approximation, even when tabular learning is possible. Asymptotically, an agent with tabular representation would be able to learn the true state value function, but it would learn much more slowly compared to an agent with function approximation. On the other hand, we also want to ensure we do not reduce discrimination too far (a coarse state aggregation resolution), because it will hurt the asymptotic performance.
In this notebook, you’re going to implement various components of StyleGAN, including the truncation trick, the mapping layer, noise injection, adaptive instance normalization (AdaIN), and progressive growing.
You will begin by importing some packages from PyTorch and defining a visualization function which will be useful later.
1  import torch 
The first component you will implement is the truncation trick. Remember that this is done after the model is trained and when you are sampling beautiful outputs. The truncation trick resamples the noise vector $z$ from a truncated normal distribution which allows you to tune the generator’s fidelity/diversity. The truncation value is at least 0, where 1 means there is little truncation (high diversity) and 0 means the distribution is all truncated except for the mean (high quality/fidelity). This trick is not exclusive to StyleGAN. In fact, you may recall playing with it in an earlier GAN notebook.
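One way to implement this sampling is rejection sampling in numpy, shown below as an illustrative sketch (the notebook itself may rely on a library routine such as scipy's truncnorm; note this sketch requires truncation > 0):

```python
import numpy as np

def get_truncated_noise(n_samples, z_dim, truncation, seed=0):
    """Sample noise from a standard normal truncated to [-truncation, truncation]
    by resampling out-of-bounds entries (illustrative; truncation must be > 0)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_samples, z_dim))
    out_of_bounds = np.abs(noise) > truncation
    while out_of_bounds.any():
        # Redraw only the entries that fell outside the truncation bounds
        noise[out_of_bounds] = rng.standard_normal(out_of_bounds.sum())
        out_of_bounds = np.abs(noise) > truncation
    return noise
```

Smaller truncation values concentrate samples near the mean of the distribution, trading diversity for fidelity.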
1  # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # Test the truncation sample 
Success!
The next component you need to implement is the mapping network. It takes the noise vector, $z$, and maps it to an intermediate noise vector, $w$. This makes it so $z$ can be represented in a more disentangled space which makes the features easier to control later.
The mapping network in StyleGAN is composed of 8 layers, but for your implementation, you will use a neural network with 3 layers. This is to save time training later.
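A 3-layer mapping network along these lines might be sketched in PyTorch as follows (the dimensions and layer sizes are illustrative assumptions):

```python
import torch
from torch import nn

class MappingLayers(nn.Module):
    """Maps the noise vector z to the intermediate noise vector w
    through three fully connected layers (a sketch)."""
    def __init__(self, z_dim, hidden_dim, w_dim):
        super().__init__()
        self.mapping = nn.Sequential(
            nn.Linear(z_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, w_dim),
        )

    def forward(self, noise):
        return self.mapping(noise)
```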
MappingLayers
1  # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # Test the mapping function 
Success!
Next, you will implement the random noise injection that occurs before every AdaIN block. To do this, you need to create a noise tensor that is the same size as the current feature map (image).
The noise tensor is not entirely random; it is initialized as one random channel that is then multiplied by learned weights for each channel in the image. For example, imagine an image has 512 channels and its height and width are (4 x 4). You would first create a random (4 x 4) noise matrix with one channel. Then, your model would create 512 values—one for each channel. Next, you multiply the (4 x 4) matrix by each one of these values. This creates a “random” tensor of 512 channels and (4 x 4) pixels, the same dimensions as the image. Finally, you add this noise tensor to the image. This introduces uncorrelated noise and is meant to increase the diversity in the image.
New starting weights are generated for every new layer, or generator, where this class is used. Within a layer, every following time the noise injection is called, you take another step with the optimizer and the weights that you use for each channel are optimized (i.e. learned).
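A sketch of such a noise-injection module (the random initialization of the learned per-channel weights here is an assumption):

```python
import torch
from torch import nn

class InjectNoise(nn.Module):
    """Add per-channel scaled random noise to a feature map (sketch)."""
    def __init__(self, channels):
        super().__init__()
        # One learned scale per channel, broadcast over height and width
        self.weight = nn.Parameter(torch.randn(1, channels, 1, 1))

    def forward(self, image):
        n, _, h, w = image.shape
        # One random noise channel, shared across all image channels
        noise = torch.randn(n, 1, h, w, device=image.device)
        return image + self.weight * noise  # broadcasts to every channel
```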
InjectNoise
1  # UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNIT TEST 
Success!
The next component you will implement is AdaIN. To increase control over the image, you inject $w$ — the intermediate noise vector — multiple times throughout StyleGAN. This is done by transforming it into a set of style parameters and introducing the style to the image through AdaIN. Given an image ($x_i$) and the intermediate vector ($w$), AdaIN takes the instance normalization of the image and multiplies it by the style scale ($y_s$) and adds the style bias ($y_b$). You need to calculate the learnable style scale and bias by using linear mappings from $w$.
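A sketch of AdaIN along these lines, assuming two separate linear layers produce the style scale and shift from $w$:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    # Instance-normalizes the image, then applies a style scale and shift
    # computed from the intermediate noise vector w via linear maps.
    def __init__(self, channels, w_dim):
        super().__init__()
        self.instance_norm = nn.InstanceNorm2d(channels)
        self.style_scale_transform = nn.Linear(w_dim, channels)
        self.style_shift_transform = nn.Linear(w_dim, channels)

    def forward(self, image, w):
        normalized = self.instance_norm(image)
        style_scale = self.style_scale_transform(w)[:, :, None, None]
        style_shift = self.style_shift_transform(w)[:, :, None, None]
        return style_scale * normalized + style_shift

adain = AdaIN(channels=8, w_dim=16)
styled = adain(torch.randn(2, 8, 4, 4), torch.randn(2, 16))
```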
forward
1  # UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  w_channels = 50 
Success!
The final StyleGAN component that you will create is progressive growing. This helps StyleGAN create high-resolution images by gradually doubling the image’s size until it reaches the desired resolution.
You will start by creating a block for the StyleGAN generator. It is composed of an upsampling layer, a convolutional layer, random noise injection, an AdaIN layer, and an activation.
1  # UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  test_stylegan_block = MicroStyleGANGeneratorBlock(in_chan=128, out_chan=64, w_dim=256, kernel_size=3, starting_size=8) 
Success!
Now, you can implement progressive growing.
StyleGAN starts with a constant 4 x 4 (x 512 channel) tensor which is put through an iteration of the generator without upsampling. The output is some noise that can then be transformed into a blurry 4 x 4 image. This is where the progressive growing process begins. The 4 x 4 noise can be further passed through a generator block with upsampling to produce an 8 x 8 output. However, this will be done gradually.
You will simulate progressive growing from an 8 x 8 image to a 16 x 16 image. Instead of simply passing it to the generator block with upsampling, StyleGAN gradually trains the generator to the new size by mixing in an image that was only upsampled. By mixing an upsampled 8 x 8 image (which is 16 x 16) with increasingly more of the 16 x 16 generator output, the generator is more stable as it progressively trains. As such, you will do two separate operations with the 8 x 8 noise:
You will now have two images, both double the resolution of the 8 x 8 noise. Then, using an alpha ($\alpha$) term, you combine the higher-resolution images obtained from (1) and (2). You would then pass this into the discriminator and use the feedback to update the weights of your generator. The key here is that the $\alpha$ term is gradually increased until eventually only the image from (1), the generator, is used. That is your final image, or you could continue this process to make a 32 x 32 image, or 64 x 64, 128 x 128, etc.
The micro model you will implement visualizes what the model outputs at a particular stage of training, for a specific value of $\alpha$. To reiterate, in practice StyleGAN slowly phases out the upsampled image by increasing the $\alpha$ parameter over many training steps, repeating this process with larger and larger alpha values until $\alpha$ reaches 1; at that point, the combined image comes solely from the generator block. This method of gradually training the generator increases the stability and fidelity of the model.
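The alpha mixing itself is a simple linear interpolation; a sketch using torch.lerp (the helper name mix_images is illustrative):

```python
import torch

def mix_images(upsampled, generated, alpha):
    # Linear interpolation: alpha=0 returns the upsampled image,
    # alpha=1 returns the generator block's output.
    return torch.lerp(upsampled, generated, alpha)

low_res = torch.zeros(1, 3, 16, 16)   # stand-in for the upsampled 8 x 8 image
high_res = torch.ones(1, 3, 16, 16)   # stand-in for the 16 x 16 generator output
mixed = mix_images(low_res, high_res, 0.3)
```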
forward
1  # UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  z_dim = 128 
Success!
Finally, you can put all the components together to run an iteration of your micro StyleGAN!
You can also visualize what this randomly initialized generator can produce. The code will automatically interpolate between different values of alpha so that you can intuitively see what it means to mix the low-resolution and high-resolution images using different values of alpha. In the generated image, the samples start from low alpha values and go to high alpha values.
1  import numpy as np 
1 
In this notebook, you’re going to gain a better understanding of some of the challenges that come with evaluating GANs and a response you can take to alleviate some of them called Fréchet Inception Distance (FID).
One aspect that makes evaluating GANs challenging is that the loss tells us little about their performance. Unlike with classifiers, where a low loss on a test set indicates superior performance, a low loss for the generator or discriminator suggests that learning has stopped.
If you define the goal of a GAN as “generating images which look real to people,” then it’s technically possible to measure this directly: you can ask people to act as a discriminator. However, this takes significant time and money, so ideally you can use a proxy for it. There is also no “perfect” discriminator that can differentiate reals from fakes; if there were, a lot of machine learning tasks would be solved ;)
In this notebook, you will implement Fréchet Inception Distance, one method which aims to solve these issues.
For this notebook, you will again be using CelebA. You will start by loading a pretrained generator which has been trained on CelebA.
Here, you will import some useful libraries and packages. You will also be provided with the generator and noise code from earlier assignments.
1  import torch 
Now, you can set the arguments for the model and load the dataset:
1  z_dim = 64 
Then, you can load and initialize the model with weights from a pretrained model. This allows you to use the pretrained model as if you trained it yourself.
1  gen = Generator(z_dim).to(device) 
InceptionV3 is a neural network trained on ImageNet to classify objects. You may recall from the lectures that ImageNet has over 1 million images to train on. As a result, InceptionV3 does a good job detecting features and classifying images. Here, you will load InceptionV3 as inception_model.
1  from torchvision.models import inception_v3 
Fréchet Inception Distance (FID) was proposed as an improvement over Inception Score and still uses the Inception-v3 network as part of its calculation. However, instead of using the classification labels of the Inception-v3 network, it uses the output from an earlier layer—the layer right before the labels. This is often called the feature layer. Research has shown that deep convolutional neural networks trained on difficult tasks, like classifying many classes, build increasingly sophisticated representations of features going deeper into the network. For example, the first few layers may learn to detect different kinds of edges and curves, while the later layers may have neurons that fire in response to human faces.
To get the feature layer of a convolutional neural network, you can replace the final fully connected layer with an identity layer that simply returns whatever input it received, unchanged. This essentially removes the final classification layer and leaves you with the intermediate outputs from the layer before.
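A sketch of this idea on a tiny stand-in network (TinyClassifier is purely illustrative; with the real model you would assign to inception_model.fc in the same way):

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    # Stand-in for Inception-v3: a feature extractor followed by a
    # final fully connected classification layer named `fc`.
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 2048)
        self.fc = nn.Linear(2048, 1000)

    def forward(self, x):
        return self.fc(self.features(x))

model = TinyClassifier()
model.fc = nn.Identity()  # same idea as replacing inception_model.fc
features = model(torch.randn(4, 8))
```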
inception_model.fc
1  # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNIT TEST 
Success!
Fréchet distance uses the values from the feature layer for two sets of images, say reals and fakes, and compares different statistical properties between them to see how different they are. Specifically, Fréchet distance finds the shortest distance needed to walk along two lines, or two curves, simultaneously. The most intuitive explanation of Fréchet distance is as the “minimum leash distance” between two points. Imagine yourself and your dog, both moving along two curves. If you walked on one curve and your dog, attached to a leash, walked on the other at the same pace, what is the least amount of leash that you can give your dog so that you never need to give them more slack during your walk? Using this, the Fréchet distance measures the similarity between these two curves.
The basic idea is similar for calculating the Fréchet distance between two probability distributions. You’ll start by seeing what this looks like in onedimensional, also called univariate, space.
You can calculate the distance between two normal distributions $X$ and $Y$ with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$, as:

$d(X, Y) = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2$
Pretty simple, right? Now you can see how it can be extended to multidimensional, also called multivariate, space.
Covariance
To find the Fréchet distance between two multivariate normal distributions, you first need to find the covariance instead of the standard deviation. The covariance, which is the multivariate version of variance (the square of standard deviation), is represented using a square matrix where the side length is equal to the number of dimensions. Since the feature vectors you will be using have 2048 values/weights, the covariance matrix will be 2048 x 2048. But for the sake of an example, this is a covariance matrix in a two-dimensional space:
$\Sigma = \left(\begin{array}{cc}
1 & 0\\
0 & 1
\end{array}\right)
$
The value at location $(i, j)$ corresponds to the covariance of dimension $i$ with dimension $j$. Since the covariance of $i$ with $j$ and of $j$ with $i$ are equivalent, the matrix is always symmetric with respect to the diagonal. Each diagonal entry is the covariance of a dimension with itself, i.e. its variance. In this example, there are zeros everywhere except the diagonal, which means the two dimensions are independent of one another; they are completely unrelated.
The following code cell will visualize this matrix.
1  #import os 
Now, here’s an example of a multivariate normal distribution that has covariance:
$\Sigma = \left(\begin{array}{cc}
2 & 1\\
1 & 2
\end{array}\right)
$
And see how it looks:
1  mean = torch.Tensor([0, 0]) 
Formula
Based on the paper “The Fréchet distance between multivariate normal distributions” by Dowson and Landau (1982), the Fréchet distance between two multivariate normal distributions $X$ and $Y$ is:
$d(X, Y) = \Vert\mu_X - \mu_Y\Vert^2 + \mathrm{Tr}\left(\Sigma_X + \Sigma_Y - 2 \sqrt{\Sigma_X \Sigma_Y}\right)$
Similar to the formula for univariate Fréchet distance, you calculate the distance between the means and the distance between the spreads of the distributions. However, the spread term changes slightly here: it uses the covariance matrices and involves a matrix product and a matrix square root. $\mathrm{Tr}$ refers to the trace, the sum of the diagonal elements of a matrix.
Now you can implement this!
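A NumPy/SciPy sketch of the formula above (the graded cell works with torch tensors instead, so treat this only as a reference implementation):

```python
import numpy as np
import scipy.linalg

def frechet_distance(mu_x, sigma_x, mu_y, sigma_y):
    # Matrix square root of the covariance product; take the real part
    # because numerical error can introduce tiny imaginary components.
    covmean = scipy.linalg.sqrtm(sigma_x @ sigma_y).real
    mean_term = np.sum((mu_x - mu_y) ** 2)
    cov_term = np.trace(sigma_x + sigma_y - 2 * covmean)
    return mean_term + cov_term
```

For identical distributions the distance is 0; shifting the mean of a unit-covariance 2-D Gaussian by (1, 1) gives a distance of 2.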
frechet_distance
1  import scipy 
1  # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNIT TEST 
Success!
Now, you can apply FID to your generator from earlier.
You will start by defining a bit of helper code to preprocess the image for the Inception-v3 network:
1  def preprocess(img): 
Then, you’ll define a function to calculate the covariance of the features that returns a covariance matrix given a list of values:
1  import numpy as np 
Finally, you can use the pretrained Inception-v3 model to compute features of the real and fake images. With these features, you can then get the covariance and means of these features across many samples.
First, you get the features of the real and fake images using the Inception-v3 model:
1  fake_features_list = [] 
Then, you can combine all of the values that you collected for the reals and fakes into large tensors:
1  # UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
And calculate the covariance and means of these real and fake features:
1  # UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  assert tuple(sigma_fake.shape) == (fake_features_all.shape[1], fake_features_all.shape[1]) 
Success!
At this point, you can also visualize what the pairwise multivariate distributions of the inception features look like!
1  indices = [2, 4, 5] 
Lastly, you can use your earlier frechet_distance function to calculate the FID and evaluate your GAN. You can see how similar/different the features of the generated images are to the features of the real images. The next cell might take five minutes or so to run on Coursera.
1  with torch.no_grad(): 
86.48429107666016
You’ll notice this model gets a pretty high FID, likely over 30. Since lower is better, and the best models on CelebA get scores in the single digits, there’s clearly a long way to go with this model. You can use FID to compare different models, as well as different stages of training of the same model.
1 
In this notebook, you’re going to make a conditional GAN in order to generate handwritten images of digits, conditioned on the digit to be generated (the class vector). This will let you choose what digit you want to generate.
You’ll then do some exploration of the generated images to visualize what the noise and class vectors mean.
For this assignment, you will be using the MNIST dataset again, but there’s nothing stopping you from applying this generator code to produce images of animals conditioned on the species or pictures of faces conditioned on facial characteristics.
Note that this assignment requires no changes to the architectures of the generator or discriminator, only changes to the data passed to both. The generator will no longer take z_dim as an argument, but input_dim instead, since you need to pass in both the noise and class vectors. In addition to good variable naming, this also means that you can use the generator and discriminator code you have previously written with different parameters.
You will begin by importing the necessary libraries and building the generator and discriminator.
1  import torch 
1  class Generator(nn.Module): 
1  class Discriminator(nn.Module): 
In conditional GANs, the input vector for the generator will also need to include the class information. The class is represented using a one-hot encoded vector where its length is the number of classes and each index represents a class. The vector is all 0’s with a 1 at the index of the chosen class. Given the labels of multiple images (e.g. from a batch) and the number of classes, please create one-hot vectors for each label. There is a function within the PyTorch functional library that can help you.
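A sketch using torch.nn.functional.one_hot (the cast to float is an assumption, made so the vectors can later be concatenated with float noise):

```python
import torch
import torch.nn.functional as F

def get_one_hot_labels(labels, n_classes):
    # F.one_hot returns int64 vectors; cast to float for use as network input.
    return F.one_hot(labels, num_classes=n_classes).float()

one_hot_vectors = get_one_hot_labels(torch.tensor([1, 3, 0]), n_classes=5)
```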
get_one_hot_labels
1  # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  assert ( 
Success!
Next, you need to be able to concatenate the one-hot class vector to the noise vector before giving it to the generator. You will also need to do this when adding the class channels to the discriminator.
To do this, you will need to write a function that combines two vectors. Remember that you need to ensure that the vectors are the same type: floats. Again, you can look to the PyTorch library for help.
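A sketch using torch.cat along the feature dimension, with both inputs cast to float:

```python
import torch

def combine_vectors(x, y):
    # Cast both inputs to float and concatenate along the feature dimension.
    return torch.cat((x.float(), y.float()), dim=1)

combined = combine_vectors(torch.tensor([[1, 2], [3, 4]]),
                           torch.tensor([[5, 6], [7, 8]]))
```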
combine_vectors
1  # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  combined = combine_vectors(torch.tensor([[1, 2], [3, 4]]), torch.tensor([[5, 6], [7, 8]])); 
Success!
Now you can start to put it all together!
First, you will define some new parameters:
1  mnist_shape = (1, 28, 28) 
And you also include the same parameters from previous assignments:
1  criterion = nn.BCEWithLogitsLoss() 
Then, you can initialize your generator, discriminator, and optimizers. To do this, you will need to update the input dimensions for both models. For the generator, you will need to calculate the size of the input vector; recall that for conditional GANs, the generator’s input is the noise vector concatenated with the class vector. For the discriminator, you need to add a channel for every class.
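For example, with a 64-dimensional noise vector, 1-channel MNIST images, and 10 classes, the sizes work out as sketched below (treat this as one possible implementation):

```python
def get_input_dimensions(z_dim, mnist_shape, n_classes):
    # Generator input: noise vector concatenated with the one-hot class vector.
    generator_input_dim = z_dim + n_classes
    # Discriminator input: image channels plus one extra channel per class.
    discriminator_im_chan = mnist_shape[0] + n_classes
    return generator_input_dim, discriminator_im_chan

dims = get_input_dimensions(64, (1, 28, 28), 10)
```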
1  # UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  def test_input_dims(): 
Success!
1  generator_input_dim, discriminator_im_chan = get_input_dimensions(z_dim, mnist_shape, n_classes) 
Now to train, you would like both your generator and your discriminator to know what class of image should be generated. There are a few locations where you will need to implement code.
For example, if you’re generating a picture of the number “1”, you would need to:
There are no explicit unit tests here — if this block of code runs and you don’t change any of the other variables, then you’ve done it correctly!
1  # UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
You can do a bit of exploration now!
1  # Before you explore, you should put the generator 
You can generate some numbers with your new model! You can add interpolation as well to make it more interesting.
So, starting from one image, you will produce intermediate images that look more and more like the ending image until you get to the final image. You’re basically morphing one image into another. You can choose which two images to use with your conditional GAN.
1  import math 
Now, what happens if you hold the class constant, but instead you change the noise vector? You can also interpolate the noise vector and generate an image at each step.
1  n_interpolation = 9 # How many intermediate images you want + 2 (for the start and end image) 
1 
In this notebook, you’re going to build a Wasserstein GAN with Gradient Penalty (WGAN-GP) that solves some of the stability issues with the GANs that you have been using up until this point. Specifically, you’ll use a special kind of loss function known as the W-loss, where W stands for Wasserstein, and gradient penalties to prevent mode collapse.
Fun Fact: Wasserstein is named after a mathematician at Penn State, Leonid Vaseršteĭn. You’ll see it abbreviated to W (e.g. WGAN, W-loss, W-distance).
You will begin by importing some useful packages, defining visualization functions, building the generator, and building the critic. Since the changes for WGAN-GP are done to the loss function during training, you can simply reuse your previous GAN code for the generator and critic class. Remember that in WGAN-GP, you no longer use a discriminator that classifies fake and real as 0 and 1 but rather a critic that scores images with real numbers.
1  import torch 
1  class Generator(nn.Module): 
1  class Critic(nn.Module): 
Now you can start putting it all together.
As usual, you will start by setting the parameters:
You will also load and transform the MNIST dataset to tensors.
1  n_epochs = 100 
Then, you can initialize your generator, critic, and optimizers.
1  gen = Generator(z_dim).to(device) 
Calculating the gradient penalty can be broken into two functions: (1) compute the gradient with respect to the images and (2) compute the gradient penalty given the gradient.
You can start by getting the gradient. The gradient is computed by first creating a mixed image. This is done by weighting the fake and real images using epsilon and then adding them together. Once you have the intermediate image, you can get the critic’s output on it. Finally, you compute the gradient of the critic’s scores on the mixed images (output) with respect to the pixels of the mixed images (input). You will need to fill in the code to get the gradient wherever you see None. There is a test function in the next block for you to test your solution.
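A sketch of this gradient computation using torch.autograd.grad, assuming epsilon is a per-image mixing weight that requires gradients:

```python
import torch

def get_gradient(crit, real, fake, epsilon):
    # Mix real and fake images, score the mix with the critic, then take
    # the gradient of the scores with respect to the mixed images.
    mixed_images = real * epsilon + fake * (1 - epsilon)
    mixed_scores = crit(mixed_images)
    gradient = torch.autograd.grad(
        outputs=mixed_scores,
        inputs=mixed_images,
        grad_outputs=torch.ones_like(mixed_scores),
        create_graph=True,
        retain_graph=True,
    )[0]
    return gradient
```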
1  # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNIT TEST 
Success!
The second function you need to complete is to compute the gradient penalty given the gradient. First, you calculate the magnitude of each image’s gradient. The magnitude of a gradient is also called the norm. Then, you calculate the penalty by squaring the distance between each magnitude and the ideal norm of 1 and taking the mean of all the squared distances.
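A sketch of this penalty computation, flattening each image’s gradient before taking its norm:

```python
import torch

def gradient_penalty(gradient):
    # Flatten each image's gradient, take its L2 norm, and penalize the
    # squared distance of each norm from the ideal norm of 1.
    gradient = gradient.view(len(gradient), -1)
    gradient_norm = gradient.norm(2, dim=1)
    return torch.mean((gradient_norm - 1) ** 2)
```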
Again, you will need to fill in the code wherever you see None. There are hints below that you can view if you need help and there is a test function in the next block for you to test your solution.
gradient_penalty
1  # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNIT TEST 
Success!
Next, you need to calculate the loss for the generator and the critic.
For the generator, the loss is calculated by maximizing the critic’s prediction on the generator’s fake images. The argument has the scores for all fake images in the batch, but you will use the mean of them.
There are optional hints below and a test function in the next block for you to test your solution.
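As a sketch, the generator loss is just the negated mean critic score on the fakes:

```python
import torch

def get_gen_loss(crit_fake_pred):
    # The generator wants the critic's scores on its fakes to be as high
    # as possible, so it minimizes the negated mean score.
    return -torch.mean(crit_fake_pred)
```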
get_gen_loss
1  # UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNIT TEST 
Success!
For the critic, the loss is calculated by maximizing the distance between the critic’s predictions on the real images and the predictions on the fake images while also adding a gradient penalty. The gradient penalty is weighted by lambda. The arguments are the scores for all the images in the batch, and you will use the mean of them.
There are hints below if you get stuck and a test function in the next block for you to test your solution.
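As a sketch, the critic loss combines the mean fake score, the negated mean real score, and the weighted gradient penalty:

```python
import torch

def get_crit_loss(crit_fake_pred, crit_real_pred, gp, c_lambda):
    # The critic pushes real scores up and fake scores down, while the
    # gradient penalty (weighted by c_lambda) enforces the norm constraint.
    return torch.mean(crit_fake_pred) - torch.mean(crit_real_pred) + c_lambda * gp
```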
get_crit_loss
1  # UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNIT TEST 
Success!
Before you put everything together, there are a few things to note.
Here is a snapshot of what your WGANGP outputs should resemble:
1  import matplotlib.pyplot as plt 
1 
In this notebook, you’re going to create another GAN using the MNIST dataset. You will implement a Deep Convolutional GAN (DCGAN), a very successful and influential GAN model developed in 2015.
Note: here is the paper if you are interested! It might look dense now, but soon you’ll be able to understand many parts of it :)
Figure: Architectural drawing of a generator from DCGAN from Radford et al (2016).
Here are the main features of DCGAN (don’t worry about memorizing these, you will be guided through the implementation!):
1  Architecture guidelines for stable Deep Convolutional GANs 
You will begin by importing some useful packages and data that will help you create your GAN. You are also provided a visualizer function to help see the images your GAN will create.
1  import torch 
The first component you will make is the generator. You may notice that instead of passing in the image dimension, you will pass the number of image channels to the generator. This is because with DCGAN, you use convolutions which don’t depend on the number of pixels on an image. However, the number of channels is important to determine the size of the filters.
You will build a generator using 4 layers (3 hidden layers + 1 output layer). As before, you will need to write a function to create a single block for the generator’s neural network.
Since in DCGAN the activation function will be different for the output layer, you will need to check what layer is being created. You are supplied with some tests following the code cell so you can see if you’re on the right track!
At the end of the generator class, you are given a forward pass function that takes in a noise vector and generates an image of the output dimension using your neural network. You are also given a function to create a noise vector. These functions are the same as the ones from the last assignment.
make_gen_block
1  # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
Here’s the test for your generator block:
1  # UNIT TESTS 
Success!
The second component you need to create is the discriminator.
You will use 3 layers in your discriminator’s neural network. Like with the generator, you will need to create a function that builds a single neural network block for the discriminator.
From the paper, we know that we need to “[u]se LeakyReLU activation in the discriminator for all layers.” And for the LeakyReLUs, “the slope of the leak was set to 0.2” in DCGAN.
There are also tests at the end for you to use.
make_disc_block
1  # UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
Here’s a test for your discriminator block:
1  # Test the hidden block 
Success!
Now you can put it all together!
Remember that these are your parameters:
In addition, be warned that this runs very slowly on the default CPU. One way to run this more quickly is to download the .ipynb, upload it to Google Drive, open it with Google Colab, click on Runtime > Change runtime type, set the hardware accelerator to GPU, and replace device = "cpu" with device = "cuda". The code should then run without any more changes, over 1,000 times faster.
1  criterion = nn.BCEWithLogitsLoss() 
Then, you can initialize your generator, discriminator, and optimizers.
1  gen = Generator(z_dim).to(device) 
Finally, you can train your GAN!
For each epoch, you will process the entire dataset in batches. For every batch, you will update the discriminator and generator. Then, you can see DCGAN’s results!
Here’s roughly the progression you should be expecting. On GPU this takes about 30 seconds per thousand steps. On CPU, this can take about 8 hours per thousand steps. You might notice that in the image of Step 5000, the generator is disproportionately producing things that look like ones. If the discriminator didn’t learn to detect this imbalance quickly enough, then the generator could just produce more ones. As a result, it may have ended up tricking the discriminator so well that there would be no more improvement, a phenomenon known as mode collapse:
1  n_epochs = 50 
In this notebook, you’re going to create your first generative adversarial network (GAN) for this course! Specifically, you will build and train a GAN that can generate handwritten images of digits (09). You will be using PyTorch in this specialization, so if you’re not familiar with this framework, you may find the PyTorch documentation useful. The hints will also often include links to relevant documentation.
You will begin by importing some useful packages and the dataset you will use to build and train your GAN. You are also provided with a visualizer function to help you investigate the images your GAN will create.
1  import torch 
The training images your discriminator will use come from a dataset called MNIST. It contains 60,000 images of handwritten digits, from 0 to 9, like these:
You may notice that the images are quite pixelated — this is because they are all only 28 x 28! The small size of its images makes MNIST ideal for simple training. Additionally, these images are also in black-and-white, so only one dimension, or “color channel”, is needed to represent them (more on this later in the course).
You will represent the data using tensors. Tensors are a generalization of matrices: for example, a stack of three matrices with the amounts of red, green, and blue at different locations in a 64 x 64 pixel image is a tensor with the shape 3 x 64 x 64.
Tensors are easy to manipulate and supported by PyTorch, the machine learning library you will be using. Feel free to explore them more, but you can imagine these as multidimensional matrices or vectors!
While you could train your model after generating one image, it is extremely inefficient and leads to less stable training. In GANs, and in machine learning in general, you will process multiple images per training step. These are called batches.
This means that your generator will generate an entire batch of images and receive the discriminator’s feedback on each before updating the model. The same goes for the discriminator, it will calculate its loss on the entire batch of generated images as well as on the reals before the model is updated.
The first step is to build the generator component.
You will start by creating a function to make a single layer/block for the generator’s neural network. Each block should include a linear transformation to map to another shape, a batch normalization for stabilization, and finally a nonlinear activation function (you use a ReLU here) so the output can be transformed in complex ways. You will learn more about activations and batch normalization later in the course.
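A sketch of such a block (the helper name get_generator_block is illustrative):

```python
import torch
import torch.nn as nn

def get_generator_block(input_dim, output_dim):
    # Linear map to the new shape, batch norm for stability, then ReLU.
    return nn.Sequential(
        nn.Linear(input_dim, output_dim),
        nn.BatchNorm1d(output_dim),
        nn.ReLU(inplace=True),
    )

block = get_generator_block(8, 16)
out = block(torch.randn(4, 8))
```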
1  # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # Verify the generator block function 
Success!
Now you can build the generator class. It will take 3 values:
Using these values, the generator will build a neural network with 5 layers/blocks. Beginning with the noise vector, the generator will apply nonlinear transformations via the block function until the tensor is mapped to the size of the image to be outputted (the same size as the real images from MNIST). You will need to fill in the code for final layer since it is different than the others. The final layer does not need a normalization or activation function, but does need to be scaled with a sigmoid function.
Finally, you are given a forward pass function that takes in a noise vector and generates an image of the output dimension using your neural network.
Generator
1  # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # Verify the generator class 
Success!
To be able to use your generator, you will need to be able to create noise vectors. The noise vector z has the important role of making sure the images generated from the same class don’t all look the same — think of it as a random seed. You will generate it randomly using PyTorch by sampling random numbers from the normal distribution. Since multiple images will be processed per pass, you will generate all the noise vectors at once.
Note that whenever you create a new tensor using torch.ones, torch.zeros, or torch.randn, you either need to create it on the target device, e.g. torch.ones(3, 3, device=device), or move it onto the target device using torch.ones(3, 3).to(device). You do not need to do this if you’re creating a tensor by manipulating another tensor or by using a variation that defaults the device to the input, such as torch.ones_like. In general, use torch.ones_like and torch.zeros_like instead of torch.ones or torch.zeros where possible.
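A sketch of the noise helper using torch.randn with the device argument, as recommended above:

```python
import torch

def get_noise(n_samples, z_dim, device='cpu'):
    # One z_dim-length vector of standard normal samples per image,
    # created directly on the target device.
    return torch.randn(n_samples, z_dim, device=device)

noise = get_noise(1000, 32)
```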
get_noise
1  # UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # Verify the noise vector function 
Success!
The second component that you need to construct is the discriminator. As with the generator component, you will start by creating a function that builds a neural network block for the discriminator.
Note: You use leaky ReLUs to prevent the “dying ReLU” problem, which refers to the phenomenon where the parameters stop changing due to consistently negative values passed to a ReLU, which result in a zero gradient. You will learn more about this in the following lectures!
Figure: Rectified Linear Unit (ReLU) vs. Leaky ReLU activation functions.
1  # UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # Verify the discriminator block function 
Success!
Now you can use these blocks to make a discriminator! The discriminator class holds 2 values:
The discriminator will build a neural network with 4 layers. It will start with the image tensor and transform it until it returns a single number (a 1-dimensional tensor) as output. This output classifies whether an image is fake or real. Note that you do not need a sigmoid after the output layer since it is included in the loss function. Finally, to use your discriminator’s neural network, you are given a forward pass function that takes in an image tensor to be classified.
1  # UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  # Verify the discriminator class 
Success!
Now you can put it all together!
First, you will set your parameters:
Next, you will load the MNIST dataset as tensors using a dataloader.
1  # Set your parameters 
Now, you can initialize your generator, discriminator, and optimizers. Note that each optimizer only takes the parameters of one particular model, since we want each optimizer to optimize only one of the models.
1  gen = Generator(z_dim).to(device) 
Before you train your GAN, you will need to create functions to calculate the discriminator’s loss and the generator’s loss. This is how the discriminator and generator will know how they are doing and improve themselves. Since the generator is needed when calculating the discriminator’s loss, you will need to call .detach() on the generator result to ensure that only the discriminator is updated!
Remember that you have already defined a loss function earlier (criterion) and you are encouraged to use torch.ones_like and torch.zeros_like instead of torch.ones or torch.zeros. If you use torch.ones or torch.zeros, you’ll need to pass device=device to them.
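As a sketch, a discriminator loss along these lines might look like the following, assuming gen, disc, and criterion (BCE with logits) are already defined; note the .detach() so the generator is not updated through this loss:

```python
import torch
import torch.nn as nn

def get_disc_loss(gen, disc, criterion, real, num_images, z_dim, device='cpu'):
    # Generate fakes, then detach them so only the discriminator is updated.
    noise = torch.randn(num_images, z_dim, device=device)
    fake = gen(noise)
    disc_fake_pred = disc(fake.detach())
    fake_loss = criterion(disc_fake_pred, torch.zeros_like(disc_fake_pred))
    disc_real_pred = disc(real)
    real_loss = criterion(disc_real_pred, torch.ones_like(disc_real_pred))
    return (fake_loss + real_loss) / 2
```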
1  # UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  def test_disc_reasonable(num_images=10): 
Success!
1  # UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT) 
1  def test_gen_reasonable(num_images=10): 
Success!
Finally, you can put everything together! For each epoch, you will process the entire dataset in batches. For every batch, you will need to update the discriminator and generator using their losses. Batches are sets of images on which predictions are made before the loss functions are calculated (instead of calculating the loss after each individual image). Note that you may see a loss greater than 1; this is okay, since binary cross-entropy loss can be any positive number for a sufficiently confident wrong guess.
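The per-batch logic described above can be sketched as one self-contained epoch function. The losses are computed inline here for illustration; the function name and the exact update order (discriminator first, then generator) are assumptions:

```python
import torch

def train_one_epoch(dataloader, gen, disc, gen_opt, disc_opt,
                    criterion, z_dim, device):
    """One pass over the data; returns mean losses. A sketch, not graded code."""
    d_losses, g_losses = [], []
    for real, _ in dataloader:
        real = real.view(len(real), -1).to(device)  # flatten images
        batch = len(real)

        # Discriminator step: fakes are detached so only disc is updated
        disc_opt.zero_grad()
        noise = torch.randn(batch, z_dim, device=device)
        fake = gen(noise).detach()
        fake_pred, real_pred = disc(fake), disc(real)
        disc_loss = (criterion(fake_pred, torch.zeros_like(fake_pred)) +
                     criterion(real_pred, torch.ones_like(real_pred))) / 2
        disc_loss.backward()
        disc_opt.step()

        # Generator step: it wants the discriminator to call its fakes real
        gen_opt.zero_grad()
        noise = torch.randn(batch, z_dim, device=device)
        pred = disc(gen(noise))
        gen_loss = criterion(pred, torch.ones_like(pred))
        gen_loss.backward()
        gen_opt.step()

        d_losses.append(disc_loss.item())
        g_losses.append(gen_loss.item())
    return sum(d_losses) / len(d_losses), sum(g_losses) / len(g_losses)
```

Note that fresh noise is drawn for each update; reusing the discriminator step's detached fakes for the generator step would not work, since the graph was cut by .detach().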
It’s also often the case that the discriminator will outperform the generator, especially at the start, because its job is easier. It’s important that neither one gets too good (that is, near-perfect accuracy), which would cause the entire model to stop learning. Balancing the two models is actually remarkably hard to do in a standard GAN and something you will see more of in later lectures and assignments.
After you’ve submitted a working version with the original architecture, feel free to play around with the architecture if you want to see how different architectural choices can lead to better or worse GANs. For example, consider changing the size of the hidden dimension, or making the networks shallower or deeper by changing the number of layers.
But remember, don’t expect anything spectacular: this is only the first lesson. The results will get better with later lessons as you learn methods to help keep your generator and discriminator at similar levels.
You should roughly expect to see this progression. On a GPU, this should take about 15 seconds per 500 steps, on average, while on CPU it will take roughly 1.5 minutes:
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
Welcome to this programming assignment! In this notebook, you will:
We will give you the environment and infrastructure to run the experiment and visualize the performance. The assignment will be graded automatically by comparing the behavior of your agent to our implementations of the algorithms. The random seed will be set explicitly to avoid different behaviors due to randomness.
Please go through the cells in order.
In this maze environment, the goal is to reach the goal state (G) as fast as possible from the starting state (S). There are four actions – up, down, right, left – which take the agent deterministically from a state to the corresponding neighboring state, except when movement is blocked by a wall (denoted by grey) or the edge of the maze, in which case the agent remains where it is. The reward is +1 on reaching the goal state and 0 otherwise. On reaching the goal state G, the agent returns to the start state S to begin a new episode. This is a discounted, episodic task with $\gamma = 0.95$.
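The deterministic transition rule just described can be sketched as a small function. The state encoding, argument names, and wall representation here are assumptions for illustration, not the environment's actual implementation:

```python
def maze_step(state, action, walls, height, width, goal):
    """Deterministic maze transition: move one cell unless blocked by a
    wall or the edge, in which case stay put; reward +1 only at the goal."""
    moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
    dr, dc = moves[action]
    r, c = state
    nr, nc = r + dr, c + dc
    # Blocked by the maze edge or a wall cell: remain in place
    if not (0 <= nr < height and 0 <= nc < width) or (nr, nc) in walls:
        nr, nc = r, c
    reward = 1.0 if (nr, nc) == goal else 0.0
    return (nr, nc), reward
```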
Later in the assignment, we will use a variant of this maze in which a ‘shortcut’ opens up after a certain number of timesteps. We will test whether the Dyna-Q and Dyna-Q+ agents are able to find the newly-opened shorter route to the goal state.
We import the following libraries that are required for this assignment. Primarily, we shall be using the following libraries:
Please do not import other libraries as this will break the autograder.
%matplotlib inline
plt.rcParams.update({'font.size': 15})
Let’s start with a quick recap of the tabular Dyna-Q algorithm.
Dyna-Q involves four basic steps:
Steps 1 and 2 are part of the tabular Q-learning algorithm and are denoted by lines (a)–(d) in the pseudocode above. Step 3 is performed in line (e), and Step 4 in the block of lines (f).
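As a rough sketch of how those four steps fit together on a single real transition, consider the following function. The names, the deterministic dictionary model, and the omission of terminal-state handling are simplifying assumptions, not the graded implementation:

```python
import numpy as np

def dyna_q_step(q, model, s, a, r, s_next, rng,
                step_size=0.125, gamma=0.95, planning_steps=5):
    """One Dyna-Q update: direct RL (lines (a)-(d)), model update (line (e)),
    and planning (block (f)). Terminal states are ignored for simplicity."""
    # Direct RL: Q-learning update from the real transition
    q[s, a] += step_size * (r + gamma * np.max(q[s_next]) - q[s, a])
    # Model update: deterministic environment, so store the last outcome
    model.setdefault(s, {})[a] = (s_next, r)
    # Planning: replay previously observed (s, a) pairs from the model
    for _ in range(planning_steps):
        ps = rng.choice(list(model.keys()))
        pa = rng.choice(list(model[ps].keys()))
        ps_next, pr = model[ps][pa]
        q[ps, pa] += step_size * (pr + gamma * np.max(q[ps_next]) - q[ps, pa])
    return q, model
```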
We highly recommend revisiting the Dyna videos in the course and the material in the RL textbook (in particular, Section 8.2).
Alright, let’s begin coding.
As you already know by now, you will develop an agent which interacts with the given environment via RLGlue. More specifically, you will implement the usual methods agent_start, agent_step, and agent_end in your DynaQAgent class, along with a couple of helper methods specific to Dyna-Q, namely update_model and planning_step. We will provide detailed comments in each method describing what your code should do.
Let’s break this down into pieces and do it one by one.
First of all, check out the agent_init method below. As in earlier assignments, some of the attributes are initialized with the data passed inside agent_info. In particular, pay attention to the attributes which are new to DynaQAgent, since you shall be using them later.
#
Now let’s create the update_model method, which performs the ‘Model Update’ step in the pseudocode. It takes an (s, a, s', r) tuple and stores the next state and reward corresponding to a state-action pair.
Remember, because the environment is deterministic, an easy way to implement the model is to have a dictionary of encountered states, each mapping to a dictionary of actions taken in those states, which in turn maps to a tuple of next state and reward. In this way, the model can be easily accessed via model[s][a], which would return the (s', r) tuple.
%%add_to DynaQAgent
update_model()
#
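The nested-dictionary idea can be sketched as a plain function (the graded method stores the dictionary on self.model instead; this standalone form is just for illustration):

```python
def update_model(model, past_state, past_action, state, reward):
    """Store (s, a) -> (s', r) in a dict of dicts; because the environment
    is deterministic, the last observed outcome is all we need to keep."""
    if past_state not in model:
        model[past_state] = {}
    model[past_state][past_action] = (state, reward)
    return model
```

After a few calls, model[s][a] returns the (s', r) tuple exactly as described above.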
Next, you will implement the planning step, the crux of the Dyna-Q algorithm. You shall be calling this planning_step method at every timestep of every trajectory.
%%add_to DynaQAgent
planning_step()
#
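The core of the planning step can be sketched as follows. The function signature is an assumption (the graded method works on self.q_values and self.model), and terminal next-state handling, which the graded version needs, is omitted here:

```python
import numpy as np

def planning_step(q, model, rng, planning_steps, step_size, gamma):
    """Sample previously seen (s, a) pairs from the model and apply
    Q-learning updates to them, as if the transitions were real."""
    for _ in range(planning_steps):
        # Uniformly pick a previously visited state, then an action taken in it
        s = rng.choice(list(model.keys()))
        a = rng.choice(list(model[s].keys()))
        s_next, r = model[s][a]
        # Standard Q-learning target from the simulated transition
        target = r + gamma * np.max(q[s_next])
        q[s, a] += step_size * (target - q[s, a])
    return q
```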
Now, before you move on to implement the rest of the agent methods, here are the helper functions that you’ve used in previous assignments for choosing an action using an $\epsilon$-greedy policy.
%%add_to DynaQAgent
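For reference, the usual shape of those helpers is sketched below: an argmax that breaks ties randomly, and an $\epsilon$-greedy chooser built on top of it. The names and standalone signatures are assumptions; the course versions are methods using the agent's own RNG:

```python
import numpy as np

def argmax_random_tiebreak(q_values, rng):
    """Argmax over action values that breaks ties randomly, so the agent
    does not always favor the lowest-indexed action."""
    top = np.max(q_values)
    ties = np.flatnonzero(q_values == top)
    return rng.choice(ties)

def choose_action_egreedy(q, state, epsilon, num_actions, rng):
    """With probability epsilon take a random action; otherwise be greedy."""
    if rng.random() < epsilon:
        return rng.integers(num_actions)
    return argmax_random_tiebreak(q[state], rng)
```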
Next, you will implement the rest of the agent-related methods, namely agent_start, agent_step, and agent_end.
%%add_to DynaQAgent
agent_start(), agent_step(), and agent_end()
#
Alright. Now we have all the components of DynaQAgent ready. Let’s try it out on the maze environment!
The next cell runs an experiment on this maze environment to test your implementation. The initial action values are $0$, the step-size parameter is $0.125$, and the exploration parameter is $\epsilon = 0.1$. After the experiment, the sum of rewards in each episode should match the correct result.
We will try planning steps of $0, 5, 50$ and compare their performance in terms of the average number of steps taken to reach the goal state in the aforementioned maze environment. For scientific rigor, we will run each experiment $30$ times. In each experiment, we set the initial random-number-generator (RNG) seeds for a fair comparison across algorithms.
#