# Character level language model - Dinosaurus land

Welcome to Dinosaurus Island! 65 million years ago, dinosaurs existed, and in this assignment they are back. You are in charge of a special task. Leading biology researchers are creating new breeds of dinosaurs and bringing them to life on earth, and your job is to give names to these dinosaurs. If a dinosaur does not like its name, it might go berserk, so choose wisely!

Luckily you have learned some deep learning and you will use it to save the day. Your assistant has collected a list of all the dinosaur names they could find, and compiled them into this dataset. (Feel free to take a look by clicking the previous link.) To create new dinosaur names, you will build a character level language model to generate new names. Your algorithm will learn the different name patterns, and randomly generate new names. Hopefully this algorithm will keep you and your team safe from the dinosaurs’ wrath!

By completing this assignment you will learn:

- How to store text data for processing using an RNN
- How to synthesize data, by sampling predictions at each time step and passing it to the next RNN-cell unit
- How to build a character-level text generation recurrent neural network
- Why clipping the gradients is important

We will begin by loading in some functions that we have provided for you in `rnn_utils`. Specifically, you have access to functions such as `rnn_forward` and `rnn_backward`, which are equivalent to those you’ve implemented in the previous assignment.

1 | import numpy as np |

## 1 - Problem Statement

### 1.1 - Dataset and Preprocessing

Run the following cell to read the dataset of dinosaur names, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size.

1 | data = open('dinos.txt', 'r').read() |

```
There are 19909 total characters and 27 unique characters in your data.
```

The characters are a-z (26 characters) plus the “\n” (or newline character), which in this assignment plays a role similar to the `<EOS>` (or “End of sentence”) token we discussed in lecture, only here it indicates the end of a dinosaur name rather than the end of a sentence. In the cell below, we create a Python dictionary (i.e., a hash table) to map each character to an index from 0-26. We also create a second Python dictionary that maps each index back to the corresponding character. This will help you figure out which index corresponds to which character in the probability distribution output of the softmax layer. Below, `char_to_ix` and `ix_to_char` are the Python dictionaries.

1 | char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) } |

```
{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
```
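As a minimal sketch of the two mappings, the snippet below builds them from a tiny stand-in string. The sample names are a hypothetical substitute for the contents of `dinos.txt`, so the resulting vocabulary is smaller than the 27 characters reported above.

```python
# Hypothetical stand-in for the text read from dinos.txt (not the real file).
data = "aachenosaurus\naardonyx\nabelisaurus\n".lower()

# Sorted list of the unique characters; "\n" sorts before the letters.
chars = sorted(set(data))

# Forward map (character -> index) and inverse map (index -> character).
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"{len(data)} total characters, {len(chars)} unique characters")
```

Because the newline character sorts before every letter, it receives index 0, which matches the dictionary printed above.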

### 1.2 - Overview of the model

Your model will have the following structure:

- Initialize parameters
- Run the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients of the loss function with respect to the parameters
    - Clip the gradients to avoid exploding gradients
    - Using the gradients, update your parameters with the gradient descent update rule
- Return the learned parameters

At each time-step, the RNN tries to predict the next character given the previous characters. The dataset $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \ldots, x^{\langle T_x \rangle})$ is a list of characters in the training set, while $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \ldots, y^{\langle T_x \rangle})$ is such that at every time-step $t$, we have $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.

## 2 - Building blocks of the model

In this part, you will build two important blocks of the overall model:

- Gradient clipping: to avoid exploding gradients
- Sampling: a technique used to generate characters

You will then apply these two functions to build the model.

### 2.1 - Clipping the gradients in the optimization loop

In this section you will implement the `clip` function that you will call inside your optimization loop. Recall that your overall loop structure usually consists of a forward pass, a cost computation, a backward pass, and a parameter update. Before updating the parameters, you will perform gradient clipping when needed to make sure that your gradients are not “exploding,” meaning taking on overly large values.

In the exercise below, you will implement a function `clip` that takes in a dictionary of gradients and returns a clipped version of the gradients if needed. There are different ways to clip gradients; we will use a simple element-wise clipping procedure, in which every element of the gradient vector is clipped to lie within some range [-N, N]. More generally, you will provide a `maxValue` (say 10). In this example, if any component of the gradient vector is greater than 10, it would be set to 10; and if any component of the gradient vector is less than -10, it would be set to -10. If it is between -10 and 10, it is left alone.
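The element-wise rule described above can be sketched with `np.clip`. This is an illustration under assumptions, not the graded solution: the function name `clip_sketch` and the toy gradient values are made up for the example.

```python
import numpy as np

def clip_sketch(gradients, max_value):
    """Clip every array in `gradients` in place to [-max_value, max_value]."""
    for grad in gradients.values():
        # out=grad writes the result back into the same array.
        np.clip(grad, -max_value, max_value, out=grad)
    return gradients

# Toy gradients (hypothetical values) with a few entries outside [-10, 10].
grads = {
    "dWaa": np.array([[12.0, -3.0], [0.5, -15.0]]),
    "db": np.array([[10.5], [-0.2]]),
}
clipped = clip_sketch(grads, 10)
print(clipped["dWaa"])
print(clipped["db"])
```

Entries already inside the range are untouched; only 12.0, -15.0, and 10.5 are moved to the nearest bound.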

**Exercise**: Implement the function below to return the clipped gradients of your dictionary `gradients`. Your function takes in a maximum threshold and returns the clipped versions of your gradients. You can check out this hint for examples of how to clip in numpy. You will need to use the argument `out = ...`.

1 | ### GRADED FUNCTION: clip |

1 | np.random.seed(3) |

```
gradients["dWaa"][1][2] = 10.0
gradients["dWax"][3][1] = -10.0
gradients["dWya"][1][2] = 0.29713815361
gradients["db"][4] = [ 10.]
gradients["dby"][1] = [ 8.45833407]
```

**Expected output:**

| Quantity | Value |
| --- | --- |
| **gradients["dWaa"][1][2]** | 10.0 |
| **gradients["dWax"][3][1]** | -10.0 |
| **gradients["dWya"][1][2]** | 0.29713815361 |
| **gradients["db"][4]** | [ 10.] |
| **gradients["dby"][1]** | [ 8.45833407] |

### 2.2 - Sampling

Now assume that your model is trained. You would like to generate new text (characters). The process of generation is explained in the picture below:

**Exercise**: Implement the `sample` function below to sample characters. You need to carry out 4 steps:

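**Step 1**: Pass the network the first “dummy” input $x^{\langle 1 \rangle} = \vec{0}$ (the vector of zeros). This is the default input before we’ve generated any characters. We also set $a^{\langle 0 \rangle} = \vec{0}$.

**Step 2**: Run one step of forward propagation to get $a^{\langle 1 \rangle}$ and $\hat{y}^{\langle 1 \rangle}$. Here are the equations:

$$a^{\langle t+1 \rangle} = \tanh(W_{ax} x^{\langle t+1 \rangle} + W_{aa} a^{\langle t \rangle} + b)$$

$$z^{\langle t+1 \rangle} = W_{ya} a^{\langle t+1 \rangle} + b_y$$

$$\hat{y}^{\langle t+1 \rangle} = \mathrm{softmax}(z^{\langle t+1 \rangle})$$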

Note that $\hat{y}^{\langle t+1 \rangle }$ is a (softmax) probability vector (its entries are between 0 and 1 and sum to 1). $\hat{y}^{\langle t+1 \rangle}_i$ represents the probability that the character indexed by “i” is the next character. We have provided a `softmax()` function that you can use.
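A single forward step can be sketched as follows, assuming a basic RNN cell with a tanh activation. The parameter shapes, the hidden size `n_a = 5`, and the helper `softmax_sketch` (a stand-in for the provided `softmax()`) are assumptions made for this illustration.

```python
import numpy as np

def softmax_sketch(z):
    """Numerically stable softmax over a column vector (stand-in for softmax())."""
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

rng = np.random.default_rng(0)
n_a, vocab_size = 5, 27  # hypothetical hidden size; 27 matches the vocabulary

# Randomly initialized weights and zero biases (illustrative values only).
Wax = rng.standard_normal((n_a, vocab_size)) * 0.01
Waa = rng.standard_normal((n_a, n_a)) * 0.01
Wya = rng.standard_normal((vocab_size, n_a)) * 0.01
b = np.zeros((n_a, 1))
by = np.zeros((vocab_size, 1))

# Step 1: dummy input and initial hidden state, both all zeros.
x = np.zeros((vocab_size, 1))
a_prev = np.zeros((n_a, 1))

# Step 2: one step of forward propagation.
a = np.tanh(Wax @ x + Waa @ a_prev + b)
y_hat = softmax_sketch(Wya @ a + by)

print(y_hat.shape, float(y_hat.sum()))
```

With zero inputs and zero biases, the hidden state stays at zero and the first distribution is uniform over the 27 characters; a trained model would have non-zero biases and produce a more informative distribution.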

**Step 3**: Carry out sampling: Pick the next character’s index according to the probability distribution specified by $\hat{y}^{\langle t+1 \rangle }$. This means that if $\hat{y}^{\langle t+1 \rangle }_i = 0.16$, you will pick the index “i” with 16% probability. To implement it, you can use `np.random.choice`.

Here is an example of how to use `np.random.choice()`:

1 | np.random.seed(0) |

This means that you will pick the `index` according to the distribution:

$P(index = 0) = 0.1, P(index = 1) = 0.0, P(index = 2) = 0.7, P(index = 3) = 0.2$.
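The distribution above can be sampled as in the following sketch. The variable names `p`, `draws`, and `freq` are chosen for this illustration; the second half draws many samples only to show that the empirical frequencies approach the specified probabilities.

```python
import numpy as np

np.random.seed(0)

# Probabilities for indices 0, 1, 2, 3 (must sum to 1).
p = np.array([0.1, 0.0, 0.7, 0.2])

# Draw a single index according to p.
index = np.random.choice(len(p), p=p)

# Draw many indices and compare the empirical frequencies with p.
draws = np.random.choice(len(p), size=10_000, p=p)
freq = np.bincount(draws, minlength=len(p)) / draws.size
print(index, freq)
```

Index 1 has probability 0.0, so it is never drawn, while index 2 appears in roughly 70% of the draws.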

**Step 4**: The last step to implement in `sample()` is to overwrite the variable `x`, which currently stores $x^{\langle t \rangle }$, with the value of $x^{\langle t + 1 \rangle }$. You will represent $x^{\langle t + 1 \rangle }$ by creating a one-hot vector corresponding to the character you’ve chosen as your prediction. You will then forward propagate $x^{\langle t + 1 \rangle }$ in Step 1 and keep repeating the process until you get a “\n” character, indicating you’ve reached the end of the dinosaur name.
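Putting the four steps together, a sampling loop might look like the sketch below. It is not the graded `sample()` function: the name `sample_sketch`, the `max_len` cap (added here as an assumption so that an untrained model cannot loop forever), and the random parameters are all illustrative.

```python
import numpy as np

def sample_sketch(params, newline_ix, vocab_size, max_len=50, seed=0):
    """Sample character indices until newline_ix is drawn or max_len is reached."""
    rng = np.random.default_rng(seed)
    Wax, Waa, Wya, b, by = (params[k] for k in ("Wax", "Waa", "Wya", "b", "by"))

    x = np.zeros((vocab_size, 1))    # Step 1: dummy input
    a = np.zeros((Waa.shape[0], 1))  # Step 1: initial hidden state
    indices = []

    while len(indices) < max_len:
        # Step 2: one forward step, then a numerically stable softmax.
        a = np.tanh(Wax @ x + Waa @ a + b)
        z = Wya @ a + by
        p = np.exp(z - z.max())
        p /= p.sum()

        # Step 3: sample the next index from the predicted distribution.
        idx = int(rng.choice(vocab_size, p=p.ravel()))
        indices.append(idx)
        if idx == newline_ix:
            break

        # Step 4: the sampled character becomes the next one-hot input.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
    return indices

# Hypothetical random parameters: hidden size 8, vocabulary size 27.
init = np.random.default_rng(1)
n_a, vocab_size = 8, 27
params = {
    "Wax": init.standard_normal((n_a, vocab_size)),
    "Waa": init.standard_normal((n_a, n_a)),
    "Wya": init.standard_normal((vocab_size, n_a)),
    "b": init.standard_normal((n_a, 1)),
    "by": init.standard_normal((vocab_size, 1)),
}
indices = sample_sketch(params, newline_ix=0, vocab_size=vocab_size)
print(indices)
```

Building a fresh zero vector on every iteration keeps `x` strictly one-hot; reusing the previous `x` without resetting it would leave earlier characters switched on.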

1 | # GRADED FUNCTION: sample |

1 | np.random.seed(2) |

```
Sampling:
list of sampled indices: [12, 17, 24, 14, 13, 9, 10, 22, 24, 6, 13, 11, 12, 6, 21, 15, 21, 14, 3, 2, 1, 21, 18, 24, 7, 25, 6, 25, 18, 10, 16, 2, 3, 8, 15, 12, 11, 7, 1, 12, 10, 2, 7, 7, 11, 5, 6, 12, 25, 0, 0]
list of sampled characters: ['l', 'q', 'x', 'n', 'm', 'i', 'j', 'v', 'x', 'f', 'm', 'k', 'l', 'f', 'u', 'o', 'u', 'n', 'c', 'b', 'a', 'u', 'r', 'x', 'g', 'y', 'f', 'y', 'r', 'j', 'p', 'b', 'c', 'h', 'o', 'l', 'k', 'g', 'a', 'l', 'j', 'b', 'g', 'g', 'k', 'e', 'f', 'l', 'y', '\n', '\n']
```

**Expected output:**

| Quantity | Value |
| --- | --- |
| **list of sampled indices:** | [12, 17, 24, 14, 13, 9, 10, 22, 24, 6, 13, 11, 12, 6, 21, 15, 21, 14, 3, 2, 1, 21, 18, 24, 7, 25, 6, 25, 18, 10, 16, 2, 3, 8, 15, 12, 11, 7, 1, 12, 10, 2, 7, 7, 11, 5, 6, 12, 25, 0, 0] |
| **list of sampled characters:** | ['l', 'q', 'x', 'n', 'm', 'i', 'j', 'v', 'x', 'f', 'm', 'k', 'l', 'f', 'u', 'o', 'u', 'n', 'c', 'b', 'a', 'u', 'r', 'x', 'g', 'y', 'f', 'y', 'r', 'j', 'p', 'b', 'c', 'h', 'o', 'l', 'k', 'g', 'a', 'l', 'j', 'b', 'g', 'g', 'k', 'e', 'f', 'l', 'y', '\n', '\n'] |

## 3 - Building the language model

It is time to build the character-level language model for text generation.

### 3.1 - Gradient descent

In this section you will implement a function performing one step of stochastic gradient descent (with clipped gradients). You will go through the training examples one at a time, so the optimization algorithm will be stochastic gradient descent. As a reminder, here are the steps of a common optimization loop for an RNN:

- Forward propagate through the RNN to compute the loss
- Backward propagate through time to compute the gradients of the loss with respect to the parameters
- Clip the gradients if necessary
- Update your parameters using gradient descent
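The last two steps of this loop (clip, then update) can be sketched on their own. The function name `sgd_update_sketch`, the learning rate, and the clipping threshold are hypothetical; the sketch only assumes that each parameter `name` has a matching gradient stored under `"d" + name`, as in the `gradients` dictionary used above.

```python
import numpy as np

def sgd_update_sketch(parameters, gradients, learning_rate=0.01, max_value=5):
    """Clip each gradient element-wise, then take one gradient descent step."""
    for name, value in parameters.items():
        grad = np.clip(gradients["d" + name], -max_value, max_value)
        value -= learning_rate * grad  # in-place update of the parameter array
    return parameters

# Toy example with a single bias vector (hypothetical values).
params = {"b": np.array([[1.0], [1.0]])}
grads = {"db": np.array([[100.0], [-2.0]])}
sgd_update_sketch(params, grads, learning_rate=0.1, max_value=5)
print(params["b"])
```

The exploding gradient of 100.0 is clipped to 5 before the update, so the first entry moves by 0.5 instead of 10.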

**Exercise**: Implement this optimization process (one step of stochastic gradient descent).

We provide you with the following functions:

1 | def rnn_forward(X, Y, a_prev, parameters): |

1 | # GRADED FUNCTION: optimize |

1 | np.random.seed(1) |

```
Loss = 126.503975722
gradients["dWaa"][1][2] = 0.194709315347
np.argmax(gradients["dWax"]) = 93
gradients["dWya"][1][2] = -0.007773876032
gradients["db"][4] = [-0.06809825]
gradients["dby"][1] = [ 0.01538192]
a_last[4] = [-1.]
```

**Expected output:**

| Quantity | Value |
| --- | --- |
| **Loss** | 126.503975722 |
| **gradients["dWaa"][1][2]** | 0.194709315347 |
| **np.argmax(gradients["dWax"])** | 93 |
| **gradients["dWya"][1][2]** | -0.007773876032 |
| **gradients["db"][4]** | [-0.06809825] |
| **gradients["dby"][1]** | [ 0.01538192] |
| **a_last[4]** | [-1.] |

### 3.2 - Training the model

Given the dataset of dinosaur names, we use each line of the dataset (one name) as one training example. Every 100 steps of stochastic gradient descent, you will sample 10 randomly chosen names to see how the algorithm is doing. Remember to shuffle the dataset, so that stochastic gradient descent visits the examples in random order.

**Exercise**: Follow the instructions and implement `model()`. When `examples[index]` contains one dinosaur name (string), to create an example (X, Y), you can use this:

1 | index = j % len(examples) |

Note that we use `index = j % len(examples)`, where `j = 1....num_iterations`, to make sure that `examples[index]` is always a valid statement (`index` is smaller than `len(examples)`).

The first entry of `X` being `None` will be interpreted by `rnn_forward()` as setting $x^{\langle 0 \rangle} = \vec{0}$. Further, this ensures that `Y` is equal to `X` but shifted one step to the left, and with an additional “\n” appended to signify the end of the dinosaur name.
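The construction described here can be sketched as follows. The three example names and the helper `make_example` are hypothetical stand-ins; the sketch only illustrates the `None` placeholder, the one-step shift between `X` and `Y`, and the modulo indexing.

```python
import random

# Hypothetical stand-in for the list of dinosaur names, one per line of the dataset.
examples = ["trex", "raptor", "stego"]
chars = sorted(set("".join(examples)) | {"\n"})
char_to_ix = {ch: i for i, ch in enumerate(chars)}

# Shuffle so that stochastic gradient descent visits names in random order.
random.seed(0)
random.shuffle(examples)

def make_example(j):
    """Build (X, Y) for iteration j, wrapping around the dataset with modulo."""
    index = j % len(examples)
    # None stands for the all-zero input fed at the first time step.
    X = [None] + [char_to_ix[ch] for ch in examples[index]]
    # Y is X shifted one step to the left, with the newline index appended.
    Y = X[1:] + [char_to_ix["\n"]]
    return X, Y

X, Y = make_example(4)  # 4 % 3 == 1, so this uses examples[1]
print(X, Y)
```

`X` and `Y` have the same length, and at every position `Y` holds the character that follows the corresponding entry of `X`.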

1 | # GRADED FUNCTION: model |

Run the following cell. You should observe your model outputting random-looking characters at the first iteration. After a few thousand iterations, your model should learn to generate reasonable-looking names.

1 | parameters = model(data, ix_to_char, char_to_ix) |

```
Iteration: 0, Loss: 23.087336
Nkzxwtdmfqoeyhsqwasjkjvu
Kneb
Kzxwtdmfqoeyhsqwasjkjvu
Neb
Zxwtdmfqoeyhsqwasjkjvu
Eb
Xwtdmfqoeyhsqwasjkjvu
Iteration: 2000, Loss: 27.884160
Liusskeomnolxeros
Hmdaairus
Hytroligoraurus
Lecalosapaus
Xusicikoraurus
Abalpsamantisaurus
Tpraneronxeros
Iteration: 4000, Loss: 25.901815
Mivrosaurus
Inee
Ivtroplisaurus
Mbaaisaurus
Wusichisaurus
Cabaselachus
Toraperlethosdarenitochusthiamamumamaon
Iteration: 6000, Loss: 24.608779
Onwusceomosaurus
Lieeaerosaurus
Lxussaurus
Oma
Xusteonosaurus
Eeahosaurus
Toreonosaurus
Iteration: 8000, Loss: 24.070350
Onxusichepriuon
Kilabersaurus
Lutrodon
Omaaerosaurus
Xutrcheps
Edaksoje
Trodiktonus
Iteration: 10000, Loss: 23.844446
Onyusaurus
Klecalosaurus
Lustodon
Ola
Xusodonia
Eeaeosaurus
Troceosaurus
Iteration: 12000, Loss: 23.291971
Onyxosaurus
Kica
Lustrepiosaurus
Olaagrraiansaurus
Yuspangosaurus
Eealosaurus
Trognesaurus
Iteration: 14000, Loss: 23.382339
Meutromodromurus
Inda
Iutroinatorsaurus
Maca
Yusteratoptititan
Ca
Troclosaurus
Iteration: 16000, Loss: 23.288447
Meuspsangosaurus
Ingaa
Iusosaurus
Macalosaurus
Yushanis
Daalosaurus
Trpandon
Iteration: 18000, Loss: 22.823526
Phytrolonhonyg
Mela
Mustrerasaurus
Peg
Ytronorosaurus
Ehalosaurus
Trolomeehus
Iteration: 20000, Loss: 23.041871
Nousmofonosaurus
Loma
Lytrognatiasaurus
Ngaa
Ytroenetiaudostarmilus
Eiafosaurus
Troenchulunosaurus
Iteration: 22000, Loss: 22.728849
Piutyrangosaurus
Midaa
Myroranisaurus
Pedadosaurus
Ytrodon
Eiadosaurus
Trodoniomusitocorces
Iteration: 24000, Loss: 22.683403
Meutromeisaurus
Indeceratlapsaurus
Jurosaurus
Ndaa
Yusicheropterus
Eiaeropectus
Trodonasaurus
Iteration: 26000, Loss: 22.554523
Phyusaurus
Liceceron
Lyusichenodylus
Pegahus
Yustenhtonthosaurus
Elagosaurus
Trodontonsaurus
Iteration: 28000, Loss: 22.484472
Onyutimaerihus
Koia
Lytusaurus
Ola
Ytroheltorus
Eiadosaurus
Trofiashates
Iteration: 30000, Loss: 22.774404
Phytys
Lica
Lysus
Pacalosaurus
Ytrochisaurus
Eiacosaurus
Trochesaurus
Iteration: 32000, Loss: 22.209473
Mawusaurus
Jica
Lustoia
Macaisaurus
Yusolenqtesaurus
Eeaeosaurus
Trnanatrax
Iteration: 34000, Loss: 22.396744
Mavptokekus
Ilabaisaurus
Itosaurus
Macaesaurus
Yrosaurus
Eiaeosaurus
Trodon
```

## Conclusion

You can see that your algorithm has started to generate plausible dinosaur names towards the end of the training. At first, it was generating random characters, but towards the end you could see dinosaur names with cool endings. Feel free to run the algorithm even longer and play with hyperparameters to see if you can get even better results. Our implementation generated some really cool names like `maconucon`, `marloralus` and `macingsersaurus`. Your model hopefully also learned that dinosaur names tend to end in `saurus`, `don`, `aura`, `tor`, etc.

If your model generates some non-cool names, don’t blame the model entirely; not all actual dinosaur names sound cool. (For example, `dromaeosauroides` is an actual dinosaur name and is in the training set.) But this model should give you a set of candidates from which you can pick the coolest!

This assignment used a relatively small dataset, so that you could train an RNN quickly on a CPU. Training a model of the English language requires a much bigger dataset, usually needs much more computation, and could run for many hours on GPUs. We ran our dinosaur name model for quite some time, and so far our favorite name is the great, undefeatable, and fierce: Mangosaurus!

## 4 - Writing like Shakespeare

The rest of this notebook is optional and is not graded, but we hope you’ll do it anyway since it’s quite fun and informative.

A similar (but more complicated) task is to generate Shakespeare poems. Instead of learning from a dataset of dinosaur names, you can use a collection of Shakespearian poems. Using LSTM cells, you can learn longer-term dependencies that span many characters in the text, e.g., where a character appearing somewhere in a sequence can influence what a different character should be much later in this sequence. These long-term dependencies were less important with dinosaur names, since the names were quite short.

We have implemented a Shakespeare poem generator with Keras. Run the following cell to load the required packages and models. This may take a few minutes.

1 | from __future__ import print_function |

To save you some time, we have already trained a model for ~1000 epochs on a collection of Shakespearian poems called *“The Sonnets”*.

Let’s train the model for one more epoch. When it finishes training for an epoch (this will also take a few minutes), you can run `generate_output`, which will prompt you for an input (fewer than 40 characters). The poem will start with your sentence, and our RNN-Shakespeare will complete the rest of the poem for you! For example, try “Forsooth this maketh no sense “ (don’t enter the quotation marks). Depending on whether you include the space at the end, your results might also differ, so try it both ways, and try other inputs as well.

1 | print_callback = LambdaCallback(on_epoch_end=on_epoch_end) |

1 | # Run this cell to try with different inputs without having to re-train the model |

The RNN-Shakespeare model is very similar to the one you have built for dinosaur names. The only major differences are:

- LSTMs instead of the basic RNN to capture longer-range dependencies
- The model is a deeper, stacked LSTM model (2 layers)
- Using Keras instead of plain Python to simplify the code

If you want to learn more, you can also check out the Keras Team’s text generation implementation on GitHub: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py.

Congratulations on finishing this notebook!

**References**:

- This exercise took inspiration from Andrej Karpathy’s implementation: https://gist.github.com/karpathy/d4dee566867f8291f086. To learn more about text generation, also check out Karpathy’s blog post.
- For the Shakespearian poem generator, our implementation was based on the implementation of an LSTM text generator by the Keras team: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py
