## General definition of attention

Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.

• We sometimes say that the query attends to the values.
• For example, in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states
75 (values).
• The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).

## How to do attention

1. We have some values $h1$,$\cdots$,$h_N$ $\in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$
2. Computing the attention scores (multiple ways to do this)
3. Taking softmax to get attention distribution $\alpha$
4. Using attention distribution to take weighted sum of values:thus obtaining the attention output a (sometimes called the context vector)

## Bidirectional RNNs

picture from lecture notes of cs224n

Bidirectional RNNs fix this problem by traversing a sequence in both directions and concatenating the resulting outputs (both cell outputs and final hidden states). For every RNN cell, we simply add another cell but feed inputs to it in the opposite direction; the output $O_t$ corresponding to the $t\prime$ word is the concatenated vector $\left [ o_t^{(f)}, o_t^{(b)} \right ]$ where $o_t^{(f)}$ is the output of the forward-direction RNN on word t and $o_t^{(b)}$ is the corresponding output from the reverse-direction RNN. Similarly, the final hidden state is $h = \left [ h^{(f)}, h^{(b)} \right ]$.

## Seq2Seq

Sequence-to-sequence, or “Seq2Seq”, is a relatively new paradigm,with its first published usage in 2014 for English-French translation. At a high level, a sequence-to-sequence model is an end-to-end model made up of two recurrent neural networks:
Sutskever et al. 2014, “Sequence to Sequence Learning with Neural Networks”

• an encoder, which takes the model’s input sequence as input and encodes it into a fixed-size “context vector”
• a decoder, which uses the context vector from above as a “seed” from which to generate an output sequence.
For this reason, Seq2Seq models are often referred to as “encoder- decoder models.” We’ll look at the details of these two networks separately.

### Seq2Seq architecture - encoder

Encoder RNN produces an encoding of the source sentence.

picture from lecture notes of cs224n

The encoder network’s job is to read the input sequence to our Seq2Seq model and generate a fixed-dimensional context vector C for the sequence. To do so, the encoder will use a recurrent neural network cell – usually an LSTM – to read the input tokens one at a time. The final hidden state of the cell will then become C. However, because it’s so difficult to compress an arbitrary-length sequence into a single fixed-size vector (especially for difficult tasks like transla- tion), the encoder will usually consist of stacked LSTMs: a series of LSTM “layers” where each layer’s outputs are the input sequence to the next layer. The final layer’s LSTM hidden state will be used as C.

Seq2Seq encoders will often do something strange: they will pro- cess the input sequence in reverse. This is actually done on purpose. The idea is that, by doing this, the last thing that the encoder sees will (roughly) corresponds to the first thing that the model outputs; this makes it easier for the decoder to “get started” on the output, which makes then gives the decoder an easier time generating a proper output sentence. In the context of translation, we’re allowing the network to translate the first few words of the input as soon as it sees them; once it has the first few words translated correctly, it’s much easier to go on to construct a correct sentence than it is to do so from scratch.

### Seq2Seq architecture - decoder

Decoder RNN is a Language Model that generates target sentence, conditioned on encoding.

picture from lecture notes of cs224n

The decoder is also an LSTM network, but its usage is a little more complex than the encoder network. Essentially, we’d like to use it as a language model that’s “aware” of the words that it’s generated so far and of the input. To that end, we’ll keep the “stacked” LSTM architecture from the encoder, but we’ll initialize the hidden state of our first layer with the context vector from above; the decoder will literally use the context of the input to generate an output.

Once the decoder is set up with its context, we’ll pass in a special token to signify the start of output generation; in literature, this is usually an token appended to the end of the input (there’s also one at the end of the output). Then, we’ll run all three layers of LSTM, one after the other, following up with a softmax on the final layer’s output to generate the first output word. Then, we pass that word into the first layer, and repeat the generation. This is how we get the LSTMs to act like a language model. See Fig. 2 for an example of a decoder network.

Once we have the output sequence, we use the same learning strat- egy as usual. We define a loss, the cross entropy on the prediction sequence, and we minimize it with a gradient descent algorithm and back-propagation. Both the encoder and decoder are trained at the same time, so that they both learn the same context vector represen- tation.

## Training a Neural Machine Translation system

picture from lecture notes of cs224n

At each time step, we pick the most probable token. In other words

This technique is efficient and natural, however it explores a small part of the search space and if we make a mistake at one time step, the rest of the sentence could be heavily impacted.

### Beam search decoding

picture from lecture notes of cs224n

the idea is to maintain K candidates at each time step.

and compute $H_{t+1}$ by expanding $H_t$ and keeping the best K candi- dates. In other words, we pick the best K sequence in the following set

where

As we increase K, we gain precision and we are asymptotically exact. However, the improvement is not monotonic and we can set a K that combines reasonable performance and computational efficiency.

## CS224n Assignment4

In Machine Translation, our goal is to convert a sentence from the source language (e.g. Spanish) to the target language (e.g. English). In this assignment, we will implement a sequence-to-sequence (Seq2Seq) network with attention, to build a Neural Machine Translation (NMT) system. In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional LSTM Encoder and a Unidirectional LSTM Decoder.

picture from lecture notes of cs224n

### Encode

Given a sentence in the source language, we look up the word embeddings from an embeddings matrix, yielding $x_1,\cdots,x_m | x_i \in \mathbb{R}^{e x 1}$, where m is the length of the source sentence and e is the embedding size. We feed these embeddings to the bidirectional Encoder, yielding hidden states and cell states for both the forwards (->) and backwards (<-) LSTMs. The forwards and backwards versions are concatenated
to give hidden states $h_i^{enc}$ and cell states $c_i^{enc}$

We then initialize the Decoder’s first hidden state $h_0^{dec}$ and cell state $c_0^{dec}$ with a linear projection of the Encoder’s final hidden state and final cell state

### Decode

With the Decoder initialized, we must now feed it a matching sentence in the target language. On the $t^{th}$ step, we look up the embedding for the $t^{th}$ word, $y_t \in \mathbb{R}^{e x 1}$, we then concatenate $y_t$ with the combined-output vector $O_{t-1} \in \mathbb{R}^{h x 1}$ from the previous step to produce $\bar{y_t} \in \mathbb{R}^{(e+h) x 1}$. Note that for the first target word $O_0$ is zero-vector. We then fedd $\bar{y_t}$ as input to the Decoder LSTM.

We then use $h_t^{dec}$ to compute multiplicative attention ovev $h_t^{enc}, \cdots, h_m^{enc}$

We now concatenate the attention output $a_t$ with the decoder hidden state $h_t^{dec}$ and pass this through a linear layer, Tanh, and Dropout to attain the combined-output vector $o_t$

Then, we produce a probability distribution $P_t$ over target words at the $t^{th}$ timestep:

Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we then compute the softmax cross entropy loss between $P_t$ and $g_t$, where $g_t$ is the 1-hot vector of the target word at timestep t:

## reference

1. course slides and notes from cs224n (http://web.stanford.edu/class/cs224n/)
Donate article here