Attention

reference

course slides and notes from cs224n (http://web.stanford.edu/class/cs224n/)

General definition of attention

Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.

We sometimes say that the query attends to the values.
For example, in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states 75 (values).
- The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
- Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).

How to do attention

We have some values \(h1\),\(\cdots\),\(h_N\) \(\in \mathbb{R}^{d_1}\) and a query \(s \in \mathbb{R}^{d_2}\)
Computing the attention scores (multiple ways to do this) \[e \in \mathbb{R}^{N}\]
Taking softmax to get attention distribution \(\alpha\) \[\alpha = softmax(e) \in \mathbb{R}^{N}\]
Using attention distribution to take weighted sum of values: \[a = \sum_{i=1}^{N}\alpha_i h_i \in \mathbb{R}^{d_1}\] thus obtaining the attention output a (sometimes called the context vector)

Bidirectional RNNs

picture from lecture notes of cs224n

Bidirectional RNNs fix this problem by traversing a sequence in both directions and concatenating the resulting outputs (both cell outputs and final hidden states). For every RNN cell, we simply add another cell but feed inputs to it in the opposite direction; the output \(O_t\) corresponding to the \(t\prime\) word is the concatenated vector \(\left [ o_t^{(f)}, o_t^{(b)} \right ]\) where \(o_t^{(f)}\) is the output of the forward-direction RNN on word t and \(o_t^{(b)}\) is the corresponding output from the reverse-direction RNN. Similarly, the final hidden state is \(h = \left [ h^{(f)}, h^{(b)} \right ]\).

Seq2Seq

Sequence-to-sequence, or "Seq2Seq", is a relatively new paradigm,with its first published usage in 2014 for English-French translation. At a high level, a sequence-to-sequence model is an end-to-end model made up of two recurrent neural networks: Sutskever et al. 2014, "Sequence to Sequence Learning with Neural Networks" - an encoder, which takes the model’s input sequence as input and encodes it into a fixed-size "context vector" - a decoder, which uses the context vector from above as a "seed" from which to generate an output sequence. For this reason, Seq2Seq models are often referred to as "encoder- decoder models." We’ll look at the details of these two networks separately.

Seq2Seq architecture - encoder

Encoder RNN produces an encoding of the source sentence.

picture from lecture notes of cs224n

The encoder network’s job is to read the input sequence to our Seq2Seq model and generate a fixed-dimensional context vector C for the sequence. To do so, the encoder will use a recurrent neural network cell – usually an LSTM – to read the input tokens one at a time. The final hidden state of the cell will then become C. However, because it’s so difficult to compress an arbitrary-length sequence into a single fixed-size vector (especially for difficult tasks like transla- tion), the encoder will usually consist of stacked LSTMs: a series of LSTM "layers" where each layer’s outputs are the input sequence to the next layer. The final layer’s LSTM hidden state will be used as C.

Seq2Seq encoders will often do something strange: they will pro- cess the input sequence in reverse. This is actually done on purpose. The idea is that, by doing this, the last thing that the encoder sees will (roughly) corresponds to the first thing that the model outputs; this makes it easier for the decoder to "get started" on the output, which makes then gives the decoder an easier time generating a proper output sentence. In the context of translation, we’re allowing the network to translate the first few words of the input as soon as it sees them; once it has the first few words translated correctly, it’s much easier to go on to construct a correct sentence than it is to do so from scratch.

Seq2Seq architecture - decoder

Decoder RNN is a Language Model that generates target sentence, conditioned on encoding.

picture from lecture notes of cs224n

The decoder is also an LSTM network, but its usage is a little more complex than the encoder network. Essentially, we’d like to use it as a language model that’s "aware" of the words that it’s generated so far and of the input. To that end, we’ll keep the "stacked" LSTM architecture from the encoder, but we’ll initialize the hidden state of our first layer with the context vector from above; the decoder will literally use the context of the input to generate an output.

Once the decoder is set up with its context, we’ll pass in a special token to signify the start of output generation; in literature, this is usually an token appended to the end of the input (there’s also one at the end of the output). Then, we’ll run all three layers of LSTM, one after the other, following up with a softmax on the final layer’s output to generate the first output word. Then, we pass that word into the first layer, and repeat the generation. This is how we get the LSTMs to act like a language model. See Fig. 2 for an example of a decoder network.

Once we have the output sequence, we use the same learning strat- egy as usual. We define a loss, the cross entropy on the prediction sequence, and we minimize it with a gradient descent algorithm and back-propagation. Both the encoder and decoder are trained at the same time, so that they both learn the same context vector represen- tation.

Training a Neural Machine Translation system

picture from lecture notes of cs224n

Greedy Search

At each time step, we pick the most probable token. In other words \[x_t = argmax_{\tilde{x_t} \mathbb{P}(\tilde(x_t)| x_1, \cdots, x_t)}\]

This technique is efficient and natural, however it explores a small part of the search space and if we make a mistake at one time step, the rest of the sentence could be heavily impacted.

Beam search decoding

picture from lecture notes of cs224n

the idea is to maintain K candidates at each time step.

\[ H_t = \left\{ (x_1^{1}, \cdots, x_t^1), \cdots, (x_1^k, \cdots, x_t^k) \right\}\]

and compute \(H_{t+1}\) by expanding \(H_t\) and keeping the best K candi- dates. In other words, we pick the best K sequence in the following set

\[\tilde{H_{t+1}} = \cup_{k=1}^{k}H_{t+1}^{\tilde{k}}\]

where \[ \tilde{H_t} = \left\{ (x_1^{k}, \cdots, x_t^{k}, v_1), \cdots, (x_1^{k}, \cdots, x_t^{k}, V_{|v|}) \right\}\]

As we increase K, we gain precision and we are asymptotically exact. However, the improvement is not monotonic and we can set a K that combines reasonable performance and computational efficiency.

CS224n Assignment4

In Machine Translation, our goal is to convert a sentence from the source language (e.g. Spanish) to the target language (e.g. English). In this assignment, we will implement a sequence-to-sequence (Seq2Seq) network with attention, to build a Neural Machine Translation (NMT) system. In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional LSTM Encoder and a Unidirectional LSTM Decoder.

picture from lecture notes of cs224n

Initialize

def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
        """ Init NMT Model.

        @param embed_size (int): Embedding size (dimensionality)
        @param hidden_size (int): Hidden Size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        @param dropout_rate (float): Dropout probability, for attention
        """
        super(NMT, self).__init__()
        self.model_embeddings = ModelEmbeddings(embed_size, vocab)
        self.hidden_size = hidden_size
        self.dropout_rate = dropout_rate
        self.vocab = vocab

        # default values
        self.encoder = None 
        self.decoder = None
        self.h_projection = None
        self.c_projection = None
        self.att_projection = None
        self.combined_output_projection = None
        self.target_vocab_projection = None
        self.dropout = None


        ### YOUR CODE HERE (~8 Lines)
        ### TODO - Initialize the following variables:
        ###     self.encoder (Bidirectional LSTM with bias)
        ###     self.decoder (LSTM Cell with bias)
        ###     self.h_projection (Linear Layer with no bias), called W_{h} in the PDF.
        ###     self.c_projection (Linear Layer with no bias), called W_{c} in the PDF.
        ###     self.att_projection (Linear Layer with no bias), called W_{attProj} in the PDF.
        ###     self.combined_output_projection (Linear Layer with no bias), called W_{u} in the PDF.
        ###     self.target_vocab_projection (Linear Layer with no bias), called W_{vocab} in the PDF.
        ###     self.dropout (Dropout Layer)
        ###
        ### Use the following docs to properly initialize these variables:
        ###     LSTM:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
        ###     LSTM Cell:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
        ###     Linear Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
        ###     Dropout Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout

        self.encoder = nn.LSTM(embed_size, self.hidden_size, dropout=self.dropout_rate,bias = True, bidirectional = True)
        self.decoder = nn.LSTMCell(embed_size + self.hidden_size, self.hidden_size, bias = True)

        self.h_projection = nn.Linear(2 * self.hidden_size, self.hidden_size, bias = False)
        self.c_projection = nn.Linear(2 * self.hidden_size, self.hidden_size, bias = False)
        self.att_projection = nn.Linear(2 * self.hidden_size, self.hidden_size, bias = False)
        self.combined_output_projection = nn.Linear(3 * self.hidden_size, self.hidden_size, bias=False)
        self.target_vocab_projection = nn.Linear(self.hidden_size, self.model_embeddings.target.weight.shape[0])
        self.dropout = nn.Dropout(p = self.dropout_rate)

        ### END YOUR CODE

Encode

Given a sentence in the source language, we look up the word embeddings from an embeddings matrix, yielding \(x_1,\cdots,x_m | x_i \in \mathbb{R}^{e x 1}\), where m is the length of the source sentence and e is the embedding size. We feed these embeddings to the bidirectional Encoder, yielding hidden states and cell states for both the forwards (->) and backwards (<-) LSTMs. The forwards and backwards versions are concatenated to give hidden states \(h_i^{enc}\) and cell states \(c_i^{enc}\)

\[ \begin{align} & h_i^{enc} = \left [ \overleftarrow{h_i^{enc}}; \overrightarrow{h_i^{enc}} \right ] \qquad \text{where} \qquad h_i^{enc} \in \mathbb{R}^{2h x 1}, \overleftarrow{h_i^{enc}}, \overrightarrow{h_i^{enc}} \in \mathbb{R}^{h x 1} \qquad 1 \leq i \leq m \\ & c_i^{enc} = \left [ \overleftarrow{c_i^{enc}}; \overrightarrow{c_i^{enc}} \right ] \qquad \text{where} \qquad c_i^{enc} \in \mathbb{R}^{2h x 1}, \overleftarrow{c_i^{enc}}, \overrightarrow{c_i^{enc}} \in \mathbb{R}^{h x 1} \qquad 1 \leq i \leq m \\ \end{align} \]

We then initialize the Decoder’s first hidden state \(h_0^{dec}\) and cell state \(c_0^{dec}\) with a linear projection of the Encoder’s final hidden state and final cell state

\[ \begin{align} & h_0^{dec} = W_h \left [ \overleftarrow{h_1^{enc}}; \overrightarrow{h_m^{enc}} \right ] \qquad \text{where} \qquad h_0^{dec} \in \mathbb{R}^{h x 1}, W_h \in \mathbb{R}^{h x 2h} \\ & c_0^{dec} = W_h \left [ \overleftarrow{c_1^{enc}}; \overrightarrow{c_m^{enc}} \right ] \qquad \text{where} \qquad c_0^{dec} \in \mathbb{R}^{h x 1}, W_c \in \mathbb{R}^{h x 2h} \\ \end{align} \]

def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """ Apply the encoder to source sentences to obtain encoder hidden states.
            Additionally, take the final states of the encoder and project them to obtain initial states for decoder.

        @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b), where
                                        b = batch_size, src_len = maximum source sentence length. Note that 
                                       these have already been sorted in order of longest to shortest sentence.
        @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
        @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
                                        b = batch size, src_len = maximum source sentence length, h = hidden size.
        @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
                                                hidden state and cell.
        """
        enc_hiddens, dec_init_state = None, None

        ### YOUR CODE HERE (~ 8 Lines)
        ### TODO:
        ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
        ###         src_len = maximum source sentence length, b = batch size, e = embedding size. Note
        ###         that there is no initial hidden state or cell for the decoder.
        ###     2. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
        ###         - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
        ###         - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
        ###         - Note that the shape of the tensor returned by the encoder is (src_len b, h*2) and we want to
        ###           return a tensor of shape (b, src_len, h*2) as `enc_hiddens`.
        ###     3. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
        ###         - `init_decoder_hidden`:
        ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
        ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
        ###             This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size
        ###         - `init_decoder_cell`:
        ###             `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
        ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
        ###             Apply the c_projection layer to this in order to compute init_decoder_cell.
        ###             This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size
        ###
        ### See the following docs, as you may need to use some of the following functions in your implementation:
        ###     Pack the padded sequence X before passing to the encoder:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence
        ###     Pad the packed sequence, enc_hiddens, returned by the encoder:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_packed_sequence
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tensor Permute:
        ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute

        X = self.model_embeddings.source(source_padded)
        output, (h_enc, c_enc) = self.encoder(
            pack_padded_sequence(X, source_lengths))
        enc_hiddens,sequence_length = pad_packed_sequence(output, batch_first = True) # output of shape (batch, seq_len, num_directions * hidden_size)
        h_0_dec = self.h_projection(torch.cat((h_enc[0,:],h_enc[1,:]), 1))
        c_0_dec = self.c_projection(torch.cat((c_enc[0,:],c_enc[1,:]), 1))
        dec_init_state = (h_0_dec,c_0_dec)

        ### END YOUR CODE

        return enc_hiddens, dec_init_state

Decode

With the Decoder initialized, we must now feed it a matching sentence in the target language. On the \(t^{th}\) step, we look up the embedding for the \(t^{th}\) word, \(y_t \in \mathbb{R}^{e x 1}\), we then concatenate \(y_t\) with the combined-output vector \(O_{t-1} \in \mathbb{R}^{h x 1}\) from the previous step to produce \(\bar{y_t} \in \mathbb{R}^{(e+h) x 1}\). Note that for the first target word \(O_0\) is zero-vector. We then fedd \(\bar{y_t}\) as input to the Decoder LSTM.

\[h_t^{dec}, c_t^{dec} = Decoder(\bar{y_t},h_{t-1}^{dec}, c_{t-1}^{dec} ) \quad \text{where} \quad h_t^{dec} \in \mathbb{R}^{h x 1}\]

We then use \(h_t^{dec}\) to compute multiplicative attention ovev \(h_t^{enc}, \cdots, h_m^{enc}\)

\[\begin{align} & e_{t_i} = (h_t^{dec})^{T}W_{attProj}h_i^{enc} \quad \text{where} \quad e_t \in \mathbb{R}^{m x 1}, W_{attProj} \in \mathbb{R}^{h x 2h} \\ & \alpha_{t} = Softmax(e_t) \quad \text{where} \quad \alpha_t \in \mathbb{R}^{m x 1} \\ & a_t = \sum_i^{m} \alpha_{t,i}h_i^{enc} \quad \text{where} \quad a_t \in \mathbb{R}^{2h x 1}\\ \end{align} \]

We now concatenate the attention output \(a_t\) with the decoder hidden state \(h_t^{dec}\) and pass this through a linear layer, Tanh, and Dropout to attain the combined-output vector \(o_t\)

\[\begin{align} & u_t = \left[ a_t; h_t^{dec} \right ] \quad \text{where} \quad u_t \in \mathbb{R}^{3h x 1} \\ & v_t = W_u u_t \quad \text{where} \quad v_t \in \mathbb{R}^{h x 1}, W_u \in \mathbb{R}^{h x 1} \\ & O_t = Dropout(Tanh(v_t)) \quad \text{where} \quad o_t \in \mathbb{R}^{h x 1} \\ \end{align}\]

Then, we produce a probability distribution \(P_t\) over target words at the \(t^{th}\) timestep: \[P_t = Softmax(W_{vocab}O_t) \quad \text{where} \quad P_t \in \mathbb{R}^{v_t x h} \]

Here, \(V_t\) is the size of the target vocabulary. Finally, to train the network we then compute the softmax cross entropy loss between \(P_t\) and \(g_t\), where \(g_t\) is the 1-hot vector of the target word at timestep t:

\[J(\theta) = CE(P_t, g_t)\]

def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
                dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
        """Compute combined output vectors for a batch.

        @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
                                     b = batch size, src_len = maximum source sentence length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
                                     b = batch size, src_len = maximum source sentence length.
        @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
        @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
                                       tgt_len = maximum target sentence length, b = batch size. 

        @returns combined_outputs (Tensor): combined output tensor  (tgt_len, b,  h), where
                                        tgt_len = maximum target sentence length, b = batch_size,  h = hidden size
        """
        # Chop of the <END> token for max length sentences.
        target_padded = target_padded[:-1]

        # Initialize the decoder state (hidden and cell)
        dec_state = dec_init_state

        # Initialize previous combined output vector o_{t-1} as zero
        batch_size = enc_hiddens.size(0)
        o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

        # Initialize a list we will use to collect the combined output o_t on each step
        combined_outputs = []

        ### YOUR CODE HERE (~9 Lines)
        ### TODO:
        ###     1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
        ###         which should be shape (b, src_len, h),
        ###         where b = batch size, src_len = maximum source length, h = hidden size.
        ###         This is applying W_{attProj} to h^enc, as described in the PDF.
        ###     2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
        ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
        ###     3. Use the torch.split function to iterate over the time dimension of Y.
        ###         Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
        ###             - Squeeze Y_t into a tensor of dimension (b, e). 
        ###             - Construct Ybar_t by concatenating Y_t with o_prev.
        ###             - Use the step function to compute the the Decoder's next (cell, state) values
        ###               as well as the new combined output o_t.
        ###             - Append o_t to combined_outputs
        ###             - Update o_prev to the new o_t.
        ###     4. Use torch.stack to convert combined_outputs from a list length tgt_len of
        ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
        ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
        ###
        ### Note:
        ###    - When using the squeeze() function make sure to specify the dimension you want to squeeze
        ###      over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
        ###   
        ### Use the following docs to implement this functionality:
        ###     Zeros Tensor:
        ###         https://pytorch.org/docs/stable/torch.html#torch.zeros
        ###     Tensor Splitting (iteration):
        ###         https://pytorch.org/docs/stable/torch.html#torch.split
        ###     Tensor Dimension Squeezing:
        ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tensor Stacking:
        ###         https://pytorch.org/docs/stable/torch.html#torch.stack

        #   (b, src_len, h*2) * [2h , h]  = (b, src_len, h)
        enc_hiddens_proj = self.att_projection(enc_hiddens)
        #   (tgt_len, b, e)
        Y = self.model_embeddings.target(target_padded)

        for Y_t in torch.split(Y, split_size_or_sections = 1, dim = 0):
            squeezed_Y_t = torch.squeeze(Y_t) # (b, e) + (b,h) = (b,e+h)
            Ybar_t = torch.cat((o_prev,squeezed_Y_t), dim = 1)
            dec_state, o_t, _ = self.step(Ybar_t,dec_state,enc_hiddens,enc_hiddens_proj,enc_masks)
            combined_outputs.append(o_t)
            o_prev = o_t

        #  (b, h) -> (tgt_len, b, h)
        combined_outputs = torch.stack(combined_outputs,dim = 0)


        ### END YOUR CODE

        return combined_outputs


    def step(self, Ybar_t: torch.Tensor,
            dec_state: Tuple[torch.Tensor, torch.Tensor],
            enc_hiddens: torch.Tensor,
            enc_hiddens_proj: torch.Tensor,
            enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
        """ Compute one forward step of the LSTM decoder, including the attention computation.

        @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
                                where b = batch size, e = embedding size, h = hidden size.
        @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
        @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
                                    src_len = maximum source length, h = hidden size.
        @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
                                    where b = batch size, src_len = maximum source length, h = hidden size.
        @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
                                    where b = batch size, src_len is maximum source length. 

        @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
                First tensor is decoder's new hidden state, second tensor is decoder's new cell.
        @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
        @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
                                Note: You will not use this outside of this function.
                                      We are simply returning this value so that we can sanity check
                                      your implementation.
        """

        combined_output = None

        ### YOUR CODE HERE (~3 Lines)
        ### TODO:
        ###     1. Apply the decoder to `Ybar_t` and `dec_state`to obtain the new dec_state.
        ###     2. Split dec_state into its two parts (dec_hidden, dec_cell)
        ###     3. Compute the attention scores e_t, a Tensor shape (b, src_len). 
        ###        Note: b = batch_size, src_len = maximum source length, h = hidden size.
        ###
        ###       Hints:
        ###         - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
        ###         - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
        ###         - Use batched matrix multiplication (torch.bmm) to compute e_t.
        ###         - To get the tensors into the right shapes for bmm, you will need to do some squeezing and unsqueezing.
        ###         - When using the squeeze() function make sure to specify the dimension you want to squeeze
        ###             over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
        ###
        ### Use the following docs to implement this functionality:
        ###     Batch Multiplication:
        ###        https://pytorch.org/docs/stable/torch.html#torch.bmm
        ###     Tensor Unsqueeze:
        ###         https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
        ###     Tensor Squeeze:
        ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze

        dec_state = self.decoder(Ybar_t, dec_state)
        h_t_dec, c_t_dec = dec_state
        #  enc_hiddens_proj(b, src_len, h) * h_t_dec (b,h,1) = (b,src_len)
        e_t = torch.squeeze(torch.bmm(enc_hiddens_proj, torch.unsqueeze(h_t_dec,2)),2)


        ### END YOUR CODE

        # Set e_t to -inf where enc_masks has 1
        if enc_masks is not None:
            e_t.data.masked_fill_(enc_masks.byte(), -float('inf'))

        ### YOUR CODE HERE (~6 Lines)
        ### TODO:
        ###     1. Apply softmax to e_t to yield alpha_t
        ###     2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
        ###         attention output vector, a_t.
        #$$     Hints:
        ###           - alpha_t is shape (b, src_len)
        ###           - enc_hiddens is shape (b, src_len, 2h)
        ###           - a_t should be shape (b, 2h)
        ###           - You will need to do some squeezing and unsqueezing.
        ###     Note: b = batch size, src_len = maximum source length, h = hidden size.
        ###
        ###     3. Concatenate dec_hidden with a_t to compute tensor U_t
        ###     4. Apply the combined output projection layer to U_t to compute tensor V_t
        ###     5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
        ###
        ### Use the following docs to implement this functionality:
        ###     Softmax:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.functional.softmax
        ###     Batch Multiplication:
        ###        https://pytorch.org/docs/stable/torch.html#torch.bmm
        ###     Tensor View:
        ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
        ###     Tensor Concatenation:
        ###         https://pytorch.org/docs/stable/torch.html#torch.cat
        ###     Tanh:
        ###         https://pytorch.org/docs/stable/torch.html#torch.tanh

        # (b,src_len)
        alpha_t = nn.functional.softmax(e_t, dim = 1) 
        # alpha_t(b,src_len) - (b,1,src_len) * enc_hiddens(b, src_len, h * 2) = (b, 1, h * 2) -> (b,2h)
        a_t = torch.squeeze(torch.bmm(torch.unsqueeze(alpha_t,1),enc_hiddens),1)
        #(b,2h) + (b,h)
        U_t = torch.cat((a_t,h_t_dec), dim = 1)
        V_t = self.combined_output_projection(U_t)
        O_t = self.dropout(nn.functional.tanh(V_t))


        ### END YOUR CODE
        combined_output = O_t
        return dec_state, combined_output, e_t

Helpers

def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        @param source (List[List[str]]): list of source sentence tokens
        @param target (List[List[str]]): list of target sentence tokens, wrapped by `<s>` and `</s>`

        @returns scores (Tensor): a variable/tensor of shape (b, ) representing the
                                    log-likelihood of generating the gold-standard target sentence for
                                    each example in the input batch. Here b = batch size.
        """
        # Compute sentence lengths
        source_lengths = [len(s) for s in source]

        # Convert list of lists into tensors
        source_padded = self.vocab.src.to_input_tensor(source, device=self.device)   # Tensor: (src_len, b)
        target_padded = self.vocab.tgt.to_input_tensor(target, device=self.device)   # Tensor: (tgt_len, b)

        ###     Run the network forward:
        ###     1. Apply the encoder to `source_padded` by calling `self.encode()`
        ###     2. Generate sentence masks for `source_padded` by calling `self.generate_sent_masks()`
        ###     3. Apply the decoder to compute combined-output by calling `self.decode()`
        ###     4. Compute log probability distribution over the target vocabulary using the
        ###        combined_outputs returned by the `self.decode()` function.

        enc_hiddens, dec_init_state = self.encode(source_padded, source_lengths)
        enc_masks = self.generate_sent_masks(enc_hiddens, source_lengths)
        combined_outputs = self.decode(enc_hiddens, enc_masks, dec_init_state, target_padded)
        P = F.log_softmax(self.target_vocab_projection(combined_outputs), dim=-1)

        # Zero out, probabilities for which we have nothing in the target text
        target_masks = (target_padded != self.vocab.tgt['<pad>']).float()
        
        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(-1) * target_masks[1:]
        scores = target_gold_words_log_prob.sum(dim=0)
        return scores

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
CS224N 2018-19: Homework 4
model_embeddings.py: Embeddings for the NMT model
Pencheng Yin <pcyin@cs.cmu.edu>
Sahil Chopra <schopra8@stanford.edu>
Anand Dhoot <anandd@stanford.edu>
"""

import torch.nn as nn

class ModelEmbeddings(nn.Module): 
    """
    Class that converts input words to their embeddings.
    """
    def __init__(self, embed_size, vocab):
        """
        Init the Embedding layers.

        @param embed_size (int): Embedding size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        """
        super(ModelEmbeddings, self).__init__()
        self.embed_size = embed_size

        # default values
        self.source = None
        self.target = None

        src_pad_token_idx = vocab.src['<pad>']
        tgt_pad_token_idx = vocab.tgt['<pad>']

        ### YOUR CODE HERE (~2 Lines)
        ### TODO - Initialize the following variables:
        ###     self.source (Embedding Layer for source language)
        ###     self.target (Embedding Layer for target langauge)
        ###
        ### Note:
        ###     1. `vocab` object contains two vocabularies:
        ###            `vocab.src` for source
        ###            `vocab.tgt` for target
        ###     2. You can get the length of a specific vocabulary by running:
        ###             `len(vocab.<specific_vocabulary>)`
        ###     3. Remember to include the padding token for the specific vocabulary
        ###        when creating your Embedding.
        ###
        ### Use the following docs to properly initialize these variables:
        ###     Embedding Layer:
        ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
        self.source = nn.Embedding(len(vocab.src),self.embed_size, padding_idx = src_pad_token_idx)
        self.target = nn.Embedding(len(vocab.tgt), self.embed_size, padding_idx = tgt_pad_token_idx) 
        ### END YOUR CODE

def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
    @param sents (list[list[str]]): list of sentences, where each sentence
                                    is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentences in the batch now has equal length.
    """
    sents_padded = []

    ### YOUR CODE HERE (~6 Lines)
    max_sentence_len = max([len(s) for s in sents])
    for sent in sents:
        sents_padded.append(sent + [pad_token] * (max_sentence_len - len(sent)))

    ### END YOUR CODE

    return sents_padded