## Character-Level Models

1. Word embeddings can be composed from character embeddings
• Generates embeddings for unknown words
• Similar spellings share similar embeddings
• Mitigates the OOV problem
2. Motivation
• Derive a powerful, robust language model effective across a variety of languages.
• Encode subword relatedness: eventful, eventfully, uneventful…
• Address the rare-word problem of prior models.
• Obtain comparable expressivity with fewer parameters.
3. Same architecture as for the word-level model, but use smaller units: “word pieces”
4. Hybrid architectures: the main model operates on words; something else handles characters

## Byte Pair Encoding

A word segmentation algorithm:

• Repeatedly merge the most frequent pair of adjacent symbols (n-grams) into a new symbol
• Have a target vocabulary size and stop when you reach it
• At segmentation time, deterministically split words into the longest matching pieces
• Segmentation is done only within words identified by some prior tokenizer

For example, suppose all the words in our documents database and their frequencies are:

{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w e s t’: 6, ‘w i d e s t’: 3}

We can initialize our vocabulary library as:

{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’}

The most frequent ngram pair is (‘e’, ‘s’), with count 9 (6 from ‘n e w e s t’ and 3 from ‘w i d e s t’), so we add ‘es’ to our vocabulary library.

Our documents database now is:

{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w es t’: 6, ‘w i d es t’: 3}.

Our vocabulary library now is:

{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’, ‘es’}

Again, the most frequent ngram pair is (‘es’, ‘t’) and its count is 9, so we add ‘est’ to our vocabulary library.

Our documents database now is:

{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w est’: 6, ‘w i d est’: 3}

Our vocabulary library now is:

{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’, ‘es’, ‘est’}

The rest can be done in the same manner: we set a target size for the vocabulary library and stop merging once it is reached. In this way, BPE constructs a vocabulary library that can represent all the words in terms of subword units.
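The merge loop above is easy to express in code. Below is a minimal Python sketch (the function names are mine; the space-separated word format follows the example):

```python
import re
from collections import Counter

def get_pair_counts(corpus):
    """Count occurrences of each adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in corpus.items()}

corpus = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
vocab = {ch for word in corpus for ch in word.split()}

num_merges = 2  # in practice, chosen so the vocabulary reaches the target size
for _ in range(num_merges):
    pair, count = get_pair_counts(corpus).most_common(1)[0]
    vocab.add(''.join(pair))
    corpus = merge_pair(pair, corpus)
    print(pair, count, corpus)
# ('e', 's') 9 {'l o w': 5, 'l o w e r': 2, 'n e w es t': 6, 'w i d es t': 3}
# ('es', 't') 9 {'l o w': 5, 'l o w e r': 2, 'n e w est': 6, 'w i d est': 3}
```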

Google NMT (GNMT) uses a variant of this:

• V1: wordpiece model (tokenizes inside words)
• V2: sentencepiece model (works from raw text)
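As an illustration, the open-source SentencePiece library implements this V2 style of training directly on raw text. A hedged sketch (the file names and parameter values are placeholders, not GNMT's actual settings):

```python
import sentencepiece as spm

# Train a BPE-style subword model directly on raw (untokenized) text.
# 'corpus.txt', the model prefix, and the vocab size are placeholders.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='bpe', vocab_size=8000, model_type='bpe'
)

sp = spm.SentencePieceProcessor(model_file='bpe.model')
print(sp.encode('the widest river', out_type=str))
# e.g. ['▁the', '▁wid', 'est', '▁river']  (output depends on the training corpus)
```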

## Character-level to build word-level

1. Convolution over characters to generate word embeddings
2. Character-based LSTM to build word representation
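As a sketch of the second option, here is a small bidirectional character-level LSTM whose final forward and backward states are concatenated into a word representation (the module name and sizes are my own illustrative choices):

```python
import torch
import torch.nn as nn

class CharLSTMWordEmbedder(nn.Module):
    """Builds a word representation from the final states of a char-level BiLSTM."""
    def __init__(self, n_chars, e_char=50, e_word=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, e_char)
        self.lstm = nn.LSTM(e_char, e_word // 2, bidirectional=True, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, word_len) integer character indices
        emb = self.embed(char_ids)              # (batch, word_len, e_char)
        _, (h_n, _) = self.lstm(emb)            # h_n: (2, batch, e_word // 2)
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, e_word)

print(CharLSTMWordEmbedder(n_chars=96)(torch.randint(0, 96, (4, 8))).shape)
# torch.Size([4, 256])
```

The convolutional alternative (option 1) is worked through in detail in the assignment section below.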

## CS224n Assignment 5

### Character-based convolutional encoder for NMT

Figure from cs224n. Character-based convolutional encoder, which ultimately produces a word embedding of length $e_{word}$.

1. Convert word to character indices. We have a word $x$ (e.g. “Anarchy” in the figure above) that we wish to represent. Assume we have a predefined vocabulary of characters (for example, all lowercase letters, uppercase letters, numbers, and some punctuation). By looking up the index of each character, we can represent the length-$l$ word $x$ as a vector of integers:

$$x = [c_1, c_2, \cdots, c_l] \in \mathbb{Z}^{l}$$

where each $c_i$ is an integer index into the character vocabulary.
2. Padding and embedding lookup. Using a special ‘character’, we pad (or truncate) every word so that it has length $m_{word}$ (a predefined hyperparameter representing the maximum word length):

$$x_{padded} = [c_1, c_2, \cdots, c_{m_{word}}] \in \mathbb{Z}^{m_{word}}$$

3. For each of these characters $c_i$, we look up a dense character embedding (of size $e_{char}$). This yields a tensor $x_{emb}$:

$$x_{emb} = \text{CharEmbedding}(x_{padded}) \in \mathbb{R}^{m_{word} \times e_{char}}$$

We’ll reshape $x_{emb}$ to obtain $x_{reshaped} \in \mathbb{R}^{e_{char} \times m_{word}}$ before feeding it into the convolutional network.

4. Convolutional network. To combine these character embeddings, we’ll use 1-dimensional convolutions. The convolutional layer has two hyperparameters: the kernel size $k$ (also called window size), which dictates the size of the window used to compute features, and the number of filters $f$ (also called the number of output features or output channels). The convolutional layer has a weight matrix $W \in \mathbb{R}^{f \times e_{char} \times k}$ and a bias vector $b \in \mathbb{R}^{f}$. Overall this produces output $x_{conv}$:

$$x_{conv} = \text{Conv1D}(x_{reshaped}) \in \mathbb{R}^{f \times (m_{word} - k + 1)}$$

For our application, we’ll set $f$ equal to $e_{word}$, the size of the final word embedding for word $x$. Therefore,

$$x_{conv} \in \mathbb{R}^{e_{word} \times (m_{word} - k + 1)}$$

Finally, we apply the ReLU function to $x_{conv}$, then use max-pooling to reduce this to a single vector $x_{conv\_out} \in \mathbb{R}^{e_{word}}$, which is the final output of the convolutional network:

$$x_{conv\_out} = \text{MaxPool}(\text{ReLU}(x_{conv})) \in \mathbb{R}^{e_{word}}$$

5. Highway layer and dropout. Highway networks have a skip-connection controlled by a dynamic gate. Given the input $x_{conv\_out} \in \mathbb{R}^{e_{word}}$, we compute:

$$x_{proj} = \text{ReLU}(W_{proj}\, x_{conv\_out} + b_{proj})$$

$$x_{gate} = \sigma(W_{gate}\, x_{conv\_out} + b_{gate})$$

$$x_{highway} = x_{gate} \circ x_{proj} + (1 - x_{gate}) \circ x_{conv\_out}$$

Figure from cs224n. Highway Network (Srivastava et al. 2015)

where $W_{proj}, W_{gate} \in \mathbb{R}^{e_{word} \times e_{word}}$, $b_{proj}, b_{gate} \in \mathbb{R}^{e_{word}}$, and $\circ$ denotes element-wise multiplication. Finally, we apply dropout to $x_{highway}$ to obtain the word embedding $x_{word\_emb} = \text{Dropout}(x_{highway}) \in \mathbb{R}^{e_{word}}$.

6. Combining the steps above gives our character-based word embedding model (a code sketch follows below).
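Putting steps 1–6 together, here is a compact PyTorch sketch of the encoder. The module name, hyperparameter defaults, and the assumption that index 0 is the padding character are mine, not the assignment’s reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNNWordEmbedder(nn.Module):
    """Char indices -> embeddings -> Conv1D -> ReLU + MaxPool -> highway -> dropout."""

    def __init__(self, n_chars, e_char=50, e_word=256, k=5, p_drop=0.3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, e_char, padding_idx=0)  # assume 0 = pad char
        self.conv = nn.Conv1d(e_char, e_word, kernel_size=k)       # f = e_word filters
        self.proj = nn.Linear(e_word, e_word)                      # W_proj, b_proj
        self.gate = nn.Linear(e_word, e_word)                      # W_gate, b_gate
        self.dropout = nn.Dropout(p_drop)

    def forward(self, char_ids):
        # char_ids: (batch, m_word), already padded/truncated to m_word
        x_emb = self.embed(char_ids)                    # (batch, m_word, e_char)
        x_reshaped = x_emb.transpose(1, 2)              # (batch, e_char, m_word)
        x_conv = self.conv(x_reshaped)                  # (batch, e_word, m_word - k + 1)
        x_conv_out = F.relu(x_conv).max(dim=2).values   # max-pool over positions
        x_proj = F.relu(self.proj(x_conv_out))
        x_gate = torch.sigmoid(self.gate(x_conv_out))
        x_highway = x_gate * x_proj + (1 - x_gate) * x_conv_out
        return self.dropout(x_highway)                  # (batch, e_word) word embedding

# Example: a batch of 4 words, each padded to m_word = 21 characters
model = CharCNNWordEmbedder(n_chars=96)
words = torch.randint(1, 96, (4, 21))
print(model(words).shape)  # torch.Size([4, 256])
```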

### Character-based LSTM decoder for NMT

We will now add an LSTM-based character-level decoder to our NMT system. The main idea is that when our word-level decoder produces an <UNK> token, we run our character-level decoder (which you can think of as a character-level conditional language model) to instead generate the target word one character at a time, as shown in the figure below. This helps us produce rare and out-of-vocabulary target words.

Figure from cs224n. A character-based decoder which is triggered if the word-based decoder produces an UNK. Figure courtesy of Luong & Manning.

We now describe the model in three sections:

1. Forward computation of the Character Decoder: Given a sequence of integers $x_1, \cdots, x_n \in \mathbb{Z}$ representing a sequence of characters, we look up their character embeddings $x_1, \cdots, x_n \in \mathbb{R}^{e_{char}}$ and pass these as input into the (unidirectional) LSTM, obtaining hidden states $h_1, \cdots, h_n$ and cell states $c_1, \cdots, c_n$:

$$h_t, c_t = \text{CharDecoderLSTM}(x_t, h_{t-1}, c_{t-1})$$

where $h$ is the hidden size of the CharDecoderLSTM and $h_t, c_t \in \mathbb{R}^{h}$. The initial hidden and cell states $h_0$ and $c_0$ are both set to the combined output vector (the attention-combined hidden state) for the current timestep of the main word-level NMT decoder.
For every timestep $t \in \{1, \cdots, n\}$ we compute scores (also called logits) $s_t \in \mathbb{R}^{V_{char}}$:

$$s_t = W_{dec} h_t + b_{dec}$$

where the weight matrix $W_{dec} \in \mathbb{R}^{V_{char} \times h}$ and the bias vector $b_{dec} \in \mathbb{R}^{V_{char}}$. If we passed $s_t$ through a softmax function, we would obtain the probability distribution for the next character in the sequence.

2. Training of the Character Decoder: When we train the NMT system, we train the character decoder on every word in the target sentence (not just the words represented by <UNK>). For example, on a particular step of the main NMT decoder, if the target word is “music”, then the input sequence for the CharDecoderLSTM is $[x_1, \cdots, x_n]$ = [<START>,m,u,s,i,c] and the target sequence is $[x_2, \cdots, x_{n+1}]$ = [m,u,s,i,c,<END>].
We pass the input sequence $x_1, \cdots, x_n$, along with the initial states $h_0$ and $c_0$ (obtained from the combined output vector), into the CharDecoderLSTM, obtaining scores $s_1, \cdots, s_n$, which we compare to the target sequence $x_2, \cdots, x_{n+1}$. We optimize with respect to the sum of the cross-entropy losses:

$$p_t = \text{softmax}(s_t), \qquad loss_{char\_dec} = -\sum_{t=1}^{n} \log p_t(x_{t+1})$$

3. Decoding from the Character Decoder: At test time, we first produce a translation from our word-based NMT system in the usual way (e.g. with a decoding algorithm like beam search). If the translation contains any <UNK> tokens, then for each of those positions we use the word-based decoder’s combined output vector to initialize the CharDecoderLSTM’s $h_0$ and $c_0$, then use the CharDecoderLSTM to generate a sequence of characters. To generate that sequence we use greedy decoding, which repeatedly chooses the most probable next character until either the <END> token is produced or we reach a predetermined maximum length. The algorithm is given below, for a single example (not batched).

Figure from cs224n. Greedy Decoding
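For concreteness, here is a minimal PyTorch sketch of the CharDecoderLSTM forward computation and the greedy decoding loop, under the definitions above (the class name, special-token indices, and sizes are illustrative, not the assignment’s reference code):

```python
import torch
import torch.nn as nn

START, END = 1, 2  # assumed indices of <START> and <END> in the character vocab

class CharDecoder(nn.Module):
    def __init__(self, v_char, e_char=50, h=256):
        super().__init__()
        self.embed = nn.Embedding(v_char, e_char)
        self.lstm = nn.LSTM(e_char, h)             # unidirectional, sequence-first
        self.out = nn.Linear(h, v_char)            # W_dec, b_dec

    def forward(self, char_ids, state):
        # char_ids: (seq_len, batch); state: (h_0, c_0), each (1, batch, h)
        emb = self.embed(char_ids)                 # (seq_len, batch, e_char)
        hiddens, state = self.lstm(emb, state)
        return self.out(hiddens), state            # scores s_t: (seq_len, batch, V_char)

    @torch.no_grad()
    def greedy_decode(self, state, max_len=21):
        """Emit the most probable character each step until <END> or max_len."""
        current = torch.tensor([[START]])          # (seq_len=1, batch=1)
        output = []
        for _ in range(max_len):
            scores, state = self.forward(current, state)
            current = scores.argmax(dim=-1)        # most probable next character
            if current.item() == END:
                break
            output.append(current.item())
        return output

# Training would compare the scores against the shifted target sequence with
# cross-entropy. h_0/c_0 would come from the word-level decoder's combined
# output vector; zeros are used here just to make the sketch runnable.
dec = CharDecoder(v_char=96)
h0 = torch.zeros(1, 1, 256); c0 = torch.zeros(1, 1, 256)
print(dec.greedy_decode((h0, c0)))  # a list of character indices
```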

## Reference

• Course notes and slides of CS224n (Stanford)