## Character-Level Models

1. Word embeddings can be composed from character embeddings
• Generates embeddings for unknown words
• Similar spellings share similar embeddings
• Mitigates the OOV problem
2. Motivation
• Derive a powerful, robust language model effective across a variety of languages.
• Encode subword relatedness: eventful, eventfully, uneventful…
• Address the rare-word problem of prior models.
• Obtain comparable expressivity with fewer parameters.
3. Same architecture as for the word-level model, but use smaller units: “word pieces”
4. Hybrid architectures: the main model operates on words; something else handles characters

## Byte Pair Encoding

A word segmentation algorithm:

• Repeatedly merge the most frequent pair of adjacent symbols (n-grams) into a new symbol
• Have a target vocabulary size and stop when you reach it
• At segmentation time, deterministically split words into the longest matching pieces
• Segmentation is done only within words identified by some prior tokenizer

For example, suppose all the words in our documents database and their frequencies are:

{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w e s t’: 6, ‘w i d e s t’: 3}

We can initialize our vocabulary library as:

{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’}

The most frequent ngram pair is (‘e’, ‘s’), with count 9 (6 from ‘n e w e s t’ and 3 from ‘w i d e s t’), so we add ‘es’ to our vocabulary library.

Our documents database now is:

{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w es t’: 6, ‘w i d es t’: 3}.

Our vocabulary library now is:

{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’, ‘es’}

Again, the most frequent ngram pair is (‘es’, ‘t’) and its count is 9, so we add ‘est’ to our vocabulary library.

Our documents database now is:

{‘l o w’: 5, ‘l o w e r’: 2, ‘n e w est’: 6, ‘w i d est’: 3}

Our vocabulary library now is:

{ ‘l’, ‘o’, ‘w’, ‘e’, ‘r’, ‘n’, ‘s’, ‘t’, ‘i’, ‘d’, ‘es’, ‘est’}

The rest can be done in the same manner: we set a target size for the vocabulary library and stop merging once it is reached. In this way, BPE constructs a vocabulary library that can represent all the words in terms of subword units.
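The merge loop above is easy to express in code. Below is a minimal Python sketch (the function names are mine; the space-separated word format follows the example):

```python
import re
from collections import Counter

def get_pair_counts(corpus):
    """Count occurrences of each adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in corpus.items()}

corpus = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
vocab = {ch for word in corpus for ch in word.split()}

num_merges = 2  # in practice, chosen so the vocabulary reaches the target size
for _ in range(num_merges):
    pair, count = get_pair_counts(corpus).most_common(1)[0]
    vocab.add(''.join(pair))
    corpus = merge_pair(pair, corpus)
    print(pair, count, corpus)
# ('e', 's') 9 {'l o w': 5, 'l o w e r': 2, 'n e w es t': 6, 'w i d es t': 3}
# ('es', 't') 9 {'l o w': 5, 'l o w e r': 2, 'n e w est': 6, 'w i d est': 3}
```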

Google NMT (GNMT) uses a variant of this:

• V1: wordpiece model (tokenizes inside words)
• V2: sentencepiece model (works from raw text)
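As an illustration, the open-source SentencePiece library implements this V2 style of training directly on raw text. A hedged sketch (the file names and parameter values are placeholders, not GNMT's actual settings):

```python
import sentencepiece as spm

# Train a BPE-style subword model directly on raw (untokenized) text.
# 'corpus.txt', the model prefix, and the vocab size are placeholders.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='bpe', vocab_size=8000, model_type='bpe'
)

sp = spm.SentencePieceProcessor(model_file='bpe.model')
print(sp.encode('the widest river', out_type=str))
# e.g. ['▁the', '▁wid', 'est', '▁river']  (output depends on the training corpus)
```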

## Character-level to build word-level

1. Convolution over characters to generate word embeddings
2. Character-based LSTM to build word representation
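As a sketch of the second option, here is a small bidirectional character-level LSTM whose final forward and backward states are concatenated into a word representation (the module name and sizes are my own illustrative choices):

```python
import torch
import torch.nn as nn

class CharLSTMWordEmbedder(nn.Module):
    """Builds a word representation from the final states of a char-level BiLSTM."""
    def __init__(self, n_chars, e_char=50, e_word=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, e_char)
        self.lstm = nn.LSTM(e_char, e_word // 2, bidirectional=True, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, word_len) integer character indices
        emb = self.embed(char_ids)              # (batch, word_len, e_char)
        _, (h_n, _) = self.lstm(emb)            # h_n: (2, batch, e_word // 2)
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, e_word)

print(CharLSTMWordEmbedder(n_chars=96)(torch.randint(0, 96, (4, 8))).shape)
# torch.Size([4, 256])
```

The convolutional alternative (option 1) is worked through in detail in the assignment section below.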

## CS224n Assignment 5

### Character-based convolutional encoder for NMT

Figure from cs224n. Character-based convolutional encoder, which ultimately produces a word embedding of length $e_{word}$.

1. Convert word to character indices. We have a word $x$ (e.g. “Anarchy” in the figure above) that we wish to represent. Assume we have a predefined vocabulary of characters (for example, all lowercase letters, uppercase letters, numbers, and some punctuation). By looking up the index of each character, we can represent the length-$l$ word $x$ as a vector of integers:

$$x = [c_1, c_2, \cdots, c_l] \in \mathbb{Z}^{l}$$

where each $c_i$ is an integer index into the character vocabulary.
2. Padding and embedding lookup. Using a special ‘character’, we pad (or truncate) every word so that it has length $m_{word}$ (a predefined hyperparameter representing the maximum word length):

$$x_{padded} = [c_1, c_2, \cdots, c_{m_{word}}] \in \mathbb{Z}^{m_{word}}$$

3. For each of these characters $c_i$, we look up a dense character embedding (of size $e_{char}$). This yields a tensor $x_{emb}$:

$$x_{emb} = \text{CharEmbedding}(x_{padded}) \in \mathbb{R}^{m_{word} \times e_{char}}$$

We’ll reshape $x_{emb}$ to obtain $x_{reshaped} \in \mathbb{R}^{e_{char} \times m_{word}}$ before feeding it into the convolutional network.

4. Convolutional network. To combine these character embeddings, we’ll use 1-dimensional convolutions. The convolutional layer has two hyperparameters: the kernel size $k$ (also called window size), which dictates the size of the window used to compute features, and the number of filters $f$ (also called the number of output features or output channels). The convolutional layer has a weight matrix $W \in \mathbb{R}^{f \times e_{char} \times k}$ and a bias vector $b \in \mathbb{R}^{f}$. Overall this produces output $x_{conv}$:

$$x_{conv} = \text{Conv1D}(x_{reshaped}) \in \mathbb{R}^{f \times (m_{word} - k + 1)}$$

For our application, we’ll set $f$ equal to $e_{word}$, the size of the final word embedding for word $x$. Therefore,

$$x_{conv} \in \mathbb{R}^{e_{word} \times (m_{word} - k + 1)}$$

Finally, we apply the ReLU function to $x_{conv}$, then use max-pooling to reduce this to a single vector $x_{conv\_out} \in \mathbb{R}^{e_{word}}$, which is the final output of the convolutional network:

$$x_{conv\_out} = \text{MaxPool}(\text{ReLU}(x_{conv})) \in \mathbb{R}^{e_{word}}$$

5. Highway layer and dropout. Highway networks have a skip-connection controlled by a dynamic gate. Given the input $x_{conv\_out} \in \mathbb{R}^{e_{word}}$, we compute:

$$x_{proj} = \text{ReLU}(W_{proj}\, x_{conv\_out} + b_{proj})$$

$$x_{gate} = \sigma(W_{gate}\, x_{conv\_out} + b_{gate})$$

$$x_{highway} = x_{gate} \circ x_{proj} + (1 - x_{gate}) \circ x_{conv\_out}$$

Figure from cs224n. Highway Network (Srivastava et al. 2015)

where $W_{proj}, W_{gate} \in \mathbb{R}^{e_{word} \times e_{word}}$, $b_{proj}, b_{gate} \in \mathbb{R}^{e_{word}}$, and $\circ$ denotes element-wise multiplication. Finally, we apply dropout to $x_{highway}$ to obtain the word embedding $x_{word\_emb} = \text{Dropout}(x_{highway}) \in \mathbb{R}^{e_{word}}$.

6. Combining the steps above gives our character-based word embedding model (a code sketch follows below).
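Putting steps 1–6 together, here is a compact PyTorch sketch of the encoder. The module name, hyperparameter defaults, and the assumption that index 0 is the padding character are mine, not the assignment’s reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNNWordEmbedder(nn.Module):
    """Char indices -> embeddings -> Conv1D -> ReLU + MaxPool -> highway -> dropout."""

    def __init__(self, n_chars, e_char=50, e_word=256, k=5, p_drop=0.3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, e_char, padding_idx=0)  # assume 0 = pad char
        self.conv = nn.Conv1d(e_char, e_word, kernel_size=k)       # f = e_word filters
        self.proj = nn.Linear(e_word, e_word)                      # W_proj, b_proj
        self.gate = nn.Linear(e_word, e_word)                      # W_gate, b_gate
        self.dropout = nn.Dropout(p_drop)

    def forward(self, char_ids):
        # char_ids: (batch, m_word), already padded/truncated to m_word
        x_emb = self.embed(char_ids)                    # (batch, m_word, e_char)
        x_reshaped = x_emb.transpose(1, 2)              # (batch, e_char, m_word)
        x_conv = self.conv(x_reshaped)                  # (batch, e_word, m_word - k + 1)
        x_conv_out = F.relu(x_conv).max(dim=2).values   # max-pool over positions
        x_proj = F.relu(self.proj(x_conv_out))
        x_gate = torch.sigmoid(self.gate(x_conv_out))
        x_highway = x_gate * x_proj + (1 - x_gate) * x_conv_out
        return self.dropout(x_highway)                  # (batch, e_word) word embedding

# Example: a batch of 4 words, each padded to m_word = 21 characters
model = CharCNNWordEmbedder(n_chars=96)
words = torch.randint(1, 96, (4, 21))
print(model(words).shape)  # torch.Size([4, 256])
```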

### Character-based LSTM decoder for NMT

We will now add an LSTM-based character-level decoder to our NMT system. The main idea is that when our word-level decoder produces an <UNK> token, we run our character-level decoder (which you can think of as a character-level conditional language model) to instead generate the target word one character at a time, as shown in the figure below. This helps us produce rare and out-of-vocabulary target words.

Figure from cs224n. A character-based decoder which is triggered if the word-based decoder produces an UNK. Figure courtesy of Luong & Manning.

We now describe the model in three sections:

1. Forward computation of the Character Decoder: Given a sequence of integers $x_1, \cdots, x_n \in \mathbb{Z}$ representing a sequence of characters, we look up their character embeddings $x_1, \cdots, x_n \in \mathbb{R}^{e_{char}}$ and pass these as input into the (unidirectional) LSTM, obtaining hidden states $h_1, \cdots, h_n$ and cell states $c_1, \cdots, c_n$:

$$h_t, c_t = \text{CharDecoderLSTM}(x_t, h_{t-1}, c_{t-1})$$

where $h$ is the hidden size of the CharDecoderLSTM and $h_t, c_t \in \mathbb{R}^{h}$. The initial hidden and cell states $h_0$ and $c_0$ are both set to the combined output vector (the attention-combined hidden state) for the current timestep of the main word-level NMT decoder.
For every timestep $t \in \{1, \cdots, n\}$ we compute scores (also called logits) $s_t \in \mathbb{R}^{V_{char}}$:

$$s_t = W_{dec} h_t + b_{dec}$$

where the weight matrix $W_{dec} \in \mathbb{R}^{V_{char} \times h}$ and the bias vector $b_{dec} \in \mathbb{R}^{V_{char}}$. If we passed $s_t$ through a softmax function, we would obtain the probability distribution for the next character in the sequence.

2. Training of the Character Decoder: When we train the NMT system, we train the character decoder on every word in the target sentence (not just the words represented by <UNK>). For example, on a particular step of the main NMT decoder, if the target word is “music”, then the input sequence for the CharDecoderLSTM is $[x_1, \cdots, x_n]$ = [<START>,m,u,s,i,c] and the target sequence is $[x_2, \cdots, x_{n+1}]$ = [m,u,s,i,c,<END>].
We pass the input sequence $x_1, \cdots, x_n$, along with the initial states $h_0$ and $c_0$ (obtained from the combined output vector), into the CharDecoderLSTM, obtaining scores $s_1, \cdots, s_n$, which we compare to the target sequence $x_2, \cdots, x_{n+1}$. We optimize with respect to the sum of the cross-entropy losses:

$$p_t = \text{softmax}(s_t), \qquad loss_{char\_dec} = -\sum_{t=1}^{n} \log p_t(x_{t+1})$$

3. Decoding from the Character Decoder: At test time, we first produce a translation from our word-based NMT system in the usual way (e.g. with a decoding algorithm like beam search). If the translation contains any <UNK> tokens, then for each of those positions we use the word-based decoder’s combined output vector to initialize the CharDecoderLSTM’s $h_0$ and $c_0$, then use the CharDecoderLSTM to generate a sequence of characters. To generate that sequence we use greedy decoding, which repeatedly chooses the most probable next character until either the <END> token is produced or we reach a predetermined maximum length. The algorithm is given below, for a single example (not batched).

Figure from cs224n. Greedy Decoding
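For concreteness, here is a minimal PyTorch sketch of the CharDecoderLSTM forward computation and the greedy decoding loop, under the definitions above (the class name, special-token indices, and sizes are illustrative, not the assignment’s reference code):

```python
import torch
import torch.nn as nn

START, END = 1, 2  # assumed indices of <START> and <END> in the character vocab

class CharDecoder(nn.Module):
    def __init__(self, v_char, e_char=50, h=256):
        super().__init__()
        self.embed = nn.Embedding(v_char, e_char)
        self.lstm = nn.LSTM(e_char, h)             # unidirectional, sequence-first
        self.out = nn.Linear(h, v_char)            # W_dec, b_dec

    def forward(self, char_ids, state):
        # char_ids: (seq_len, batch); state: (h_0, c_0), each (1, batch, h)
        emb = self.embed(char_ids)                 # (seq_len, batch, e_char)
        hiddens, state = self.lstm(emb, state)
        return self.out(hiddens), state            # scores s_t: (seq_len, batch, V_char)

    @torch.no_grad()
    def greedy_decode(self, state, max_len=21):
        """Emit the most probable character each step until <END> or max_len."""
        current = torch.tensor([[START]])          # (seq_len=1, batch=1)
        output = []
        for _ in range(max_len):
            scores, state = self.forward(current, state)
            current = scores.argmax(dim=-1)        # most probable next character
            if current.item() == END:
                break
            output.append(current.item())
        return output

# Training would compare the scores against the shifted target sequence with
# cross-entropy. h_0/c_0 would come from the word-level decoder's combined
# output vector; zeros are used here just to make the sketch runnable.
dec = CharDecoder(v_char=96)
h0 = torch.zeros(1, 1, 256); c0 = torch.zeros(1, 1, 256)
print(dec.greedy_decode((h0, c0)))  # a list of character indices
```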

## Reference

• Course notes and slides of CS224n (Stanford)