The problems with RNNs

  1. Sequential computation inhibits parallelization
  2. No explicit modeling of long- and short-range dependencies
  3. We want to model hierarchy (RNNs seem wasteful)


ELMo stands for Embeddings from Language Models. The original paper is available at https://arxiv.org/abs/1802.05365

  1. Breakout version of word token vectors or contextual word vectors
  2. Learn word token vectors using long contexts not context windows (here, whole sentence, could be longer)
  3. Learn a deep Bi-NLM and use all its layers in prediction

What’s ELMo’s secret?

ELMo gained its language understanding from being trained to predict the next word in a sequence of words - a task called Language Modeling. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.

char cnn embedding from ELMo

A step in the pre-training process of ELMo: Given “Let’s stick to” as input, predict the next most likely word – a language modeling task. When trained on a large dataset, the model starts to pick up on language patterns. It’s unlikely it’ll accurately guess the next word in this example. More realistically, after a word such as “hang”, it will assign a higher probability to a word like “out” (to spell “hang out”) than to “camera”.
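
To make the "hang out" versus "camera" intuition concrete, here is a minimal sketch of next-word prediction. It uses a count-based bigram model rather than ELMo's LSTM, and the three-sentence corpus is made up for illustration, so it only shows the shape of the task, not how ELMo is implemented.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration, not from the ELMo paper).
corpus = [
    "let's hang out tonight",
    "we hang out every weekend",
    "hang the camera strap here",
]

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1


def next_word_probs(prev):
    """Estimate P(next word | prev) from bigram counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = next_word_probs("hang")
print(probs)
```

In this toy corpus "out" follows "hang" more often than any other word, while "camera" never follows it directly, so "out" gets the higher probability. A neural language model learns the same kind of preference, but conditions on the whole preceding context instead of a single word.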

bilstm LM

ELMo actually goes a step further and trains a bi-directional LSTM – so that its language model doesn’t only have a sense of the next word, but also the previous word.

  1. Forward LSTM structure:
  2. Backward LSTM structure:
  3. Maximum likelihood function:
  4. Linear combination formula:
bilstm from ELMo
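
The four items listed above refer to formulas that are not reproduced in this text. As a reference, here is how they appear in the ELMo paper, using the paper's notation (an assumption with respect to this post): $t_k$ is the $k$-th token of an $N$-token sequence, $h_{k,j}^{LM}$ is the representation of token $k$ at layer $j$ of an $L$-layer biLM, $s_j^{task}$ are softmax-normalized layer weights, and $\gamma^{task}$ is a task-specific scale.

Forward language model:

$$p(t_1, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$

Backward language model:

$$p(t_1, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, \dots, t_N)$$

Joint log-likelihood maximized during training:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

Linear combination of layers for a downstream task:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}$$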

Char cnn embedding

The input to ELMo is a character-level embedding; for details, see https://zhangruochi.com/Subword-Models/2019/12/19/

char cnn embedding from ELMo

How to use ELMo after pre-training

We can feed our input data to the pre-trained ELMo model to obtain dynamic (context-dependent) word vectors, and then use these vectors in our specific task.

ELMo used in a sequence tagger
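
As a rough illustration of how a downstream model can consume ELMo, the sketch below mixes the per-layer vectors of a single token into one vector. It follows the weighted-sum idea from the ELMo paper (softmax-normalized layer weights times a scale factor). The function names, the toy 2-dimensional vectors, and the equal layer scores are assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_vector(layer_states, layer_scores, gamma=1.0):
    """Mix one token's per-layer vectors: gamma * sum_j s_j * h_j."""
    weights = softmax(layer_scores)
    dim = len(layer_states[0])
    return [
        gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
        for d in range(dim)
    ]


# Three layers (token layer plus two biLSTM layers), toy 2-d vectors.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal scores give equal weights, i.e. a plain average of the layers.
mixed = elmo_vector(layers, layer_scores=[0.0, 0.0, 0.0])
print(mixed)
```

In practice the layer scores and the scale are trained together with the task model, so each task can decide which layers of the language model it relies on most.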

OpenAI Transformer: Pre-training a Transformer Decoder for Language Modeling

It turns out we don’t need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do with just the decoder of the transformer. The decoder is a good choice because it’s a natural choice for language modeling (predicting the next word) since it’s built to mask future tokens – a valuable feature when it’s generating a translation word by word.

The model stacks twelve decoder layers. Since there is no encoder in this setup, these decoder layers do not have the encoder-decoder attention sublayer that vanilla transformer decoder layers have. They still have the self-attention layer, however (masked so it doesn’t peek at future tokens).
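
The masking mentioned above can be sketched in a few lines. The example below is a simplified, single-head illustration in plain Python, not the OpenAI implementation: it takes a matrix of raw attention scores and applies a row-wise softmax in which every future position is set to negative infinity, so its weight becomes exactly zero. The function name and the toy scores are assumptions.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, so exp(...) turns them into 0.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy 3x3 scores (query i against key j), all equal for clarity.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

With equal scores, the first token attends only to itself, the second splits its attention over the first two tokens, and so on. The upper triangle of the weight matrix stays at zero, which is what prevents the model from seeing the word it is asked to predict.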

With this structure, we can proceed to train the model on the same language modeling task: predict the next word using massive (unlabeled) datasets. Just throw the text of 7,000 books at it and have it learn! Books are great for this sort of task since they allow the model to learn to associate related information even when it is separated by a lot of text – something you don’t get, for example, when you’re training with tweets or articles.

$W_e$ is the token embedding matrix and $W_p$ is the position embedding matrix. (Note that $W_p$ is learned, unlike the fixed sinusoidal position encoding of the original Transformer.)
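
For context, $W_e$ and $W_p$ appear in the forward pass of the model. The computation below is written in the notation of the OpenAI paper, which is an assumption with respect to this post: $U$ is the matrix of context tokens, $n$ is the number of layers, and $h_l$ is the output of layer $l$.

$$h_0 = U W_e + W_p$$

$$h_l = \mathrm{transformer\_block}(h_{l-1}), \quad l = 1, \dots, n$$

$$P(u) = \mathrm{softmax}(h_n W_e^{T})$$

Note that the same matrix $W_e$ is used both to embed the input tokens and, transposed, to produce the output distribution over the vocabulary.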

Fine-Tuning with OpenAI

Now that the OpenAI transformer is pre-trained and its layers have been tuned to reasonably handle language, we can start using it for downstream tasks. Let’s first look at sentence classification (classify an email message as “spam” or “not spam”):

If our input sequence is $x_1,\cdots,x_m$ and the label is $y$, we can add a softmax layer on top for classification and train it with the cross-entropy loss.

In principle, we only need to update the parameters to optimize the supervised objective $L_2$, but multi-task learning gives a model that generalizes better. We therefore maximize the combined likelihood $L_3$.

$L_1$ is the objective of the earlier language model.
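
The objectives named above are not spelled out in this text. A reconstruction following the OpenAI paper is given below. The symbols $h_l^m$ (the final layer's activation at the last input token), $W_y$ (the weights of the added softmax layer), $\mathcal{U}$ (the unlabeled corpus), $\mathcal{C}$ (the labeled dataset), $k$ (the context window size), and $\lambda$ (the weight of the auxiliary objective) are taken from the paper and are assumptions with respect to this post.

$$P(y \mid x_1, \dots, x_m) = \mathrm{softmax}(h_l^m W_y)$$

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x_1, \dots, x_m)$$

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})$$

All three are log-likelihoods, so training maximizes them; equivalently, their negatives are the losses that are minimized.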

How to use a pre-trained OpenAI transformer to do sentence classification

The OpenAI paper outlines a number of input transformations to handle the inputs for different types of tasks. The following image from the paper shows the structures of the models and input transformations to carry out different tasks.

How to use a pre-trained OpenAI transformer to do different tasks

BERT: From Decoders to Encoders

The OpenAI transformer gave us a fine-tunable pre-trained model based on the Transformer. But something went missing in this transition from LSTMs to Transformers. ELMo’s language model was bi-directional, but the OpenAI transformer only trains a forward language model. Could we build a transformer-based model whose language model looks both forward and backwards (in the technical jargon – “is conditioned on both left and right context”)?


The input representation of BERT is shown in the figure below. For example, suppose the two sentences “my dog is cute” and “he likes playing” are the input. Why two sentences are needed is explained later; the sentence pair is handled much as in GPT. First, a special token [CLS] is added at the beginning of the first sentence, and a [SEP] is added after “cute” to mark the end of the first sentence. Another [SEP] is added after “##ing”. Note that tokenization splits “playing” into two tokens, “play” and “##ing”. This way of splitting words into finer-grained WordPieces was introduced in the previous machine translation section; it is a common method for handling out-of-vocabulary words. Each token is then mapped to three embeddings:

  1. Embedding of words;
  2. Embedding of positions;
  3. Embedding of segments.

Word embeddings are familiar to everyone. The position embedding is similar: it maps a position (such as 2) to a low-dimensional dense vector. There are only two segment embeddings, since a token belongs either to the first sentence (segment) or to the second. Tokens in the same segment share one segment embedding, which lets the model learn information specific to each segment. For tasks such as sentiment classification there is only one sentence, so the segment id is always 0; for the entailment task the input is two sentences, so the segment id is 0 or 1.

The BERT model requires a fixed sequence length, such as 128. Shorter inputs are padded at the end and longer inputs are truncated, so that the input is always a token sequence of fixed length. The first token is always the special [CLS] token. It has no meaning of its own, so it will (must) encode the meaning of the entire input (the other tokens).
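
The input construction described above can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: it assumes the sentences are already split into WordPiece tokens, truncates by simply cutting the tail (which can drop the final [SEP]), and also returns a padding mask, a common implementation detail that the text above does not mention. The function name `build_bert_input` and the `[PAD]` token string are assumptions.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=16):
    """Assemble tokens, segment ids, position ids and a padding mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    # Truncate sequences that are too long (simplest policy: cut the tail).
    tokens = tokens[:max_len]
    segment_ids = segment_ids[:max_len]

    # Pad short sequences; the mask marks real tokens (1) versus padding (0).
    mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    mask += [0] * pad
    position_ids = list(range(max_len))
    return tokens, segment_ids, position_ids, mask


tokens, segment_ids, position_ids, mask = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"], max_len=12
)
print(tokens)
print(segment_ids)
```

Each of the three id sequences (token, position, segment) is then looked up in its own embedding table, and the three resulting vectors are summed to form the input to the first encoder layer.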

Input of BERT

Masked Language Model

Finding the right task to train a Transformer stack of encoders is a complex hurdle that BERT resolves by adopting a masked language model concept from earlier literature (where it’s called a Cloze task).

Beyond masking 15% of the input, BERT also mixes things a bit in order to improve how the model later fine-tunes. Sometimes it randomly replaces a word with another word and asks the model to predict the correct word in that position.
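
A minimal sketch of this corruption step is shown below. The 80% / 10% / 10% split between replacing with [MASK], replacing with a random token, and leaving the token unchanged comes from the BERT paper rather than from the text above. The function name, the tiny vocabulary, and the choice to skip [CLS] and [SEP] are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens with probability mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")  # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)  # 10%: keep the token unchanged
    return corrupted, labels


example = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"]
corrupted, labels = mask_tokens(example, vocab=["cat", "runs", "blue"], rng=random.Random(0))
print(corrupted)
print(labels)
```

Because a selected token is sometimes left unchanged or swapped for a random one, the model cannot assume that a visible token is correct, which reduces the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it does not).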

masked language model

Two-sentence Tasks

If you look back up at the input transformations the OpenAI transformer does to handle different tasks, you’ll notice that some tasks require the model to say something intelligent about two sentences (e.g. are they simply paraphrased versions of each other? Given a wikipedia entry as input, and a question regarding that entry as another input, can we answer that question?).

To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?

The second task BERT is pre-trained on is a two-sentence classification task.
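
Training pairs for this task can be generated directly from raw text. The sketch below assumes a document is given as a list of at least three sentences and picks the true next sentence half of the time and a different sentence otherwise (the 50/50 split follows the BERT paper). The function name is an assumption, and for simplicity the negative sentence is drawn from the same list rather than from the whole corpus.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Build (sentence_a, sentence_b, is_next) for next-sentence prediction."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B really is the sentence that follows A.
        return sentence_a, sentences[index + 1], True
    # Negative example: B is some other sentence (neither A nor its successor).
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False


document = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
sentence_a, sentence_b, is_next = make_nsp_example(document, 0, random.Random(1))
print(sentence_a, "|", sentence_b, "|", is_next)
```

The pair is then packed into a single input with [CLS] and [SEP] tokens, and the model predicts `is_next` from the output at the [CLS] position.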

Task-specific Models

The BERT paper shows a number of ways to use BERT for different tasks.

different ways to use BERT
  1. For common classification tasks, the input is a single sequence, as shown in the upper right of the figure. All tokens belong to the same segment (id = 0). We take the last-layer output of the first special token [CLS] and feed it to a softmax layer for classification, then fine-tune on labeled classification data.
  2. For tasks that take two sequences as input, such as similarity calculation, the process is shown in the upper left. The tokens of the two sequences belong to different segments (id = 0/1). We again feed the last-layer output of the first special token [CLS] to a softmax layer for classification, then fine-tune on the labeled data.
  3. The third type is question answering, such as the SQuAD v1.1 dataset. The input is a question and a long paragraph containing the answer, and the output locates the answer to the question within this paragraph.
  4. The fourth type of task is sequence labeling, such as named entity recognition. The input is a sentence (token sequence). Every position except [CLS] and [SEP] produces an output tag. For example, B-PER marks the beginning of a person’s name.

BERT for feature extraction

The fine-tuning approach isn’t the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model – a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.

Feature extraction

Which vector works best as a contextualized embedding? I would think it depends on the task. The paper examines six choices (Compared to the fine-tuned model which achieved a score of 96.4):

Feature extraction
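
Several of the choices examined in the paper amount to simple operations on the per-layer vectors of a token. The sketch below shows four such combinations in plain Python. The strategy names and the toy 2-dimensional vectors are assumptions for illustration, and no scores are implied by the example.

```python
def combine_layers(hidden_states, strategy):
    """Turn one token's per-layer vectors into a single feature vector."""
    if strategy == "last":
        return list(hidden_states[-1])
    if strategy == "second_to_last":
        return list(hidden_states[-2])
    if strategy == "sum_last_four":
        return [sum(vals) for vals in zip(*hidden_states[-4:])]
    if strategy == "concat_last_four":
        return [v for layer in hidden_states[-4:] for v in layer]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy example: 12 layers, 2-d vectors; layer l holds [l, 10 * l].
hidden_states = [[float(l), 10.0 * l] for l in range(1, 13)]
summed = combine_layers(hidden_states, "sum_last_four")
concatenated = combine_layers(hidden_states, "concat_last_four")
print(summed)
print(concatenated)
```

Summing keeps the feature size equal to the hidden size, while concatenating multiplies it by four, which matters for the size of the model that consumes the features.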


  1. http://fancyerii.github.io/2019/03/09/bert-theory/#词汇扩展
  2. https://zhuanlan.zhihu.com/p/63115885
  3. http://jalammar.github.io/illustrated-bert/