Highlights

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
The authors develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
The authors find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.

Methods

Masking. Following ViT, we divide an image into regular non-overlapping patches. Then we sample a subset of patches and mask (i.e., remove) the remaining ones.
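The masking step can be sketched on patch indices alone. This is a minimal illustration, not the paper's implementation: the function name `random_masking`, uniform sampling without replacement, and the 14x14 patch grid (a 224x224 image with 16x16 patches) are assumptions made for the example; only the 75% ratio comes from the text.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible subset and a masked subset.

    Assumes uniform sampling without replacement (a hypothetical choice
    for this sketch; the summary only says "sample a subset").
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = list(range(num_patches))
    rng.shuffle(perm)
    keep_ids = sorted(perm[:num_keep])   # patches the encoder will see
    mask_ids = sorted(perm[num_keep:])   # patches removed from the input
    return keep_ids, mask_ids


# Assumed setup: 14 x 14 = 196 patches, 75% of them masked.
keep_ids, mask_ids = random_masking(num_patches=196, mask_ratio=0.75, seed=0)
print(len(keep_ids), len(mask_ids))  # 49 147
```

With a 75% ratio the encoder only ever processes a quarter of the patches, which is where the compute savings of the asymmetric design come from.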
MAE encoder. Our encoder is a ViT but applied only on visible, unmasked patches. Just as in a standard ViT, our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks.
MAE decoder. The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. We add positional embeddings to all tokens in this full set.
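How the decoder input is assembled can be sketched as follows. This is an illustrative toy, assuming tokens are plain Python lists of floats; the helper name `build_decoder_input` and all values are hypothetical, and in a real model the mask token and positional embeddings would be learned or precomputed tensors.

```python
def build_decoder_input(encoded_visible, keep_ids, num_patches, mask_token, pos_embed):
    """Restore the full token sequence and add positional embeddings.

    encoded_visible: one vector per visible patch, ordered like keep_ids
    mask_token:      a single vector shared by every missing position
    pos_embed:       one vector per patch position
    """
    slot = dict(zip(keep_ids, encoded_visible))
    tokens = []
    for i in range(num_patches):
        base = slot.get(i, mask_token)  # encoded patch if visible, else the shared mask token
        tokens.append([b + p for b, p in zip(base, pos_embed[i])])
    return tokens


# Toy example: 4 patches, patches 1 and 3 visible, 2-dimensional tokens.
tokens = build_decoder_input(
    encoded_visible=[[1.0, 1.0], [2.0, 2.0]],
    keep_ids=[1, 3],
    num_patches=4,
    mask_token=[-1.0, -1.0],
    pos_embed=[[float(i), float(i)] for i in range(4)],
)
print(tokens)  # [[-1.0, -1.0], [2.0, 2.0], [1.0, 1.0], [5.0, 5.0]]
```

Because every masked position receives the same vector, the positional embedding is the only thing that tells the decoder which patch each mask token stands for.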
Reconstruction target. Our MAE reconstructs the input by predicting the pixel values for each masked patch.
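A reconstruction objective restricted to masked patches might look like the sketch below. The text above does not name the loss, so mean squared error in pixel space is an assumption here, as are the function name `masked_mse` and the toy values.

```python
def masked_mse(pred, target, mask_ids):
    """Mean squared error averaged over masked patches only.

    pred, target: one flattened pixel vector per patch
    mask_ids:     indices of the patches that were removed from the input
    """
    per_patch = []
    for i in mask_ids:
        sq_errors = [(p - t) ** 2 for p, t in zip(pred[i], target[i])]
        per_patch.append(sum(sq_errors) / len(sq_errors))
    return sum(per_patch) / len(per_patch)


# Toy example: 3 patches of 2 pixels each; patch 0 was visible, patches 1 and 2 were masked.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
pred = [[9.0, 9.0], [1.0, 3.0], [2.0, 2.0]]
loss = masked_mse(pred, target, mask_ids=[1, 2])
print(loss)  # 1.0
```

The large error on patch 0 does not affect the result, since visible patches are excluded from the average.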

In this study, we observe on ImageNet and in transfer learning that an autoencoder, a simple self-supervised method similar to techniques in NLP, provides scalable benefits.
Self-supervised learning in vision may now be embarking on a similar trajectory as in NLP.