- Category: Article
- Created: February 15, 2022 12:16 PM
- Status: Open
- URL: https://arxiv.org/pdf/2111.06377.pdf
- Updated: February 15, 2022 2:03 PM
- This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
- The authors develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens) and a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
- The authors find that masking a high proportion of the input image (e.g., 75%) yields a nontrivial and meaningful self-supervisory task.
- Masking. Following ViT, we divide an image into regular non-overlapping patches. Then we sample a subset of patches and mask (i.e., remove) the remaining ones.
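- The random masking step can be sketched in a few lines of NumPy (a minimal illustration, not the paper's code; the `random_masking` helper and all shapes here are assumed for the example):

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Randomly keep a subset of patches and mask (i.e., remove) the rest.

    patches: array of shape (num_patches, patch_dim)
    Returns (visible_patches, keep_indices, masked_indices).
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))  # e.g., keep 25% of patches
    perm = rng.permutation(n)           # uniform random shuffle of patch indices
    keep_idx = np.sort(perm[:n_keep])   # first part: visible patches
    mask_idx = np.sort(perm[n_keep:])   # remainder: masked (removed) patches
    return patches[keep_idx], keep_idx, mask_idx

# Example: 196 patches (a 14x14 grid for a 224x224 image with 16x16 patches)
patches = np.random.default_rng(1).standard_normal((196, 768))
visible, keep_idx, mask_idx = random_masking(patches)
```

Because the encoder sees only the 25% of patches that survive, the bulk of the Transformer compute scales with the visible subset rather than the full image.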
- MAE encoder. Our encoder is a ViT but applied only on visible, unmasked patches. Just as in a standard ViT, our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks.
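- The embedding step above, restricted to visible patches, might look like this (a sketch with assumed names and dimensions; the projection matrix stands in for the ViT patch-embedding layer, and the Transformer blocks themselves are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_visible(visible_patches, keep_idx, proj, pos_embed):
    """Linearly project the visible patches and add the positional
    embeddings for their original grid positions. Mask tokens are
    never seen by the encoder."""
    return visible_patches @ proj + pos_embed[keep_idx]

# Toy shapes: 49 visible patches of raw dim 768, projected to encoder width 1024
visible = rng.standard_normal((49, 768))
keep_idx = np.sort(rng.permutation(196)[:49])
proj = rng.standard_normal((768, 1024)) * 0.02
pos_embed = rng.standard_normal((196, 1024))
tokens = embed_visible(visible, keep_idx, proj, pos_embed)
```

Indexing `pos_embed` by `keep_idx` is what preserves each visible patch's location information despite the shuffled, shortened sequence.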
- MAE decoder. The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. We add positional embeddings to all tokens in this full set.
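- Assembling the decoder's full token set can be sketched as follows (illustrative shapes and names; the shared mask token would be a learned parameter in practice, and here it is a zero vector for simplicity):

```python
import numpy as np

def assemble_decoder_input(encoded, keep_idx, num_patches, mask_token, pos_embed):
    """Build the full token set for the decoder: encoded visible tokens
    at their original positions, the shared mask token everywhere else,
    then positional embeddings added to all tokens."""
    tokens = np.tile(mask_token, (num_patches, 1))  # start with mask token everywhere
    tokens[keep_idx] = encoded                      # scatter visible tokens back
    return tokens + pos_embed                       # positions for the full set

rng = np.random.default_rng(0)
encoded = rng.standard_normal((49, 512))
keep_idx = np.sort(rng.permutation(196)[:49])
mask_token = np.zeros((1, 512))  # stand-in for a shared, learned vector
pos_embed = rng.standard_normal((196, 512))
full = assemble_decoder_input(encoded, keep_idx, 196, mask_token, pos_embed)
```

Adding positional embeddings after the scatter is what tells each mask token which missing patch it is supposed to predict.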
- Reconstruction target. Our MAE reconstructs the input by predicting the pixel values for each masked patch.
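- The paper computes the reconstruction loss (mean squared error in pixel space) only on the masked patches, similar to BERT. A minimal sketch of that loss, with assumed array shapes:

```python
import numpy as np

def masked_mse(pred, target, mask_idx):
    """Per-pixel MSE averaged over masked patches only; visible
    patches are excluded from the loss."""
    diff = pred[mask_idx] - target[mask_idx]
    return float(np.mean(diff ** 2))

# Toy example: predictions of zero against an all-ones target
pred = np.zeros((196, 768))
target = np.ones((196, 768))
mask_idx = np.arange(147)  # the 75% of patches that were masked
loss = masked_mse(pred, target, mask_idx)  # every masked pixel is off by 1
```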
- In this study, we observe on ImageNet and in transfer learning that an autoencoder—a simple self-supervised method similar to techniques in NLP—provides scalable benefits.
- Self-supervised learning in vision may now be embarking on a similar trajectory as in NLP.