Masked Language Modeling

Masked language modeling is the task proposed, among others, by Devlin, Chang, Lee, and Toutanova [2018] for pretraining bidirectional encoders. It consists in training a model to denoise a sequence, where the noise preserves the sequence length, most famously by masking some words:

  • Original sentence: “The little tabby cat is happy”

  • Masked sentence: “The little <MASK> cat is happy”

The masked sentence serves as input and the expected output of the model is the original sentence. For transformer models, this usually entails using an encoder-only architecture with a thin word predictor “head” (for instance a linear layer) on top of its last hidden states. Muller [2022] gives a good general introduction to these models.
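To make the role of the word predictor head concrete, here is a minimal, dependency-free sketch of a linear layer applied to the last hidden states. It is only an illustration: the names `linear_head` and `predict_tokens`, the toy weights and the three-word vocabulary are all assumptions made for this example, and a real model would use a tensor library and learned parameters.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear_head(hidden_states: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Map each position's hidden state to one score (logit) per vocabulary entry.

    hidden_states: seq_len rows of size hidden_dim (the encoder's last layer)
    weight: vocab_size rows of size hidden_dim
    bias: vocab_size values
    """
    return [
        [sum(h * w for h, w in zip(state, row)) + b for row, b in zip(weight, bias)]
        for state in hidden_states
    ]


def predict_tokens(logits: Matrix, vocab: List[str]) -> List[str]:
    """Pick the highest-scoring vocabulary entry at every position."""
    return [vocab[max(range(len(scores)), key=scores.__getitem__)] for scores in logits]


# Toy example (hypothetical values): 3 positions, hidden size 2, 3-word vocabulary.
vocab = ["cat", "dog", "happy"]
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[2.0, 0.0], [0.0, 2.0], [1.5, 1.5]]
bias = [0.0, 0.0, 0.0]

logits = linear_head(hidden_states, weight, bias)
predictions = predict_tokens(logits, vocab)
```

The head produces exactly one prediction per input position, which is why the noise applied to the input has to preserve the sequence length.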

More specifically, following e.g. the pretraining procedure of RoBERTa [Liu et al., 2019], in Zelda Rose, we apply two types of changes to the input sentences:

  • Masking: replacing some tokens with a mask token, as in the example above.

  • Switching: replacing some tokens with another random token from the vocabulary.

Task parameters

change_ratio: float = 0.15
mask_ratio: float = 0.8
switch_ratio: float = 0.1

  • change_ratio is the proportion of tokens to which we apply some change, either masking or switching.

  • mask_ratio is the proportion of tokens targeted for a change that are replaced by a mask token.

  • switch_ratio is the proportion of tokens targeted for a change that are replaced by a random non-mask token.

Note that all of these should be floats between 0.0 and 1.0, and that mask_ratio + switch_ratio should not exceed 1.0.
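The way the three ratios combine can be sketched as follows. This is not Zelda Rose's actual implementation: the function name `apply_mlm_noise`, the independent per-token sampling, the choice to leave the remaining share of targeted tokens unchanged (as in BERT) and the use of `None` labels for untargeted positions are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def apply_mlm_noise(
    tokens: Sequence[str],
    vocabulary: Sequence[str],
    mask_token: str = "<MASK>",
    change_ratio: float = 0.15,
    mask_ratio: float = 0.8,
    switch_ratio: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (noised tokens, labels); labels are None at untargeted positions."""
    for name, value in (
        ("change_ratio", change_ratio),
        ("mask_ratio", mask_ratio),
        ("switch_ratio", switch_ratio),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    if mask_ratio + switch_ratio > 1.0:
        raise ValueError("mask_ratio + switch_ratio must not exceed 1.0")

    rng = rng or random.Random()
    # Switching draws from the vocabulary minus the mask token.
    candidates = [t for t in vocabulary if t != mask_token]

    noised: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        # Assumption: each token is targeted independently with probability change_ratio.
        if rng.random() >= change_ratio:
            noised.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model has to recover the original token here
        draw = rng.random()
        if draw < mask_ratio:
            noised.append(mask_token)
        elif draw < mask_ratio + switch_ratio:
            noised.append(rng.choice(candidates))
        else:
            # Assumption: the remaining share is targeted but left unchanged.
            noised.append(token)
    return noised, labels


# Usage on the example sentence, with a seeded generator for reproducibility.
sentence = "The little tabby cat is happy".split()
noised, labels = apply_mlm_noise(sentence, vocabulary=sentence, rng=random.Random(0))
```

With the default values, about 15% of the tokens are targeted; of those, 80% are masked, 10% are switched, and under the assumption above the last 10% are kept as they are. The output always has the same length as the input.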

Inputs and outputs

For this task, the train and dev datasets should be raw text, with every line containing a single sample (typically a sentence). They can come either from a local text file or from a 🤗 text dataset.
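As an illustration of the expected format for a local file, the sketch below reads one sample per line. The helper name `iter_samples`, the file name `train.txt`, the toy sentences and the decision to skip blank lines are assumptions made for this example, not a description of how Zelda Rose loads its data.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Union


def iter_samples(path: Union[str, Path]) -> Iterator[str]:
    """Yield one sample per line of a UTF-8 raw text file."""
    with Path(path).open(encoding="utf-8") as stream:
        for line in stream:
            sample = line.strip()
            # Assumption: blank lines carry no sample and are skipped.
            if sample:
                yield sample


# Write a toy dataset (hypothetical content) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "train.txt"
    train_path.write_text(
        "The little tabby cat is happy\n\nThe dog sleeps\n", encoding="utf-8"
    )
    samples = list(iter_samples(train_path))
```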



Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs], October 2018 (visited on 2019-02-16).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs], July 2019 (visited on 2019-09-05).

Britney Muller. BERT 101 🤗 State Of The Art NLP Model Explained. March 2022 (visited on 2023-02-25).