Masked Language Modeling
Masked language modeling is the task proposed, among others, by Devlin, Chang, Lee, and Toutanova [2018] for pretraining bidirectional encoders. It consists in training a model to denoise a sequence, where the noise preserves the sentence length, most famously by masking some words:
Original sentence: “The little tabby cat is happy”
Masked sentence: “The little <MASK> cat is happy”
The masked sentence serves as input and the expected output of the model is the original sentence. For transformer models, this usually entails using an encoder-only architecture with a thin word-predictor “head” (for instance a linear layer) on top of its last hidden states. Muller gives a good general introduction to these models.
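As a minimal illustration of this input/output pairing, the following sketch builds a training pair from the example sentence. The helper name `make_denoising_pair` and the whitespace tokenization are assumptions made for this example only; real models operate on subword tokens.

```python
def make_denoising_pair(sentence, masked_positions, mask_token="<MASK>"):
    """Return (model input, expected output) for a word-level masking example.

    Hypothetical helper for illustration, not part of Zelda Rose.
    """
    words = sentence.split()
    masked = [
        mask_token if i in masked_positions else word
        for i, word in enumerate(words)
    ]
    # The noise preserves the length: one input word per original word.
    assert len(masked) == len(words)
    return " ".join(masked), sentence


model_input, expected_output = make_denoising_pair(
    "The little tabby cat is happy", masked_positions={2}
)
print(model_input)      # The little <MASK> cat is happy
print(expected_output)  # The little tabby cat is happy
```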
More specifically, following e.g. the pretraining procedure of RoBERTa [Liu et al., 2019], in Zelda Rose, we apply two types of changes to the input sentences:
Masking: replacing some tokens with a mask token, as in the example above.
Switching: replacing some tokens with another random token from the vocabulary.
change_ratio: float = 0.15
mask_ratio: float = 0.8
switch_ratio: float = 0.1
change_ratio is the proportion of tokens to which we apply some change, either masking or switching.
mask_ratio is the proportion of tokens targeted for a change that are replaced by a mask token.
switch_ratio is the proportion of tokens targeted for a change that are replaced by a random non-mask token.
Note that all of these should be floats between 0.0 and 1.0 and that you should have mask_ratio + switch_ratio <= 1.0 too.
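The way the three ratios interact can be sketched as follows. This is a minimal pure-Python illustration, not Zelda Rose's actual implementation: the names `check_ratios` and `apply_noise`, the independent per-token sampling, and the choice to leave the remaining targeted tokens unchanged (as in BERT) are all assumptions.

```python
import random


def check_ratios(change_ratio=0.15, mask_ratio=0.8, switch_ratio=0.1):
    """Validate the constraints on the ratios (hypothetical helper)."""
    for name, value in [
        ("change_ratio", change_ratio),
        ("mask_ratio", mask_ratio),
        ("switch_ratio", switch_ratio),
    ]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    if mask_ratio + switch_ratio > 1.0:
        raise ValueError("mask_ratio + switch_ratio must be <= 1.0")


def apply_noise(token_ids, mask_id, vocab_size, rng,
                change_ratio=0.15, mask_ratio=0.8, switch_ratio=0.1):
    """Return (noised ids, targeted flags), both as long as the input.

    Assumes 0 <= mask_id < vocab_size and vocab_size >= 2.
    """
    check_ratios(change_ratio, mask_ratio, switch_ratio)
    noised, targeted = [], []
    for token in token_ids:
        # Each token is targeted independently with probability change_ratio.
        if rng.random() >= change_ratio:
            noised.append(token)
            targeted.append(False)
            continue
        targeted.append(True)
        draw = rng.random()
        if draw < mask_ratio:
            noised.append(mask_id)
        elif draw < mask_ratio + switch_ratio:
            # Draw uniformly among all ids except mask_id; the draw may
            # coincide with the original token.
            candidate = rng.randrange(vocab_size - 1)
            noised.append(candidate if candidate < mask_id else candidate + 1)
        else:
            # Assumption: targeted tokens that are neither masked nor
            # switched are left as they are.
            noised.append(token)
    return noised, targeted


rng = random.Random(0)
ids = list(range(10, 30))
noised, targeted = apply_noise(ids, mask_id=0, vocab_size=100, rng=rng)
```

With the default values, about 0.15 × 0.8 = 12% of all tokens are masked and about 0.15 × 0.1 = 1.5% are switched. Under the assumption made in this sketch, the remaining 1.5% are targeted but left unchanged.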
Inputs and outputs
For this task, the train and dev datasets should be raw text, every line containing a single sample (typically a sentence). It can come either from a local text file or from a 🤗 text dataset.
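A minimal sketch of the local file format described above, with one sample per line. The helper `read_samples` and the decision to skip blank lines are assumptions for illustration; Zelda Rose's own data loading may behave differently.

```python
import os
import tempfile


def read_samples(path):
    """Yield one sample per non-empty line of a UTF-8 text file."""
    with open(path, encoding="utf-8") as in_stream:
        for line in in_stream:
            sample = line.strip()
            if sample:  # assumption: blank lines are not samples
                yield sample


# Write a tiny two-sample dataset to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, "train.txt")
    with open(train_path, "w", encoding="utf-8") as out_stream:
        out_stream.write("The little tabby cat is happy\nThe dog sleeps\n")
    samples = list(read_samples(train_path))

print(samples)
```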
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs], October 2018. URL: http://arxiv.org/abs/1810.04805 (visited on 2019-02-16).
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs], July 2019. URL: http://arxiv.org/abs/1907.11692 (visited on 2019-09-05).