mBART translation

Note: given the specific nature of the task, only models and tokenizers of the mBART/M2M100 family are allowed.

This task was proposed by Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, and Zettlemoyer [2020], Liu, Gu, Goyal, Li, Edunov, Ghazvininejad, Lewis, and Zettlemoyer [2020] and Tang, Tran, Li, Chen, Goyal, Chaudhary, Gu, and Fan [2020] for training text-to-text models. In our case, we will mostly think of it as a translation task, but it could easily be adapted to other tasks such as summarization. It consists of pretraining an encoder-decoder on a self-supervised denoising task, then fine-tuning it on a translation task, which makes it possible to use non-parallel corpora to improve machine translation.

Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, and Zettlemoyer [2020] experimented with several noise functions, finally settling on text infilling and sentence shuffling. Since sentence shuffling assumes document-level processing and Zelda Rose is meant for sentence-level training, we only implement text infilling here, which consists of masking small spans of tokens with a single <MASK> token each:

  • Original sentence: “The little tabby cat is happy”

  • After infilling: “The little <MASK> happy”

The masked sentence serves as input and the expected output of the model is the original sentence. Since the length of the target cannot easily be deduced from the input, the models used for this task are encoder-decoders, such as the original transformer model [Vaswani et al., 2017].
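The text infilling step described above can be sketched as follows. This is a minimal illustration and not the Zelda Rose implementation: it works on whitespace-split words rather than tokenizer output, the helper names `sample_poisson` and `infill` are hypothetical, and it assumes that span lengths are drawn from a Poisson distribution (the `poisson_lambda` task parameter) until roughly `change_ratio` of the tokens are covered. Zero-length spans and token switching are ignored for simplicity.

```python
import math
import random
from typing import List, Optional

MASK = "<MASK>"


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one sample from a Poisson(lam) distribution (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def infill(
    tokens: List[str],
    change_ratio: float = 0.3,
    poisson_lambda: float = 3.0,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace random spans of tokens with a single MASK each (hypothetical helper)."""
    rng = rng or random.Random()
    n = len(tokens)
    budget = round(n * change_ratio)  # how many tokens we may still mask
    span_of: List[Optional[int]] = [None] * n  # span index covering each token
    next_span = 0
    attempts = 0
    # Cap the number of attempts so the loop always terminates.
    while budget > 0 and attempts < 100:
        attempts += 1
        length = min(sample_poisson(poisson_lambda, rng), budget)
        if length == 0:
            continue  # simplification: zero-length spans are skipped
        start = rng.randrange(0, n - length + 1)
        if any(span_of[i] is not None for i in range(start, start + length)):
            continue  # overlaps an existing span, draw again
        for i in range(start, start + length):
            span_of[i] = next_span
        next_span += 1
        budget -= length
    # Emit unmasked tokens as-is and exactly one MASK per span.
    output: List[str] = []
    previous: Optional[int] = None
    for token, span in zip(tokens, span_of):
        if span is None:
            output.append(token)
        elif span != previous:
            output.append(MASK)
        previous = span
    return output
```

Calling `infill("The little tabby cat is happy".split(), rng=random.Random(0))` returns the word list with some spans collapsed to `<MASK>`; the exact result depends on the seed.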

Translation is, as in Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin [2017], also treated as a text-to-text task.

One innovation of Zelda Rose is that the models can also be trained simultaneously on denoising and translation, with a weight hyperparameter that controls each task’s contribution to the optimized loss.
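One plausible reading of this weighting is a convex combination of the two losses, sketched below. The function name `multitask_loss` is hypothetical, and the exact formula is an assumption: the documentation only states that the weight (the `denoise_loss_ratio` task parameter) lies between 0 and 1.

```python
def multitask_loss(
    denoise_loss: float,
    translation_loss: float,
    denoise_loss_ratio: float = 0.5,
) -> float:
    """Combine both task losses (assumed form: convex combination)."""
    if not 0.0 <= denoise_loss_ratio <= 1.0:
        raise ValueError("denoise_loss_ratio must be between 0 and 1")
    # Assumption: the translation loss receives the complementary weight.
    return (
        denoise_loss_ratio * denoise_loss
        + (1.0 - denoise_loss_ratio) * translation_loss
    )
```

Under this assumption, a ratio of 1 would train on denoising only and a ratio of 0 on translation only.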

Task parameters

change_ratio: float = 0.3
denoise_langs: Optional[List[str]]
denoise_loss_ratio: float = 0.5
poisson_lambda: float = 3.0
source_langs: Optional[List[str]]
target_langs: Optional[List[str]]
strict_langs: bool = False
  • change_ratio is the proportion of tokens to which we apply some change, either masking or switching.

  • denoise_langs, source_langs and target_langs are the codes for the languages in these respective roles. See below for their link with model and data format.

  • denoise_loss_ratio is the weight (between \(0\) and \(1\)) given to the denoising loss in the multitask loss.

  • poisson_lambda is the \(\lambda\) parameter of the Poisson distribution from which the sizes of the masked spans are drawn.

  • strict_langs is a flag controlling whether the language codes are allowed to match only partially between the dataset and the model/tokenizer.

Inputs and outputs

For this task, the train and dev datasets should be in the jsonlines format, every row being a mapping between a language code and the translation in the corresponding language, such as:

{"br": "Me am eus kanet", "fr": "J'ai chanté", "en": "I have sung"}

Or, for compatibility with 🤗 datasets, each row can be an arbitrary mapping that has a "translation" key associated with a mapping in the previous format:

{"translation": {"br": "Me am eus kanet", "fr": "J'ai chanté", "en": "I have sung"}}

Inputs can come either from local files or from a 🤗 dataset.
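As an illustration of the two accepted row layouts, the sketch below normalizes jsonlines rows to a plain mapping from language code to sentence. The helper name `iter_translations` is hypothetical and this is not how Zelda Rose itself loads data; it only assumes the two formats shown above.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_translations(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {langcode: sentence} mapping per non-empty jsonlines row."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        nested = row.get("translation")
        # 🤗 datasets layout: the mapping lives under the "translation" key.
        if isinstance(nested, dict):
            yield nested
        else:
            yield row
```

With a local file, this could be used as `with open(path, encoding="utf-8") as f: rows = list(iter_translations(f))`, where `path` points to a jsonlines file in either layout.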


[LLG+20]

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880. Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.703.


Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics, 8:726–742, November 2020. doi:10.1162/tacl_a_00343.


Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. Multilingual Translation with Extensible Multilingual Pretraining and Finetuning. August 2020. arXiv:2008.00401, doi:10.48550/arXiv.2008.00401.

[VSP+17]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, 5998–6008. Long Beach, California, 2017. Curran Associates, Inc.