Tuning configurations


Training duration

  • max_epochs: Optional[int] = None

  • max_steps: Optional[int] = None

These are, respectively, the maximum number of epochs (full passes over the dataset) and the maximum number of optimisation steps to train for. If both are set, training stops as soon as either limit is reached.
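The stopping rule above can be sketched as a small predicate. This is only an illustration: the function name `should_stop` and its counter arguments are hypothetical, and treating `None` as "no limit" is an assumption based on the defaults shown.

```python
from typing import Optional


def should_stop(
    epochs_done: int,
    steps_done: int,
    max_epochs: Optional[int] = None,
    max_steps: Optional[int] = None,
) -> bool:
    """Return True once either configured limit has been reached."""
    # Assumption: a limit left at None means "no limit on this counter".
    epoch_limit_hit = max_epochs is not None and epochs_done >= max_epochs
    step_limit_hit = max_steps is not None and steps_done >= max_steps
    # Whichever limit is reached first ends training.
    return epoch_limit_hit or step_limit_hit
```

For instance, with `max_epochs=2` and `max_steps=1000`, a run that reaches step 1000 during its second epoch stops there without finishing the epoch.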

Batch size

  • batch_size: int = 64

This is the number of samples in a forward-backward pass. If you use several devices and/or a device batch size greater than \(1\), this must be a multiple of device_batch_size*total_devices.
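The divisibility constraint can be checked ahead of time with a few lines of Python. The helper below is a hypothetical sketch, not part of the documented API. It also returns the ratio between the two quantities, which would be the number of accumulated device batches per update if the difference is made up by gradient accumulation (an assumption: the text only states the constraint).

```python
def accumulation_steps(
    batch_size: int, device_batch_size: int, total_devices: int
) -> int:
    """Check the batch size constraint and return the resulting ratio."""
    # Number of samples processed by all devices in one pass.
    samples_per_pass = device_batch_size * total_devices
    if batch_size % samples_per_pass != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of "
            f"device_batch_size*total_devices={samples_per_pass}"
        )
    return batch_size // samples_per_pass
```

With the default `batch_size` of 64, a device batch size of 8 on 4 devices satisfies the constraint (64 = 2 × 32), whereas the same device batch size on 3 devices does not (64 is not a multiple of 24).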

Adam parameters

  • betas: Tuple[float, float] = (0.9, 0.98)

  • epsilon: float = 1e-8

  • learning_rate: float = 1e-4

  • weight_decay: Optional[float] = None

These are, respectively, the \(β\) and \(ε\) parameters and the base learning rate of the Adam optimizer [Kingma and Ba, 2014], and the weight decay rate. See the PyTorch documentation for more details.
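To show where each of these parameters enters the update, here is a minimal scalar sketch of one Adam step following Kingma and Ba. It is not the optimizer actually used: the function `adam_step` is hypothetical, and applying weight decay as an L2 term added to the gradient (rather than in decoupled form) is an assumption.

```python
import math
from typing import Optional, Tuple


def adam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    learning_rate: float = 1e-4,
    betas: Tuple[float, float] = (0.9, 0.98),
    epsilon: float = 1e-8,
    weight_decay: Optional[float] = None,
) -> Tuple[float, float, float]:
    """One Adam update for a single scalar parameter; t starts at 1."""
    beta1, beta2 = betas
    if weight_decay is not None:
        # Assumption: L2-style decay folded into the gradient.
        grad = grad + weight_decay * param
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # epsilon keeps the denominator away from zero.
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `learning_rate` regardless of the gradient's magnitude.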

Gradient clipping

  • gradient_clipping: Optional[Union[float, int]] = None

If not None, this is the maximum allowed gradient norm. Gradients with a larger norm are rescaled to this norm, preserving their direction. See the PyTorch documentation for implementation details.
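The rescaling can be sketched in plain Python on a gradient represented as a list of floats. This is a simplified illustration with a hypothetical helper, `clip_by_norm`, and it assumes the Euclidean norm. In PyTorch, the corresponding utility is `torch.nn.utils.clip_grad_norm_`, which computes the norm jointly over all parameters.

```python
import math
from typing import List, Optional


def clip_by_norm(grad: List[float], max_norm: Optional[float]) -> List[float]:
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    if max_norm is None:
        # Clipping disabled: the gradient is left untouched.
        return list(grad)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    # Multiplying every component by the same factor keeps the direction.
    scale = max_norm / norm
    return [g * scale for g in grad]
```

For example, the gradient `[3.0, 4.0]` has norm 5, so with a maximum norm of 1 it is scaled by 0.2, giving approximately `[0.6, 0.8]`.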

Learning rate schedule

  • lr_decay_steps: Optional[int] = None

  • warmup_steps: int = 0

These are the numbers of steps in the slanted triangular learning rate schedule [Howard and Ruder, 2018]: the learning rate increases linearly for warmup_steps steps until it reaches learning_rate, then decays linearly to \(0\) over lr_decay_steps steps.

Note that setting lr_decay_steps overrides max_steps.
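The shape of this schedule can be written as a function of the step index. The sketch below is hypothetical and rests on two assumptions that the text does not settle: that the lr_decay_steps decay steps are counted from the end of the warmup, and that the learning rate stays constant after warmup when lr_decay_steps is None.

```python
from typing import Optional


def learning_rate_at(
    step: int,
    learning_rate: float = 1e-4,
    warmup_steps: int = 0,
    lr_decay_steps: Optional[int] = None,
) -> float:
    """Learning rate at a given step of a slanted triangular schedule."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate.
        return learning_rate * step / warmup_steps
    if lr_decay_steps is None:
        # Assumption: no decay is applied when lr_decay_steps is unset.
        return learning_rate
    # Assumption: decay starts right after warmup; lr_decay_steps > 0.
    remaining = lr_decay_steps - (step - warmup_steps)
    return learning_rate * max(0.0, remaining / lr_decay_steps)
```

With `warmup_steps=10` and `lr_decay_steps=100`, the rate peaks at step 10, is back to half the base value at step 60 and reaches zero at step 110 under these assumptions.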



Jeremy Howard and Sebastian Ruder. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 328–339. Association for Computational Linguistics, July 2018. doi:10.18653/v1/P18-1031.


Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations. December 2014. URL: http://arxiv.org/abs/1412.6980 (visited on 2019-03-04), arXiv:1412.6980.