max_epochs: Optional[int] = None
max_steps: Optional[int] = None
Respectively the maximum number of epochs (full passes over the dataset) and optimisation steps to train for. If both are set, training stops as soon as either limit is reached.
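For illustration, the stopping rule amounts to the following sketch (the function and its arguments are illustrative, not part of this library's API):

```python
def should_stop(epoch: int, step: int, max_epochs=None, max_steps=None) -> bool:
    """Training ends as soon as either limit is reached; None means no limit."""
    if max_epochs is not None and epoch >= max_epochs:
        return True
    if max_steps is not None and step >= max_steps:
        return True
    return False
```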
batch_size: int = 64
This is the number of samples in a forward-backward pass. If you use several devices and/or have device batches of a size bigger than \(1\), this must be a multiple of the number of devices multiplied by the device batch size.
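Assuming the effective batch is assembled by gradient accumulation across devices (an assumption about the implementation, not something this document states), the constraint works out as in this sketch:

```python
# All names here are illustrative assumptions.
num_devices = 2
device_batch_size = 8
batch_size = 64

# batch_size must divide evenly into per-device batches...
assert batch_size % (num_devices * device_batch_size) == 0
# ...and the effective batch is completed by accumulating gradients.
accumulation_steps = batch_size // (num_devices * device_batch_size)  # 4 here
```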
betas: Tuple[float, float] = (0.9, 0.98)
epsilon: float = 1e-8
learning_rate: float = 1e-4
weight_decay: Optional[float] = None
These are respectively the \((β_1, β_2)\) and \(ε\) parameters, the base learning rate, and the weight decay rate for the Adam optimizer [Kingma and Ba, 2014]. See the PyTorch documentation for more details.
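These map directly onto the arguments of torch.optim.Adam, as in this minimal sketch (the linear model is only a stand-in):

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,            # learning_rate
    betas=(0.9, 0.98),  # betas
    eps=1e-8,           # epsilon
    # weight_decay left unset falls back to PyTorch's default of 0.0; how
    # this library maps weight_decay=None is an assumption on our part.
)
```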
gradient_clipping: Optional[Union[float, int]] = None
If not None, this is the maximum allowed gradient norm: gradients with a larger norm will be rescaled to that norm, preserving their direction. See the PyTorch documentation for more details.
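In PyTorch, this kind of clipping is what torch.nn.utils.clip_grad_norm_ implements (a minimal sketch; whether this library uses exactly this call is an assumption):

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in model
loss = model(torch.randn(2, 16)).sum()  # dummy forward pass
loss.backward()

# Rescale the gradients in place so that their global norm is at most 1.0,
# keeping their direction unchanged.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```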
Learning rate schedule
lr_decay_steps: Optional[int] = None
warmup_steps: int = 0
These are the numbers of steps in the slanted triangular learning rate schedule [Howard and Ruder, 2018]: the base learning rate is made to increase linearly from \(0\) to learning_rate over the first warmup_steps steps, then to decay linearly to \(0\) over the following lr_decay_steps steps.
Note that setting warmup_steps to \(0\) (its default value) skips the warmup phase entirely, so that training starts directly at learning_rate.
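A schedule with this shape can be sketched with torch.optim.lr_scheduler.LambdaLR (this is our own reconstruction of the triangular shape, not necessarily how the library implements it):

```python
import torch

def slanted_triangular(warmup_steps: int, lr_decay_steps: int):
    """Return a LambdaLR factor: linear warmup to 1, then linear decay to 0."""
    def factor(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, 1.0 - (step - warmup_steps) / max(1, lr_decay_steps))
    return factor

model = torch.nn.Linear(16, 4)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is learning_rate
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=slanted_triangular(warmup_steps=100, lr_decay_steps=900)
)
# Call scheduler.step() after each optimisation step to advance the schedule.
```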
Jeremy Howard and Sebastian Ruder. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 328–339. Association for Computational Linguistics, July 2018. doi:10.18653/v1/P18-1031.