41 Hyperparameter Tuning of a Transformer Network with PyTorch Lightning
41.1 Basic Setup
This section gives an overview of hyperparameter tuning with spotpython and PyTorch Lightning. It shows how spotpython can be integrated into the PyTorch Lightning training workflow, using the Diabetes data set (see Section E.1) for a regression task.
After importing the necessary libraries, the fun_control dictionary is set up via the fun_control_init function. The fun_control dictionary contains the following entries:
PREFIX: a unique identifier for the experiment
fun_evals: the number of function evaluations
max_time: the maximum run time in minutes
data_set: the data set. Here we use the Diabetes data set that is provided by spotpython.
core_model_name: the class name of the neural network model. This neural network model is provided by spotpython.
hyperdict: the hyperparameter dictionary. This dictionary is used to define the hyperparameters of the neural network model. It is also provided by spotpython.
_L_in: the number of input features. Since the Diabetes data set has 10 features, _L_in is set to 10.
_L_out: the number of output features. Since we want to predict a single value, _L_out is set to 1.
The method set_hyperparameter allows the user to modify default hyperparameter settings. Here we set the initialization method to ["Default"]. No other initializations are used in this experiment. The HyperLight class is used to define the objective function fun. It connects the PyTorch and the spotpython methods and is provided by spotpython. Finally, a Spot object is created.
Note that TensorBoard logging is enabled, so the results can be visualized with TensorBoard. Execute the following command in the terminal to start TensorBoard.
tensorboard --logdir="runs/"
41.2 Looking at the Results
41.2.1 Tuning Progress
After the hyperparameter tuning run is finished, its progress can be visualized with spotpython’s method plot_progress. The black points represent the performance values (score or metric) of the hyperparameter configurations from the initial design, whereas the red points represent the hyperparameter configurations found by the surrogate-model-based optimization.
d_model:
This is the dimension of the embedding space or the number of expected features in the input.
All input features are projected into this dimensional space before entering the transformer encoder.
This dimension must be divisible by nhead since each head in the multi-head attention mechanism will process a subset of d_model/nhead features.
nhead:
This is the number of attention heads in the multi-head attention mechanism.
It allows the transformer to jointly attend to information from different representation subspaces.
It’s important that d_model % nhead == 0 to ensure the dimensions are evenly split among the heads.
num_encoder_layers:
This specifies the number of transformer encoder layers stacked together.
Each layer contains a multi-head attention mechanism followed by position-wise feedforward layers.
dim_feedforward:
This is the dimension of the feedforward network model within the transformer encoder layer.
Typically, this dimension is larger than d_model (e.g., 2048 for a Transformer model with d_model=512).
41.3.1 Important: Constraints and Interconnections:
d_model and nhead:
As mentioned, d_model must be divisible by nhead. This is critical because each attention head operates simultaneously on a part of the embedding, so d_model/nhead should be an integer.
num_encoder_layers and dim_feedforward:
These parameters are more flexible and can be chosen independently of d_model and nhead.
However, the choice of dim_feedforward does influence the computational cost and model capacity, as larger dimensions allow learning more complex representations.
Apart from the requirement d_model % nhead == 0, no hyperparameter strictly needs to be a multiple of another.
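The divisibility constraint can be checked before any model is built. The following is a minimal sketch in plain Python; the helper name head_dim is hypothetical and is not part of spotpython or PyTorch.

```python
def head_dim(d_model: int, nhead: int) -> int:
    """Return the number of features each attention head processes.

    Hypothetical helper for illustration only. Raises ValueError if
    d_model cannot be split evenly across the heads.
    """
    if d_model <= 0 or nhead <= 0:
        raise ValueError("d_model and nhead must be positive integers")
    if d_model % nhead != 0:
        raise ValueError(f"d_model={d_model} is not divisible by nhead={nhead}")
    return d_model // nhead


# A valid configuration: 512 features split across 8 heads.
print(head_dim(512, 8))  # 64

# An invalid configuration is rejected with a readable message.
try:
    head_dim(10, 4)
except ValueError as err:
    print(err)
```

Failing early in this way avoids spending a function evaluation on a configuration that cannot be instantiated.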
41.3.2 Practical Considerations:
Setting d_model:
Common choices for d_model are powers of 2 (e.g., 256, 512, 1024).
Ensure that it matches the output size of the linear projection layer that embeds the input features.
Setting nhead:
Typical values are 1, 2, 4, 8, and so on, depending on the d_model value.
Each head works on a subset of features, so d_model / nhead should be large enough to be meaningful.
Setting num_encoder_layers:
Practical values range from 1 to 12 or more depending on the depth desired.
Deeper models can capture more complex patterns but are also more computationally intensive.
Setting dim_feedforward:
Often set to a multiple of d_model, such as 2048 when d_model is 512.
This provides sufficient capacity in the intermediate layers for complex feature transformations.
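To make the cost of these choices concrete, the sketch below counts the trainable parameters of a single encoder layer. It assumes the standard layout: an attention block with query, key, value, and output projections, a two-layer feedforward block, and two layer normalizations, all with biases. This is the layout of PyTorch's nn.TransformerEncoderLayer. The function name is hypothetical, and the model provided by spotpython may contain additional layers, such as the input projection and the output head.

```python
def encoder_layer_params(d_model: int, dim_feedforward: int) -> int:
    """Count trainable parameters of one standard encoder layer (assumed layout)."""
    # Query, key, value, and output projections: four d_model x d_model
    # weight matrices, each with a bias vector of length d_model.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> dim_feedforward -> d_model, with biases.
    feedforward = (d_model * dim_feedforward + dim_feedforward) + (
        dim_feedforward * d_model + d_model
    )
    # Two layer normalizations, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feedforward + layer_norms


per_layer = encoder_layer_params(d_model=512, dim_feedforward=2048)
print(per_layer)  # 3152384

# The encoder stack grows linearly with num_encoder_layers.
for num_encoder_layers in (1, 6, 12):
    print(num_encoder_layers, num_encoder_layers * per_layer)
```

Note that nhead does not appear in the count: the heads partition the existing projections rather than adding parameters, so under this layout nhead affects how attention is computed but not the model size.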
Note: d_model Calculation
Since d_model % nhead == 0 is a critical constraint to ensure that the multi-head attention mechanism can operate effectively, spotpython computes the value of d_model based on the nhead value provided by the user. This ensures that the hyperparameter configuration is valid. So, the final value of d_model is a multiple of nhead. spotpython uses the hyperparameter d_model_mult to determine the multiple of nhead to use for d_model, i.e., d_model = nhead * d_model_mult.
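The effect of this rule can be sketched in a few lines of plain Python. The function name compute_d_model and the candidate values below are hypothetical and chosen only for illustration; they are not spotpython's defaults.

```python
from itertools import product


def compute_d_model(nhead: int, d_model_mult: int) -> int:
    """Derive d_model as a multiple of nhead (hypothetical helper)."""
    return nhead * d_model_mult


# Illustrative candidate values, not the defaults used by spotpython.
nhead_values = (1, 2, 4, 8)
d_model_mult_values = (4, 16, 64)

for nhead, d_model_mult in product(nhead_values, d_model_mult_values):
    d_model = compute_d_model(nhead, d_model_mult)
    # By construction, every combination satisfies the constraint.
    assert d_model % nhead == 0
    print(f"nhead={nhead}, d_model_mult={d_model_mult} -> d_model={d_model}")
```

Because the tuner searches over nhead and d_model_mult instead of over d_model directly, every point in the search space is a valid configuration, and d_model_mult is exactly the number of features each head processes.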
Note: dim_feedforward Calculation
Since this dimension is typically larger than d_model (e.g., 2048 for a Transformer model with d_model=512), spotpython uses the hyperparameter dim_feedforward_mult to determine the multiple of d_model to use for dim_feedforward, i.e., dim_feedforward = d_model * dim_feedforward_mult.
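As a worked example, the following sketch chains the two multipliers to recover the configuration mentioned above (d_model=512, dim_feedforward=2048). The function name and the chosen values are hypothetical.

```python
def compute_dim_feedforward(d_model: int, dim_feedforward_mult: int) -> int:
    """Derive dim_feedforward as a multiple of d_model (hypothetical helper)."""
    return d_model * dim_feedforward_mult


# Illustrative tuned values.
nhead = 8
d_model_mult = 64
dim_feedforward_mult = 4

d_model = nhead * d_model_mult  # 8 * 64 = 512
dim_feedforward = compute_dim_feedforward(d_model, dim_feedforward_mult)

print(d_model, dim_feedforward)  # 512 2048
```

With dim_feedforward_mult greater than 1, the feedforward block is always wider than the embedding, which matches the convention described above.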
41.4 Summary
This section presented an introduction to the basic setup of hyperparameter tuning of a transformer with spotpython and PyTorch Lightning.