37  Hyperparameter Tuning of a Transformer Network with PyTorch Lightning

37.1 Basic Setup

This section provides an overview of the hyperparameter tuning process using spotpython and PyTorch Lightning, using the Diabetes data set (see Section E.1) for a regression task.

It shows how spotpython can be integrated into the PyTorch Lightning training workflow and how little code is needed to tune the hyperparameters of a PyTorch Lightning model.

After importing the necessary libraries, the fun_control dictionary is set up via the fun_control_init function. The fun_control dictionary contains

  • PREFIX: a unique identifier for the experiment
  • fun_evals: the maximum number of function evaluations. Here it is set to inf, so the tuning budget is limited by max_time only.
  • max_time: the maximum run time in minutes
  • data_set: the data set. Here we use the Diabetes data set that is provided by spotpython.
  • scaler: the data scaler. Here the TorchStandardScaler provided by spotpython is used to standardize the data.
  • core_model_name: the class name of the neural network model. This neural network model is provided by spotpython.
  • hyperdict: the hyperparameter dictionary. This dictionary is used to define the hyperparameters of the neural network model. It is also provided by spotpython.
  • _L_in: the number of input features. Since the Diabetes data set has 10 features, _L_in is set to 10.
  • _L_out: the number of output features. Since we want to predict a single value, _L_out is set to 1.

The method set_hyperparameter allows the user to modify default hyperparameter settings. Here we restrict the optimizer to a subset of five optimizers and narrow the ranges of epochs, nhead, and dim_feedforward_mult. The HyperLight class is used to define the objective function fun. It connects the PyTorch and the spotpython methods and is provided by spotpython. Finally, a Spot object is created.

# os is needed below for saving the experiment, inf for fun_evals=inf
import os
from math import inf

from spotpython.data.diabetes import Diabetes
from spotpython.hyperdict.light_hyper_dict import LightHyperDict
from spotpython.fun.hyperlight import HyperLight
from spotpython.utils.init import (fun_control_init, surrogate_control_init, design_control_init)
from spotpython.utils.eda import gen_design_table
from spotpython.hyperparameters.values import set_hyperparameter
from spotpython.spot import spot
from spotpython.utils.file import get_experiment_filename
from spotpython.utils.scaler import TorchStandardScaler

fun_control = fun_control_init(
    PREFIX="603",
    TENSORBOARD_CLEAN=True,
    tensorboard_log=True,
    fun_evals=inf,
    max_time=1,
    data_set = Diabetes(),
    scaler=TorchStandardScaler(),
    core_model_name="light.regression.NNTransformerRegressor",
    hyperdict=LightHyperDict,
    _L_in=10,
    _L_out=1)

set_hyperparameter(fun_control, "optimizer", [
                "Adadelta",
                "Adagrad",
                "Adam",
                "AdamW",
                "Adamax",
            ])
set_hyperparameter(fun_control, "epochs", [5, 7])
set_hyperparameter(fun_control, "nhead", [1, 2])
set_hyperparameter(fun_control, "dim_feedforward_mult", [1, 1])

design_control = design_control_init(init_size=5)
surrogate_control = surrogate_control_init(
    noise=True,
    min_Lambda=1e-3,
    max_Lambda=10,
)

fun = HyperLight().fun

spot_tuner = spot.Spot(
    fun=fun,
    fun_control=fun_control,
    design_control=design_control,
    surrogate_control=surrogate_control,
)
Moving TENSORBOARD_PATH: runs/ to TENSORBOARD_PATH_OLD: runs_OLD/runs_2024_12_14_21_08_47
Created spot_tensorboard_path: runs/spot_logs/603_maans08_2024-12-14_21-08-47 for SummaryWriter()
module_name: light
submodule_name: regression
model_name: NNTransformerRegressor

We can take a look at the design table to see the initial design.

print(gen_design_table(fun_control))
| name                 | type   | default        |   lower |   upper | transform             |
|----------------------|--------|----------------|---------|---------|-----------------------|
| d_model_mult         | int    | 4              |    1    |     5   | transform_power_2_int |
| nhead                | int    | 3              |    1    |     2   | transform_power_2_int |
| num_encoder_layers   | int    | 1              |    1    |     4   | transform_power_2_int |
| dim_feedforward_mult | int    | 1              |    1    |     1   | transform_power_2_int |
| epochs               | int    | 7              |    5    |     7   | transform_power_2_int |
| batch_size           | int    | 5              |    5    |     8   | transform_power_2_int |
| optimizer            | factor | Adam           |    0    |     4   | None                  |
| dropout              | float  | 0.1            |    0.01 |     0.1 | None                  |
| lr_mult              | float  | 0.1            |    0.01 |     0.3 | None                  |
| patience             | int    | 5              |    4    |     7   | transform_power_2_int |
| initialization       | factor | xavier_uniform |    0    |     3   | None                  |
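
The integer hyperparameters use the transform_power_2_int transform: a value x in the table is passed to the model as 2**x. The following sketch is a plain re-implementation of this mapping for illustration only (it assumes the transform simply computes 2**x, which matches the epochs and batch_size values that appear in the tuning log below); it is not the spotpython internal function.

# Illustrative only: decode the power-of-2 bounds of the design table into the
# values that actually reach the model (assumes transform_power_2_int == 2**x).
def transform_power_2_int(x: int) -> int:
    return 2**x

for name, lower, upper in [("epochs", 5, 7), ("batch_size", 5, 8), ("patience", 4, 7)]:
    print(f"{name}: [{transform_power_2_int(lower)}, {transform_power_2_int(upper)}]")
# epochs: [32, 128], batch_size: [32, 256], patience: [16, 128]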

If we want to run the hyperparameter tuning process on a remote server, we can save the experiment setup as a pickle file and load it on the remote server.

filename = get_experiment_filename(fun_control["PREFIX"])
# if the userExperiment directory does not exist, create it
if not os.path.exists("userExperiment"):
    os.makedirs("userExperiment")
filename = os.path.join("userExperiment", filename)
if spot_tuner.spot_writer is not None:
    spot_tuner.spot_writer.close()
# remove attribute spot_writer from spot_tuner object
if hasattr(spot_tuner, "spot_writer"):
    delattr(spot_tuner, "spot_writer")
spot_tuner.save_experiment(filename=filename)
Experiment saved to userExperiment/spot_603_experiment.pickle
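
On the remote server, the saved experiment can be restored before calling run(). The following is a minimal sketch using the standard pickle module; it assumes that the file written by save_experiment unpickles directly to the Spot object (recent spotpython versions may also provide a dedicated loader, so check the documentation of your version).

import pickle

# Hypothetical restore step on the remote machine; assumes the pickle file
# produced by save_experiment() unpickles to the Spot object.
with open("userExperiment/spot_603_experiment.pickle", "rb") as handle:
    spot_tuner = pickle.load(handle)

res = spot_tuner.run()  # start the tuning run on the remote machine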

Calling the method run() starts the hyperparameter tuning process on the local machine.

res = spot_tuner.run()

In fun(): config:
{'batch_size': 128,
 'd_model_mult': 4,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.011379790035512986),
 'epochs': 64,
 'initialization': 'kaiming_normal',
 'lr_mult': np.float64(0.2259586849277678),
 'nhead': 2,
 'num_encoder_layers': 8,
 'optimizer': 'Adam',
 'patience': 32}
d_model: 8, dim_feedforward: 16
train_model result: {'val_loss': 23954.24609375, 'hp_metric': 23954.24609375}

In fun(): config:
{'batch_size': 32,
 'd_model_mult': 32,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.029306669346546993),
 'epochs': 128,
 'initialization': 'xavier_normal',
 'lr_mult': np.float64(0.17054932506082773),
 'nhead': 4,
 'num_encoder_layers': 4,
 'optimizer': 'Adagrad',
 'patience': 16}
d_model: 128, dim_feedforward: 256
train_model result: {'val_loss': 20689.533203125, 'hp_metric': 20689.533203125}

In fun(): config:
{'batch_size': 128,
 'd_model_mult': 8,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.09151960187997173),
 'epochs': 32,
 'initialization': 'kaiming_normal',
 'lr_mult': np.float64(0.2577618869653708),
 'nhead': 4,
 'num_encoder_layers': 8,
 'optimizer': 'Adamax',
 'patience': 32}
d_model: 32, dim_feedforward: 64
train_model result: {'val_loss': 23385.4296875, 'hp_metric': 23385.4296875}

In fun(): config:
{'batch_size': 64,
 'd_model_mult': 8,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.049653216042697644),
 'epochs': 32,
 'initialization': 'xavier_uniform',
 'lr_mult': np.float64(0.017817140655063284),
 'nhead': 2,
 'num_encoder_layers': 16,
 'optimizer': 'AdamW',
 'patience': 128}
d_model: 16, dim_feedforward: 32
train_model result: {'val_loss': 23920.986328125, 'hp_metric': 23920.986328125}

In fun(): config:
{'batch_size': 256,
 'd_model_mult': 4,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.0817394594177788),
 'epochs': 64,
 'initialization': 'kaiming_uniform',
 'lr_mult': np.float64(0.1241119885468178),
 'nhead': 2,
 'num_encoder_layers': 2,
 'optimizer': 'Adagrad',
 'patience': 64}
d_model: 8, dim_feedforward: 16
train_model result: {'val_loss': 23945.990234375, 'hp_metric': 23945.990234375}

In fun(): config:
{'batch_size': 32,
 'd_model_mult': 32,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.03546522045759425),
 'epochs': 128,
 'initialization': 'xavier_normal',
 'lr_mult': np.float64(0.25254410128334565),
 'nhead': 4,
 'num_encoder_layers': 2,
 'optimizer': 'Adagrad',
 'patience': 16}
d_model: 128, dim_feedforward: 256
train_model result: {'val_loss': 19936.576171875, 'hp_metric': 19936.576171875}
spotpython tuning: 19936.576171875 [##--------] 15.71% 

In fun(): config:
{'batch_size': 32,
 'd_model_mult': 32,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.046273154967975315),
 'epochs': 128,
 'initialization': 'xavier_uniform',
 'lr_mult': np.float64(0.3),
 'nhead': 4,
 'num_encoder_layers': 2,
 'optimizer': 'Adagrad',
 'patience': 16}
d_model: 128, dim_feedforward: 256
train_model result: {'val_loss': 19263.810546875, 'hp_metric': 19263.810546875}
spotpython tuning: 19263.810546875 [###-------] 31.87% 

In fun(): config:
{'batch_size': 32,
 'd_model_mult': 32,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.1),
 'epochs': 128,
 'initialization': 'kaiming_normal',
 'lr_mult': np.float64(0.3),
 'nhead': 2,
 'num_encoder_layers': 2,
 'optimizer': 'AdamW',
 'patience': 16}
d_model: 64, dim_feedforward: 128
train_model result: {'val_loss': 21748.39453125, 'hp_metric': 21748.39453125}
spotpython tuning: 19263.810546875 [#####-----] 50.69% 

In fun(): config:
{'batch_size': 32,
 'd_model_mult': 32,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.06805391152159825),
 'epochs': 128,
 'initialization': 'xavier_uniform',
 'lr_mult': np.float64(0.3),
 'nhead': 4,
 'num_encoder_layers': 4,
 'optimizer': 'Adam',
 'patience': 16}
d_model: 128, dim_feedforward: 256
train_model result: {'val_loss': 20375.798828125, 'hp_metric': 20375.798828125}
spotpython tuning: 19263.810546875 [########--] 82.91% 

In fun(): config:
{'batch_size': 32,
 'd_model_mult': 32,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.060659418167432505),
 'epochs': 128,
 'initialization': 'xavier_uniform',
 'lr_mult': np.float64(0.3),
 'nhead': 4,
 'num_encoder_layers': 2,
 'optimizer': 'Adagrad',
 'patience': 16}
d_model: 128, dim_feedforward: 256
train_model result: {'val_loss': 19384.5625, 'hp_metric': 19384.5625}
spotpython tuning: 19263.810546875 [##########] 98.80% 

In fun(): config:
{'batch_size': 32,
 'd_model_mult': 32,
 'dim_feedforward_mult': 2,
 'dropout': np.float64(0.01),
 'epochs': 128,
 'initialization': 'xavier_uniform',
 'lr_mult': np.float64(0.3),
 'nhead': 4,
 'num_encoder_layers': 2,
 'optimizer': 'Adagrad',
 'patience': 16}
d_model: 128, dim_feedforward: 256
train_model result: {'val_loss': 19253.60546875, 'hp_metric': 19253.60546875}
spotpython tuning: 19253.60546875 [##########] 100.00% Done...

Note that we have enabled TensorBoard logging, so we can visualize the results with TensorBoard. Execute the following command in a terminal to start TensorBoard.

tensorboard --logdir="runs/"

37.2 Looking at the Results

37.2.1 Tuning Progress

After the hyperparameter tuning run is finished, the progress of the hyperparameter tuning can be visualized with spotpython’s method plot_progress. The black points represent the performance values (score or metric) of hyperparameter configurations from the initial design, whereas the red points represent the hyperparameter configurations found by the surrogate-model-based optimization.

spot_tuner.plot_progress(log_y=True, filename=None)

37.2.2 Tuned Hyperparameters and Their Importance

Results can be printed in tabular form.

from spotpython.utils.eda import gen_design_table
print(gen_design_table(fun_control=fun_control, spot=spot_tuner))
| name                 | type   | default        |   lower |   upper | tuned          | transform             |   importance | stars   |
|----------------------|--------|----------------|---------|---------|----------------|-----------------------|--------------|---------|
| d_model_mult         | int    | 4              |     1.0 |     5.0 | 5.0            | transform_power_2_int |       100.00 | ***     |
| nhead                | int    | 3              |     1.0 |     2.0 | 2.0            | transform_power_2_int |         2.44 | *       |
| num_encoder_layers   | int    | 1              |     1.0 |     4.0 | 1.0            | transform_power_2_int |         0.42 | .       |
| dim_feedforward_mult | int    | 1              |     1.0 |     1.0 | 1.0            | transform_power_2_int |         0.00 |         |
| epochs               | int    | 7              |     5.0 |     7.0 | 7.0            | transform_power_2_int |         0.04 |         |
| batch_size           | int    | 5              |     5.0 |     8.0 | 5.0            | transform_power_2_int |         0.00 |         |
| optimizer            | factor | Adam           |     0.0 |     4.0 | Adagrad        | None                  |         0.05 |         |
| dropout              | float  | 0.1            |    0.01 |     0.1 | 0.01           | None                  |         2.74 | *       |
| lr_mult              | float  | 0.1            |    0.01 |     0.3 | 0.3            | None                  |         0.01 |         |
| patience             | int    | 5              |     4.0 |     7.0 | 4.0            | transform_power_2_int |         0.00 |         |
| initialization       | factor | xavier_uniform |     0.0 |     3.0 | xavier_uniform | None                  |         1.03 | *       |

37.3 Hyperparameter Considerations

  1. d_model (or d_embedding):

    • This is the dimension of the embedding space or the number of expected features in the input.
    • All input features are projected into this dimensional space before entering the transformer encoder.
    • This dimension must be divisible by nhead since each head in the multi-head attention mechanism will process a subset of d_model/nhead features.
  2. nhead:

    • This is the number of attention heads in the multi-head attention mechanism.
    • It allows the transformer to jointly attend to information from different representation subspaces.
    • It’s important that d_model % nhead == 0 to ensure the dimensions are evenly split among the heads.
  3. num_encoder_layers:

    • This specifies the number of transformer encoder layers stacked together.
    • Each layer contains a multi-head attention mechanism followed by position-wise feedforward layers.
  4. dim_feedforward:

    • This is the dimension of the feedforward network model within the transformer encoder layer.
    • Typically, this dimension is larger than d_model (e.g., 2048 for a Transformer model with d_model=512).
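
To see how these four hyperparameters fit together, the following sketch wires them into a plain PyTorch encoder using the values of the best configuration found above (d_model=128, nhead=4, num_encoder_layers=2, dim_feedforward=256, dropout=0.01). It only illustrates the roles of the parameters and the d_model % nhead == 0 constraint; it is not the implementation of spotpython's NNTransformerRegressor.

import torch
import torch.nn as nn

d_model, nhead, num_encoder_layers, dim_feedforward = 128, 4, 2, 256
assert d_model % nhead == 0  # each attention head processes d_model // nhead features

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=nhead,
    dim_feedforward=dim_feedforward,
    dropout=0.01,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)

# Project the 10 Diabetes features into the d_model-dimensional embedding space
# and treat each sample as a sequence of length 1.
x = torch.randn(32, 10)                             # (batch, features)
tokens = nn.Linear(10, d_model)(x).unsqueeze(1)     # (batch, seq_len=1, d_model)
out = encoder(tokens)                               # (batch, 1, d_model)
prediction = nn.Linear(d_model, 1)(out[:, 0, :])    # (batch, 1) regression output
print(prediction.shape)                             # torch.Size([32, 1])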

37.3.1 Important: Constraints and Interconnections

  • d_model and nhead:
    • As mentioned, d_model must be divisible by nhead. This is critical because each attention head operates simultaneously on a part of the embedding, so d_model/nhead should be an integer.
  • num_encoder_layers and dim_feedforward:
    • These parameters are more flexible and can be chosen independently of d_model and nhead.
    • However, the choice of dim_feedforward does influence the computational cost and model capacity, as larger dimensions allow learning more complex representations.
  • One hyperparameter does not strictly need to be a multiple of others except for ensuring d_model % nhead == 0.

37.3.2 Practical Considerations

  1. Setting d_model:

    • Common choices for d_model are powers of 2 (e.g., 256, 512, 1024).
    • Ensure that it matches the size of the input data after the linear projection layer.
  2. Setting nhead:

    • Typically, values are 1, 2, 4, 8, etc., depending on the d_model value.
    • Each head works on a subset of features, so d_model / nhead should be large enough to be meaningful.
  3. Setting num_encoder_layers:

    • Practical values range from 1 to 12 or more depending on the depth desired.
    • Deeper models can capture more complex patterns but are also more computationally intensive.
  4. Setting dim_feedforward:

    • Often set to a multiple of d_model, such as 2048 when d_model is 512.
    • Ensures sufficient capacity in the intermediate layers for complex feature transformations.

Note: d_model Calculation

Since d_model % nhead == 0 is a critical constraint to ensure that the multi-head attention mechanism can operate effectively, spotpython computes the value of d_model based on the nhead value provided by the user. This ensures that the hyperparameter configuration is valid. So, the final value of d_model is a multiple of nhead. spotpython uses the hyperparameter d_model_mult to determine the multiple of nhead to use for d_model, i.e., d_model = nhead * d_model_mult.

Note: dim_feedforward Calculation

Since this dimension is typically larger than d_model (e.g., 2048 for a Transformer model with d_model=512), spotpython uses the hyperparameter dim_feedforward_mult to determine the multiple of d_model to use for dim_feedforward, i.e., dim_feedforward = d_model * dim_feedforward_mult.
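
Putting both notes together, a small worked example with the tuned values from the results table (raw values d_model_mult=5, nhead=2, dim_feedforward_mult=1, each passed through the power-of-2 transform) reproduces the dimensions reported in the tuning log; the transform is again assumed to be 2**x.

# Worked example of the two multiplier rules, using the tuned (raw) values.
d_model_mult = 2**5          # raw value 5 -> 32
nhead = 2**2                 # raw value 2 -> 4
dim_feedforward_mult = 2**1  # raw value 1 -> 2

d_model = nhead * d_model_mult                     # 4 * 32 = 128
dim_feedforward = d_model * dim_feedforward_mult   # 128 * 2 = 256
assert d_model % nhead == 0
print(d_model, dim_feedforward)  # 128 256, matching "d_model: 128, dim_feedforward: 256"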

37.4 Summary

This section presented an introduction to the basic setup of hyperparameter tuning of a transformer with spotpython and PyTorch Lightning.