Appendix F — Using Slurm

F.1 Introduction

This appendix describes how to generate a spotpython configuration on a local machine and run the spotpython code on a remote machine using Slurm. We recommend using a Jupyter notebook (*.ipynb) or a Quarto document (*.qmd) on the local machine to generate the configuration and analyze the results.

F.2 Packages Used in this Appendix

import argparse
import pickle
from math import inf
import torch
from spotpython.utils.file import load_result, load_and_run_spot_python_experiment
from spotpython.data.manydataset import ManyToManyDataset
from spotpython.data.diabetes import Diabetes
from spotpython.hyperdict.light_hyper_dict import LightHyperDict
from spotpython.fun.hyperlight import HyperLight
from spotpython.utils.init import (fun_control_init, surrogate_control_init, design_control_init)
from spotpython.spot import Spot
from spotpython.hyperparameters.values import set_hyperparameter, get_tuned_architecture
from torch.utils.data import TensorDataset
from spotpython.utils.eda import print_res_table

F.3 Prepare the Slurm Scripts for Runs on the Remote Machine

Two scripts are required to run the spotpython code on the remote machine:

  • startSlurm.sh and
  • startPython.py.

They should be saved on the remote machine in the same directory as the pickle configuration file (*.pkl). Both scripts need to be created only once and can be reused for different configurations; the following listings can serve as templates.

The startSlurm.sh script is a shell script that contains the following code:

#!/bin/bash
 
### resource allocation
#SBATCH --job-name=Test
#SBATCH --account=accountname/projectname  # specify your account or project name here
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:1
#SBATCH --time=48:00:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#----
#SBATCH --partition=gpu

if [ -z "$1" ]; then
    echo "Usage: $0 <path_to_spot.pkl>"
    exit 1
fi

SPOT_PKL=$1

module load conda

### change to your conda environment with spotpython installed via
### pip install spotpython
conda activate spot312

python startPython.py "$SPOT_PKL"

exit

Save the code in a file named startSlurm.sh and copy the file to the remote machine via scp, e.g.,

scp startSlurm.sh user@144.33.22.1:

The startPython.py script is a Python script that contains the following code:

import argparse
from spotpython.utils.file import load_and_run_spot_python_experiment
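# ManyToManyDataset must be importable so that pickled experiments referencing it can be loaded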
from spotpython.data.manydataset import ManyToManyDataset

# Uncomment the following if you want to use a custom model (python source code)
# import sys
# sys.path.insert(0, './userModel')
# import my_regressor
# import my_hyper_dict


def main(pickle_file):
    spot_tuner = load_and_run_spot_python_experiment(filename=pickle_file)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Process a pickle file.')
    parser.add_argument('pickle_file', type=str, help='The path to the pickle file to be processed.')

    args = parser.parse_args()
    main(args.pickle_file)

Save the code in a file named startPython.py and copy the file to the remote machine via scp, e.g.,

scp startPython.py user@144.33.22.1:
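As a quick check of the environment, startPython.py can also be run directly, without Slurm, once the configuration file from Section F.4 is available; this is exactly the command that startSlurm.sh executes:

conda activate spot312
python startPython.py a06_exp.pkl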

F.4 Generate a spotpython Configuration

The configuration can be generated on a local machine using the following code:

# generate data
num_samples = 100_000
input_dim = 100
X = torch.randn(num_samples, input_dim)  # random data for example
Y = torch.randn(num_samples, 1)  # random target for example
data_set = TensorDataset(X, Y)

PREFIX="a06"


fun_control = fun_control_init(
    accelerator="gpu",
    devices="auto",
    num_nodes=1,
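    # number of dataloader workers (cf. --cpus-per-task=20 in startSlurm.sh)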
    num_workers=19,
    precision="32",
    strategy="auto",
    save_experiment=True,
    PREFIX=PREFIX,
    fun_evals=50,
    max_time=inf,
    data_set=data_set,
    core_model_name="light.regression.NNLinearRegressor",
    hyperdict=LightHyperDict,
    _L_in=input_dim,
    _L_out=1)

fun = HyperLight().fun

set_hyperparameter(fun_control, "optimizer", [ "Adadelta", "Adam", "Adamax"])
set_hyperparameter(fun_control, "l1", [5,10])
set_hyperparameter(fun_control, "epochs", [10,12])
set_hyperparameter(fun_control, "batch_size", [4,11])
set_hyperparameter(fun_control, "dropout_prob", [0.0, 0.025])
set_hyperparameter(fun_control, "patience", [2,9])

design_control = design_control_init(init_size=10)

S = Spot(fun=fun, fun_control=fun_control, design_control=design_control)

The configuration is saved as a pickle file that contains the full experiment information. In our example, the filename is a06_exp.pkl.

Warning: Note on save_experiment

The fun_control dictionary must be initialized with save_experiment=True to save the experiment/design configuration.
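Before copying the file in the next step, you can verify that it was written, assuming it is saved to the current working directory:

ls -lh a06_exp.pkl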

F.5 Copy the Configuration to the Remote Machine

You can copy the configuration to the remote machine using the scp command. The following command copies the configuration to the remote machine 144.33.22.1:

scp a06_exp.pkl user@144.33.22.1:

F.6 Run the spotpython Code on the Remote Machine

Log in to the remote machine and run the following commands to start the spotpython code:

ssh user@144.33.22.1
# change this to your conda environment!
conda activate spot312 
sbatch ./startSlurm.sh a06_exp.pkl
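sbatch prints the ID of the submitted job. Because startSlurm.sh redirects output and errors to job.%J.out and job.%J.err, the run can be monitored as follows (a sketch; replace 12345 with the reported job ID):

squeue -u user          # list your queued and running jobs
tail -f job.12345.out   # follow the job's standard output
cat job.12345.err       # inspect errors if the job fails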

F.7 Copy the Results to the Local Machine

After the spotpython code has finished, you can copy the results back to the local machine using the scp command. The following command copies the results to the local machine:

scp user@144.33.22.1:a06_res.pkl .

Note: Experiment and Result Files

spotpython generates two files:

  • PREFIX_exp.pkl (experiment file), which stores the information about running the experiment, and
  • PREFIX_res.pkl (result file), which stores the results of the experiment.

F.8 Analyze the Results on the Local Machine

The file a06_res.pkl contains the results of the spotpython run. You can analyze the results on the local machine using the following code. Note: PREFIX is the same as in the previous steps, i.e., "a06".

spot_tuner = load_result(PREFIX)

F.8.1 Visualizing the Tuning Progress

Now the spot_tuner object is loaded and you can analyze the results interactively.

spot_tuner.plot_progress(log_y=True, filename=None)

F.8.2 Design Table with Default and Tuned Hyperparameters

print_res_table(spot_tuner)

F.8.3 Plotting Important Hyperparameters

spot_tuner.plot_important_hyperparameter_contour(max_imp=3)

F.8.4 The Tuned Hyperparameters

get_tuned_architecture(spot_tuner)

F.9 Slurm Command Reference

Table F.1 summarizes commands used to manage jobs on a remote machine using Slurm.

Table F.1: Slurm and related commands

sbatch startSlurm.sh a06_exp.pkl
    Submit a job to the Slurm scheduler. The job runs the startSlurm.sh script with the argument a06_exp.pkl.

squeue -u username
    Check the status of your jobs in the queue. Replace username with your actual username.

scancel job_id
    Cancel a job. Replace job_id with the ID of the job you want to cancel.

ssh user@remote_host
    Log in to a remote machine. Replace user with your username and remote_host with the hostname or IP address of the remote machine.

scp source_file user@remote_host:destination_path
    Copy a file to a remote machine. Replace source_file with the path of the file to copy and destination_path with the target path on the remote machine.

module load conda
    Load the Conda module on the remote machine. This command may vary depending on the system configuration.

conda activate env_name
    Activate a Conda environment. Replace env_name with the name of your Conda environment.
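Putting the pieces together, a complete round trip might look like the following sketch; user, host, and file names are the placeholders used throughout this appendix:

# on the local machine: copy scripts and configuration
scp startSlurm.sh startPython.py a06_exp.pkl user@144.33.22.1:

# on the remote machine: submit the job and check its status
ssh user@144.33.22.1
sbatch ./startSlurm.sh a06_exp.pkl
squeue -u user
exit

# back on the local machine (after the job has finished): fetch the results
scp user@144.33.22.1:a06_res.pkl .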