Appendix F — Using Slurm
F.1 Introduction
This chapter describes how to generate a spotpython configuration on a local machine and run the spotpython code on a remote machine using Slurm. We recommend using a Jupyter notebook (*.ipynb) or a Quarto document (*.qmd) on the local machine to generate the configuration and analyze the results.
F.2 Packages Important for This Chapter
import argparse
import pickle
from math import inf
import torch
from torch.utils.data import TensorDataset
from spotpython.utils.file import load_result, load_and_run_spot_python_experiment
from spotpython.data.manydataset import ManyToManyDataset
from spotpython.data.diabetes import Diabetes
from spotpython.hyperdict.light_hyper_dict import LightHyperDict
from spotpython.fun.hyperlight import HyperLight
from spotpython.utils.init import fun_control_init, surrogate_control_init, design_control_init
from spotpython.spot import Spot
from spotpython.hyperparameters.values import set_hyperparameter, get_tuned_architecture
from spotpython.utils.eda import print_res_table
F.3 Prepare the Slurm Scripts for Runs on the Remote Machine
Two scripts are required to run the spotpython code on the remote machine: startSlurm.sh and startPython.py.
They should be saved in the same directory on the remote machine as the pickle configuration (pkl) file. These two scripts need to be created only once and can be reused for different configurations. For convenience, the scripts are also available as templates.
The startSlurm.sh script is a shell script that contains the following code:
#!/bin/bash

### Resource allocation
#SBATCH --job-name=Test
#SBATCH --account=Accountname/Projektname  # Specify the desired account here
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:1
#SBATCH --time=48:00:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#----
#SBATCH --partition=gpu

# Abort with a usage message if no pickle file was passed
if [ -z "$1" ]; then
    echo "Usage: $0 <path_to_spot.pkl>"
    exit 1
fi

SPOT_PKL=$1

module load conda

### change to your conda environment with spotpython installed via
### pip install spotpython
conda activate spot312

python startPython.py "$SPOT_PKL"

exit
Save the code in a file named startSlurm.sh and copy the file to the remote machine via scp, for example:
scp startSlurm.sh user@144.33.22.1:
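The #SBATCH values in startSlurm.sh are fixed in the file, so a different account or time limit means editing it by hand. As a minimal sketch, assuming you would rather generate the header locally, the hypothetical helper render_slurm_header below (not part of spotpython or Slurm) fills the same directives from Python arguments. The default values simply mirror the script above.

```python
def render_slurm_header(job_name="Test", account="Accountname/Projektname",
                        cpus_per_task=20, gpus=1, time_limit="48:00:00",
                        partition="gpu"):
    """Return the #SBATCH header lines of startSlurm.sh as one string.

    Hypothetical helper for illustration; the directive names are the ones
    used in the script above, the function itself is an assumption.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --error=job.%J.err",
        "#SBATCH --output=job.%J.out",
        f"#SBATCH --partition={partition}",
    ]
    return "\n".join(lines) + "\n"


# Example: same resources as above, but a different job name and time limit
header = render_slurm_header(job_name="a06", time_limit="12:00:00")
print(header)
```

The remaining body of the script (argument check, module load, conda activate, python call) would be appended unchanged.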
The startPython.py script is a Python script that contains the following code:
import argparse
import pickle
from spotpython.utils.file import load_and_run_spot_python_experiment
from spotpython.data.manydataset import ManyToManyDataset

# Uncomment the following if you want to use a custom model (python source code)
# import sys
# sys.path.insert(0, './userModel')
# import my_regressor
# import my_hyper_dict


def main(pickle_file):
    # Load the pickled experiment and run the tuning process
    spot_tuner = load_and_run_spot_python_experiment(filename=pickle_file)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Process a pickle file.')
    parser.add_argument('pickle_file', type=str, help='The path to the pickle file to be processed.')
    args = parser.parse_args()
    main(args.pickle_file)
Save the code in a file named startPython.py and copy the file to the remote machine via scp, for example:
scp startPython.py user@144.33.22.1:
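The command-line handling of startPython.py can be checked locally before the script is copied. The sketch below rebuilds only the argument parser from the script above and passes it an explicit argument list, so it runs without spotpython and without a pickle file. The helper name build_parser is an assumption for illustration.

```python
import argparse


def build_parser():
    # Same parser definition as in startPython.py
    parser = argparse.ArgumentParser(description="Process a pickle file.")
    parser.add_argument("pickle_file", type=str,
                        help="The path to the pickle file to be processed.")
    return parser


# Equivalent to calling: python startPython.py a06_exp.pkl
args = build_parser().parse_args(["a06_exp.pkl"])
print(args.pickle_file)
```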
F.4 Generate a spotpython Configuration
The configuration can be generated on a local machine using the following code:
# generate data
num_samples = 100_000
input_dim = 100
X = torch.randn(num_samples, input_dim)  # random data for example
Y = torch.randn(num_samples, 1)  # random target for example
data_set = TensorDataset(X, Y)

PREFIX = "a06"

fun_control = fun_control_init(
    accelerator="gpu",
    devices="auto",
    num_nodes=1,
    num_workers=19,
    precision="32",
    strategy="auto",
    save_experiment=True,
    PREFIX=PREFIX,
    fun_evals=50,
    max_time=inf,
    data_set=data_set,
    core_model_name="light.regression.NNLinearRegressor",
    hyperdict=LightHyperDict,
    _L_in=input_dim,
    _L_out=1)

fun = HyperLight().fun

# restrict the search space for selected hyperparameters
set_hyperparameter(fun_control, "optimizer", ["Adadelta", "Adam", "Adamax"])
set_hyperparameter(fun_control, "l1", [5, 10])
set_hyperparameter(fun_control, "epochs", [10, 12])
set_hyperparameter(fun_control, "batch_size", [4, 11])
set_hyperparameter(fun_control, "dropout_prob", [0.0, 0.025])
set_hyperparameter(fun_control, "patience", [2, 9])

design_control = design_control_init(init_size=10)

S = Spot(fun=fun, fun_control=fun_control, design_control=design_control)
The configuration is saved as a pickle file that contains the full information. In our example, the filename is a06_exp.pkl.
save_experiment
The fun_control dictionary must be initialized with save_experiment=True to save the experiment/design configuration.
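What "saved as a pickle file" means can be illustrated with the standard library alone. The sketch below writes a small dictionary to a file named after the prefix and reads it back. The dictionary contents and the helper names are assumptions for illustration; this is not how spotpython builds or stores its experiment internally, only the general mechanism.

```python
import math
import os
import pickle
import tempfile


def save_config(config, prefix, directory):
    """Pickle `config` to <directory>/<prefix>_exp.pkl and return the path."""
    path = os.path.join(directory, f"{prefix}_exp.pkl")
    with open(path, "wb") as handle:
        pickle.dump(config, handle)
    return path


def load_config(path):
    """Read a pickled object back from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a configuration; the keys are illustrative only
config = {"PREFIX": "a06", "fun_evals": 50, "max_time": math.inf}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = save_config(config, "a06", tmp)
    restored = load_config(pkl_path)
    file_name = os.path.basename(pkl_path)

print(file_name, restored == config)
```

Pickle stores references to classes rather than their code, so the remote environment needs to be able to import the same modules that were used locally.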
F.5 Copy the Configuration to the Remote Machine
You can copy the configuration to the remote machine using the scp command. The following command copies the configuration to the remote machine 144.33.22.1:
scp a06_exp.pkl user@144.33.22.1:
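If several files have to be transferred, the scp calls can be assembled in Python. The following sketch only builds the argument lists and prints them; actually running them (for example with subprocess.run) is left to the reader. The helper scp_upload_command is hypothetical, and user and 144.33.22.1 are the placeholder values from the text.

```python
import shlex


def scp_upload_command(local_file, user, host, remote_dir=""):
    """Build the argument list for copying one local file to user@host:remote_dir."""
    return ["scp", local_file, f"{user}@{host}:{remote_dir}"]


files = ["startSlurm.sh", "startPython.py", "a06_exp.pkl"]
commands = [scp_upload_command(f, "user", "144.33.22.1") for f in files]

for cmd in commands:
    # shlex.join renders the list as a copy-and-paste shell command
    print(shlex.join(cmd))
```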
F.6 Run the spotpython Code on the Remote Machine
Log in to the remote machine and run the following commands to start the spotpython code:
ssh user@144.33.22.1
# change this to your conda environment!
conda activate spot312
sbatch ./startSlurm.sh a06_exp.pkl
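On success, sbatch prints a line of the form "Submitted batch job <id>". The job ID is needed later for squeue and scancel, so it can be useful to extract it. The sketch below parses such a line; the function name is an assumption, and the sample line uses a made-up ID.

```python
import re


def parse_job_id(sbatch_output):
    """Return the job ID from sbatch's confirmation line, or None if absent."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


# Made-up sample of what sbatch prints after a successful submission
sample = "Submitted batch job 123456\n"
job_id = parse_job_id(sample)
print(job_id)
```

On the remote machine, the input could come from subprocess.run(["sbatch", "./startSlurm.sh", "a06_exp.pkl"], capture_output=True, text=True).stdout.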
F.7 Copy the Results to the Local Machine
After the spotpython code has finished, you can copy the results back to the local machine using the scp command. The following command copies the results to the local machine:
scp user@144.33.22.1:a06_res.pkl .
spotpython generates two files: PREFIX_exp.pkl (experiment file), which stores the information about running the experiment, and PREFIX_res.pkl (result file), which stores the results of the experiment.
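The naming scheme makes it easy to check which files belong to a run. The following sketch derives both file names from the prefix and reports which of them exist in a directory. The helper name and the use of a temporary directory are assumptions made so that the example runs anywhere.

```python
import tempfile
from pathlib import Path


def experiment_files(prefix, directory="."):
    """Map 'experiment' and 'result' to the paths implied by the naming scheme."""
    base = Path(directory)
    return {
        "experiment": base / f"{prefix}_exp.pkl",
        "result": base / f"{prefix}_res.pkl",
    }


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a run whose result file has not been copied back yet
    (Path(tmp) / "a06_exp.pkl").touch()
    status = {kind: path.exists()
              for kind, path in experiment_files("a06", tmp).items()}

print(status)
```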
F.8 Analyze the Results on the Local Machine
The file a06_res.pkl contains the results of the spotpython run. You can analyze the results on the local machine using the following code. Note that PREFIX is the same as in the previous steps, i.e., "a06".
spot_tuner = load_result(PREFIX)
F.8.1 Visualizing the Tuning Progress
Now the spot_tuner object is loaded and you can analyze the results interactively.
spot_tuner.plot_progress(log_y=True, filename=None)
F.8.2 Design Table with Default and Tuned Hyperparameters
print_res_table(spot_tuner)
F.8.3 Plotting Important Hyperparameters
spot_tuner.plot_important_hyperparameter_contour(max_imp=3)
F.8.4 The Tuned Hyperparameters
get_tuned_architecture(spot_tuner)
F.9 Slurm Command Reference
Table F.1 summarizes commands used to manage jobs on a remote machine using Slurm.
Command | Description |
---|---|
sbatch startSlurm.sh a06_exp.pkl | Submit a job to the Slurm scheduler. The job will run the startSlurm.sh script with the argument a06_exp.pkl. |
squeue -u username | Check the status of your jobs in the queue. Replace username with your actual username. |
scancel job_id | Cancel a job. Replace job_id with the actual job ID you want to cancel. |
ssh user@remote_host | Log in to a remote machine. Replace user with your username and remote_host with the hostname or IP address of the remote machine. |
scp source_file user@remote_host:destination_path | Copy a file to a remote machine. Replace source_file with the path to the file you want to copy, user with your username, remote_host with the hostname or IP address of the remote machine, and destination_path with the path where you want to copy the file on the remote machine. |
module load conda | Load the Conda module on the remote machine. This command may vary depending on the system configuration. |
conda activate env_name | Activate a Conda environment. Replace env_name with the name of your Conda environment. |
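The commands in Table F.1 can be combined into a simple wait loop. The sketch below polls squeue for one job ID and returns once the job no longer appears in the queue. It assumes the squeue options -h (no header), -j (job ID) and -o %T (print only the state); the function names, the polling interval, and the injectable run argument are illustrative choices, not part of Slurm or spotpython.

```python
import subprocess
import time


def query_state(job_id):
    """Ask squeue for the state of one job; empty string if it is not queued."""
    result = subprocess.run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_job(job_id, run=query_state, interval=60.0, max_polls=1000):
    """Poll until the job leaves the queue; return the list of observed states."""
    seen = []
    for _ in range(max_polls):
        state = run(job_id)
        if not state:
            break
        seen.append(state)
        time.sleep(interval)
    return seen


# Dry run with a fake squeue so the example works without a cluster
fake_answers = iter(["PENDING", "RUNNING", ""])
history = wait_for_job(42, run=lambda _id: next(fake_answers), interval=0.0)
print(history)
```

On a real cluster, wait_for_job(job_id) would be called with the default run function and the job ID reported by sbatch.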