Running spotoptim on the GWDG NHR Cluster (Slurm)
This chapter shows how to run a parallel spotoptim optimization on the GWDG NHR cluster from the glogin-p3.hpc.gwdg.de login node, using the standard96s:shared CPU partition with 16 cores and n_jobs=16.
The flow has three phases:
- Locally: build a `SpotOptim` instance and freeze it with `save_experiment(...)`. The pickle holds the objective, bounds, surrogate, `n_jobs`, seed, and everything else needed to resume.
- On the cluster: `sbatch` a thin shell script that loads `uv`, then calls `SpotOptim.load_experiment(...)`, runs `optimize()`, and writes the result with `save_result(...)`.
- Locally again: `scp` the result back and analyse it with `SpotOptim.load_result(...)`.
The legacy a_06_slurm.qmd chapter (spotpython) needed two scripts on the remote machine: startSlurm.sh plus a startPython.py wrapper that called load_and_run_spot_python_experiment(...). With spotoptim the wrapper is unnecessary: SpotOptim.load_experiment(...) and opt.optimize() are the public API, and parallelism is configured by setting n_jobs on the SpotOptim constructor — there is no separate “control” object.
Prerequisites
- SSH access to `glogin-p3.hpc.gwdg.de` as your project user `uxxxxx` (NHR account names follow the pattern `u` + five digits), with your public key registered via id.academiccloud.de → Security → SSH Public Keys. The login pattern follows the standard GWDG documentation; see docs.hpc.gwdg.de/start_here/connecting.
- A `~/.ssh/config` host alias makes the rest of the chapter copy-pastable:
```
# ~/.ssh/config (local machine)
Host glogin-p3
    Hostname glogin-p3.hpc.gwdg.de
    User uxxxxx
```
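A quick connectivity check (optional, shown here only as a suggestion) confirms the alias before anything else depends on it:

```bash
ssh glogin-p3 hostname   # should print the login node's host name
```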
One-time cluster setup

Log in once and clone the spotoptim repository under `$HOME/workspace`. The GWDG environment provides `uv` as a module, so no conda step is needed.
```bash
ssh glogin-p3
mkdir -p ~/workspace && cd ~/workspace
git clone https://github.com/sequential-parameter-optimization/spotoptim.git
cd spotoptim

# Compute nodes need an explicit proxy when downloading dependencies.
export http_proxy=http://www-cache.gwdg.de:3128
export https_proxy=http://www-cache.gwdg.de:3128

module purge
module load gcc uv
uv python pin 3.13
uv sync        # creates .venv/, installs spotoptim editable
uv run python -c "from spotoptim import SpotOptim; print('ok')"
```

After `uv sync` succeeds, `~/workspace/spotoptim/.venv/` is the environment the Slurm script will activate via `uv run`. There is no per-job environment setup; thanks to the lock file, a resync only does work when the dependencies change.
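When you later pull updates to the repository (a hypothetical follow-up, not needed for the first run), the same pair of commands re-synchronises the environment:

```bash
cd ~/workspace/spotoptim && git pull
module load gcc uv && uv sync   # fast no-op unless pyproject.toml / uv.lock changed
```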
Build the experiment locally
Create a SpotOptim instance with n_jobs=16 and freeze it. The example uses the built-in 3-D sphere test function so that the chapter is reproducible without external data:
```python
from spotoptim import SpotOptim
from spotoptim.function import sphere

PREFIX = "a06"

opt = SpotOptim(
    fun=sphere,
    bounds=[(-5.0, 5.0)] * 3,
    n_initial=16,       # one batch fills all 16 workers in parallel
    max_iter=80,        # total evaluation budget (incl. the initial design)
    n_jobs=16,          # process pool size on the compute node
    eval_batch_size=1,  # set > 1 if the objective accepts a batch
    seed=0,
    verbose=True,
)

opt.save_experiment(prefix=PREFIX, path=".")
# → writes ./a06_exp.pkl
```
save_experiment uses dill so that closures and lambdas survive the round-trip. The output file is named <PREFIX>_exp.pkl. Replace the body of fun=... with any picklable callable to plug your own problem in.
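As a hedged sketch of that substitution (`my_objective` is made up for illustration, and the per-point calling convention is assumed to match the built-in `sphere`), a module-level function pickles cleanly into the experiment file:

```python
import numpy as np
from spotoptim import SpotOptim

# Hypothetical objective: a shifted, weighted quadratic on R^3.
# Defined at module level (not a notebook lambda) so the pickle stays portable.
def my_objective(x):
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.arange(1, x.size + 1) * (x - 0.5) ** 2))

opt = SpotOptim(
    fun=my_objective,
    bounds=[(-5.0, 5.0)] * 3,
    n_initial=16,
    max_iter=80,
    n_jobs=16,
    seed=0,
)
opt.save_experiment(prefix="myprob", path=".")  # → ./myprob_exp.pkl
```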
n_jobs and eval_batch_size
n_jobs > 1 activates optimize_steady_state() — see Parallel Optimization for the full data flow. Use -1 to mean “all CPU cores on the worker node”. eval_batch_size collects that many candidate points before a single dispatch to the pool, which is worth setting only when your objective natively handles batched input.
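For orientation, a sketch of what a batch-capable objective could look like (illustrative only; confirm the batch shape spotoptim actually passes before relying on it): it accepts several candidate points at once and returns one value per point, so a single pool dispatch covers `eval_batch_size` evaluations.

```python
import numpy as np

# Illustrative batched objective: takes an (n_points, n_dim) array and
# returns one objective value per row in a single call.
def sphere_batch(X):
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return np.sum(X ** 2, axis=1)

print(sphere_batch([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]))  # [14.  0.]
```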
Copy the experiment to the cluster
```bash
ssh glogin-p3 'mkdir -p ~/runs/spotoptim/logs'
scp a06_exp.pkl glogin-p3:~/runs/spotoptim/
```

`~/runs/spotoptim/` is a convention used in this chapter; pick any directory, but keep `logs/` as a sub-directory because the Slurm script writes its `.out`/`.err` files there.
The Slurm submission script
The repository ships a reference batch script at scripts/slurm/run_spotoptim.sh. Inline:
```bash
#!/bin/bash
#SBATCH --job-name=spotoptim
#SBATCH --partition=standard96s:shared
#SBATCH --cpus-per-task=16
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=logs/spotoptim_%j.out
#SBATCH --error=logs/spotoptim_%j.err
#SBATCH --constraint=inet

set -euo pipefail
EXP_PKL="$1"

# GWDG proxy + thread pinning (one BLAS thread per worker process).
export http_proxy=http://www-cache.gwdg.de:3128
export https_proxy=http://www-cache.gwdg.de:3128
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export PYTHONUNBUFFERED=1

mkdir -p logs
module purge 2>/dev/null || true
module load gcc uv

cd "${SPOTOPTIM_REPO:-$HOME/workspace/spotoptim}"
uv run python scripts/slurm/run_spotoptim.py "$EXP_PKL"
```

OMP_NUM_THREADS=1 is mandatory
The 16 worker processes inherit OMP_NUM_THREADS from the batch environment. Without pinning, each worker would launch its own BLAS thread-pool of cpu_count() threads, leading to 16 × 16 = 256 threads on a shared node and severe contention. The graph-elites benchmark in ~/workspace/graph-elites/gwdg/slurm/run_spot_monet.sh uses the same pinning for the same reason.
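A minimal, self-contained check (the script below is hypothetical, not part of the repository) that the pinning actually reaches the worker processes: spawn a small process pool inside the job and read the variables back from the workers.

```python
import os
from concurrent.futures import ProcessPoolExecutor

PIN_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")

def worker_env(_):
    # Worker processes inherit the batch environment from the parent job step.
    return {var: os.environ.get(var) for var in PIN_VARS}

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for env in pool.map(worker_env, range(2)):
            print(env)  # every value should be "1" in a correctly pinned job
```

Run it with `uv run python` inside an interactive allocation, or as an extra line in the batch script, if you want to confirm the pinning on your partition.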
The Python runner scripts/slurm/run_spotoptim.py is a 30-line wrapper:
```python
import argparse
from pathlib import Path

from spotoptim import SpotOptim

p = argparse.ArgumentParser()
p.add_argument("exp_pkl", type=Path)
args = p.parse_args()

exp_path = args.exp_pkl.resolve()
prefix = exp_path.name.removesuffix("_exp.pkl")

opt = SpotOptim.load_experiment(str(exp_path))
result = opt.optimize()
opt.save_result(prefix=prefix, path=str(exp_path.parent))

print(f"nfev={result.nfev} fun={result.fun:.6g} x={result.x}")
```

`optimize()` honours the `n_jobs` value baked into the experiment, so the runner itself never mentions parallelism; it does load → run → save.
Submit the job
```bash
ssh glogin-p3
cd ~/runs/spotoptim
sbatch ~/workspace/spotoptim/scripts/slurm/run_spotoptim.sh \
    ~/runs/spotoptim/a06_exp.pkl
# → Submitted batch job 12345678
```

Pass the experiment path as an absolute path; the Slurm script `cd`s into the spotoptim repo, so a relative path would resolve there instead of in `~/runs/spotoptim/`.
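If you would rather stay in `~/runs/spotoptim` and type only the file name, one option (a convenience suggestion, not part of the shipped script) is to let the shell resolve the absolute path at submit time:

```bash
cd ~/runs/spotoptim
sbatch ~/workspace/spotoptim/scripts/slurm/run_spotoptim.sh "$(realpath a06_exp.pkl)"
```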
Add --qos=2h to the sbatch call when your run fits in 2 hours; the high-priority QoS usually starts within minutes but rejects walltime > 2 h. Override the time at submit-time, not in the script header:
```bash
sbatch --qos=2h --time=00:30:00 \
    ~/workspace/spotoptim/scripts/slurm/run_spotoptim.sh ...
```

Monitor the job
```bash
squeue --me
sacct -j <JOBID> --format=JobID,State,Elapsed,MaxRSS,ExitCode
tail -f ~/runs/spotoptim/logs/spotoptim_<JOBID>.out
```

A successful run prints, near the end of the `.out` file:
```
=== spotoptim job ===
Job ID : 12345678
CPUs   : 16
Mem    : 16384 MB
…
nfev=80 fun=0.000123 x=[ 0.0089 -0.0083  0.0027]
=== Job completed at … ===
```
If you see OUT_OF_MEMORY from sacct, raise --mem (the budget should be roughly n_jobs × 1 GB; spotoptim’s surrogate adds a small constant on top).
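Applying that rule of thumb to this chapter's configuration, as an illustration rather than a measured requirement: 16 workers × 1 GB ≈ 16 GB, so a resubmission with headroom can override the header value on the command line:

```bash
# 16 workers × 1 GB ≈ 16 GB; add headroom for the surrogate and resubmit.
sbatch --mem=20G \
    ~/workspace/spotoptim/scripts/slurm/run_spotoptim.sh \
    ~/runs/spotoptim/a06_exp.pkl
```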
Copy the result back and analyse
```bash
scp glogin-p3:~/runs/spotoptim/a06_res.pkl .
```

```python
from spotoptim import SpotOptim

opt = SpotOptim.load_result("a06_res.pkl")
print("best fun :", opt.best_y_)
print("best x   :", opt.best_x_)
print("nfev     :", opt.X_.shape[0])
opt.plot_progress(log_y=True)
```

`load_result` reinitialises the surrogate and the LHS sampler that were stripped before pickling, so all the analysis methods on `SpotOptim` (`plot_progress`, `print_results`, `get_importance`, …) work as if the experiment had been run locally.
Slurm command reference
| Command | Description |
|---|---|
| `sbatch run_spotoptim.sh <prefix>_exp.pkl` | Submit a job that runs `optimize()` on the supplied pickle. |
| `sbatch --qos=2h --time=02:00:00 …` | High-priority QoS; faster scheduling, max walltime 2 h. |
| `squeue --me` | List your queued and running jobs. |
| `sacct -j <JOBID> --format=JobID,State,Elapsed,MaxRSS,ExitCode` | Per-job accounting (use MaxRSS to right-size `--mem`). |
| `scancel <JOBID>` | Cancel a job. |
| `sinfo -p standard96s:shared` | Node availability on the shared CPU partition. |
| `module load gcc uv` | Load gcc (often a uv dependency) plus the uv module on the login or compute node. |
| `scp file glogin-p3:~/runs/spotoptim/` | Copy a file to the cluster. |
| `scp glogin-p3:~/runs/spotoptim/<prefix>_res.pkl .` | Copy the result back. |
| `show-quota` | Show your storage quotas (HOME, project, workspaces). |
See also
- Parallel Optimization: internal control flow when `n_jobs > 1`.
- GWDG HPC documentation: partitions, QoS, module system, GPU partitions (`grete`).