Step-by-step walkthrough of execute_optimization_run() and every method it calls along the parallel (steady-state) path, with executable examples validated by pytest.
This document traces every step executed by SpotOptim.execute_optimization_run() along the parallel code path (n_jobs > 1), in the order they occur. Each section describes one method or phase with a Python code block that can be executed directly.
The public entry point is optimize(), which manages the outer restart loop and delegates each cycle to execute_optimization_run(). When n_jobs > 1, that dispatcher routes to optimize_steady_state(), which implements a hybrid steady-state parallelisation strategy that overlaps surrogate search with objective function evaluation. This document covers that path in full.
execute_optimization_run() is the routing layer between the outer restart loop in optimize() and the actual optimisation engine. Its sole responsibility is to examine n_jobs and forward all arguments to either optimize_steady_state() (parallel) or optimize_sequential_run() (sequential). It returns a (status, OptimizeResult) tuple in both cases, which optimize() uses to decide whether to restart or terminate. The optional shared_best_y and shared_lock parameters support inter-worker coordination; they are unused in the current steady-state implementation and reserved for future multi-restart parallelism.
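A minimal sketch of that dispatch, with the argument forwarding abbreviated; the method names follow the description above, while the exact signatures are assumptions.

def execute_optimization_run(self, timeout_start, X0=None, y0_known=None,
                             shared_best_y=None, shared_lock=None):
    if self.n_jobs > 1:
        # Parallel: hybrid steady-state engine (never returns "RESTART").
        return self.optimize_steady_state(timeout_start=timeout_start,
                                          X0=X0, y0_known=y0_known)
    # Sequential: classic one-point-per-iteration loop.
    return self.optimize_sequential_run(timeout_start=timeout_start,
                                        X0=X0, y0_known=y0_known)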
optimize_steady_state() is the parallel orchestrator. It follows the same preparatory sequence as its sequential counterpart — seeding the random number generator, generating and curating the initial design — before switching to a pool-based execution model. The method never returns "RESTART": unlike the sequential loop, the steady-state engine does not implement a zero-success-rate restart criterion, and always exits with "FINISHED". All worker pools are managed inside a single contextlib.ExitStack that guarantees orderly shutdown on success and on exception.
The parallel path begins with the same three preparatory calls as the sequential path. set_seed() re-seeds Python’s random module and NumPy’s global generator to ensure reproducibility. get_initial_design() either processes the user-supplied X0 or generates a Latin Hypercube sample in the transformed, reduced search space. curate_initial_design() removes duplicate points and generates replacements as needed. Because points are dispatched concurrently to worker processes or threads in the next phase, curation must be complete before submission: worker processes operate on a snapshot of the optimiser serialised with dill and cannot interact with the main-process state. Immediately after curation, restart injection is applied: a y0_prefilled array (all NaN by default) is populated with y0_known at the position of the matching self.x0 point, using the same distance tolerance as _initialize_run() in the sequential path. This pre-filled array is consumed by Phase 1 to skip one pool submission per restart.
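A sketch of that injection step, assuming Euclidean matching against self.x0 with an illustrative tolerance (the real code reuses whatever tolerance _initialize_run() applies):

# Hypothetical reconstruction of the restart-injection prefill.
y0_prefilled = np.full(len(X0), np.nan)
if y0_known is not None and self.x0 is not None:
    dists = np.linalg.norm(X0 - self.x0, axis=1)
    i = int(np.argmin(dists))
    if dists[i] < 1e-8:                  # tolerance assumed for illustration
        y0_prefilled[i] = y0_known       # Phase 1 skips this pool submission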
Step 4 — GIL Detection and Executor Construction (is_gil_disabled())
_no_gil = is_gil_disabled()
with ExitStack() as _stack:
    eval_pool = _stack.enter_context(
        ThreadPoolExecutor(max_workers=self.n_jobs) if _no_gil
        else ProcessPoolExecutor(max_workers=self.n_jobs)
    )
    search_pool = _stack.enter_context(
        ThreadPoolExecutor(max_workers=self.n_jobs)
    )
is_gil_disabled() queries sys._is_gil_enabled() (Python 3.13+) to detect whether the interpreter was built without the Global Interpreter Lock. On standard GIL builds — Python 3.12 or GIL-enabled 3.13 — objective evaluations are dispatched to a ProcessPoolExecutor so that each worker runs in a separate process, giving true CPU-level parallelism and safe isolation of arbitrary callables. Surrogate search tasks are always dispatched to a ThreadPoolExecutor because they share the main-process heap and require no serialisation. On free-threaded builds (python3.13t) both pools are ThreadPoolExecutor instances: threads achieve true parallelism without the GIL, dill serialisation is eliminated, and the objective function is called directly from the shared heap. The _surrogate_lock (a threading.Lock) is used in both configurations to serialise concurrent surrogate reads and refits.
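A portable sketch of such a check; sys._is_gil_enabled() is only available from Python 3.13 onward, so older interpreters fall back to assuming the GIL is present:

import sys

def is_gil_disabled() -> bool:
    # sys._is_gil_enabled() exists on Python 3.13+ only; on earlier
    # versions the GIL is always enabled.
    check = getattr(sys, "_is_gil_enabled", None)
    return check is not None and not check()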
n_to_submit = 0
for i, x in enumerate(X0):
    if np.isfinite(y0_prefilled[i]):
        # Restart injection: store directly, skip the pool.
        self._update_storage_steady(x, y0_prefilled[i])
        continue
    if _no_gil:
        fut = eval_pool.submit(_thread_eval_task_single, x)
    else:
        pickled_args = dill.dumps((self, x))
        fut = eval_pool.submit(remote_eval_wrapper, pickled_args)
    futures[fut] = "eval"
    n_to_submit += 1
The initial design is partitioned into two groups before any worker is touched. Points that carry a pre-filled y0_prefilled value — set during Step 4 for the restart-injected best point — are stored on the main thread via _update_storage_steady() and skipped entirely. The remaining n_to_submit points are submitted to eval_pool concurrently. On GIL builds each point is serialised together with a snapshot of the optimiser using dill.dumps(); the TensorBoard writer is temporarily set to None before serialisation because SummaryWriter objects are not picklable. remote_eval_wrapper() unpickles the arguments in the worker process, reshapes the point to a (1, d) array, calls evaluate_function(), and returns the (x, y) pair. On free-threaded builds _thread_eval_task_single() calls evaluate_function() without any serialisation overhead. All submitted futures are tracked in a futures dictionary keyed by Future object with the string tag "eval". When y0_known is None (no restart), y0_prefilled is all-NaN and n_to_submit equals n_initial, so behaviour is identical to a fresh run.
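A sketch of the worker-side wrapper under the behaviour just described; the exact error handling and return shape of evaluate_function() are assumptions:

def remote_eval_wrapper(pickled_args):
    # Runs inside the worker process on GIL builds.
    opt, x = dill.loads(pickled_args)
    try:
        y = opt.evaluate_function(np.asarray(x).reshape(1, -1))[0]
    except Exception as e:   # surfaced to the main thread, which drops the point
        y = e
    return x, y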
import time

import numpy as np

from spotoptim import SpotOptim
from spotoptim.function import sphere

# Fresh run — all n_initial points are submitted to the pool.
opt = SpotOptim(fun=sphere, bounds=[(-5, 5), (-5, 5)],
                n_initial=8, max_iter=8, seed=0, n_jobs=2)
result = opt.optimize()
assert result.nfev == 8
print(f"evaluations (no injection): {result.nfev}")

# Restart injection — the known best point is stored directly, not re-evaluated.
x_inject = np.array([0.5, -0.5])
opt2 = SpotOptim(fun=sphere, bounds=[(-5, 5), (-5, 5)],
                 n_initial=6, max_iter=6, seed=1, n_jobs=2, x0=x_inject)
opt2.optimize_steady_state(
    timeout_start=time.time(),
    X0=None,
    y0_known=float(np.sum(x_inject**2)),
)
assert float(np.sum(x_inject**2)) in opt2.y_
print(f"injected value present in y_: {float(np.sum(x_inject**2)):.4f}")
print("parallel initial submission check passed.")
evaluations (no injection): 8
injected value present in y_: 0.5000
parallel initial submission check passed.
while initial_done_count < len(X0):
    done, _ = wait(futures.keys(), return_when=FIRST_COMPLETED)
    for fut in done:
        ftype = futures.pop(fut)
        if ftype != "eval":
            continue
        x_done, y_done = fut.result()
        if not isinstance(y_done, Exception):
            self._update_storage_steady(x_done, y_done)
        initial_done_count += 1
The main thread waits in a loop using concurrent.futures.wait() with return_when=FIRST_COMPLETED, processing each completed future as it arrives. Futures tagged "eval" are the only type active during Phase 1; any unexpected type is silently skipped. When a result is an Exception instance — indicating a worker-side failure — the point is dropped and, if verbose=True, the error is printed; the initial design count still advances so the loop terminates correctly. For valid results, _update_storage_steady() appends the point in original scale to X_ and its objective value to y_, initialising both arrays on the first call. It also updates best_x_, best_y_, min_y, and min_X in-place whenever the new value improves on the current best. Because the main thread is the only writer during Phase 1, no locking is required here.
Once all initial evaluations have completed, three postprocessing steps are applied in fixed order. _init_tensorboard() logs each initial-design point to TensorBoard as a separate hyperparameter run; when tensorboard_log=False (the default) it is a no-op. A guard check follows: if y_ is None or empty, every worker evaluation failed and a RuntimeError is raised with a diagnostic message pointing to likely causes such as unpicklable callables or missing imports inside the worker process. update_stats() refreshes min_y, min_X, counter, and, for noisy objectives, aggregated per-point means and variances. get_best_xy_initial_design() identifies the initial best solution and writes it to best_x_ and best_y_; unlike in the sequential path, these attributes have already been updated incrementally by _update_storage_steady() throughout Phase 1, so here the call mainly aligns the verbose best-solution display with the true running best.
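The guard has roughly the following shape (the message wording here is illustrative):

# Illustrative guard: every worker evaluation failed during Phase 1.
if self.y_ is None or len(self.y_) == 0:
    raise RuntimeError(
        "All initial-design evaluations failed; check that the objective "
        "is picklable and that its imports are available in the workers."
    )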
# No lock needed — no search threads active yet
self.fit_scheduler()
After the initial postprocessing, the surrogate model is fitted to the complete initial design for the first time. No surrogate lock is acquired here because the search_pool has not yet been populated: this is the only point in the parallel path where fit_scheduler() is called without holding _surrogate_lock. fit_scheduler() selects the most recent window_size training points according to selection_method, fits the surrogate, and prepares it for acquisition-function queries. When a list of surrogates was specified at construction, one is chosen probabilistically according to prob_surrogate before fitting.
pending_cands: list = []
_future_n_pts: dict = {}
while (len(self.y_) < effective_max_iter) and (
    time.time() < timeout_start + self.max_time * 60
):
    if _batch_ready():
        _flush_batch()
    # fill open slots with search tasks
    # flush again if threshold crossed
    # wait for any future to complete
    # route result by ftype: "search" or "batch_eval"
The main iteration loop runs until either the evaluation budget (effective_max_iter) or the wall-clock limit (max_time, in minutes) is exhausted. Two data structures coordinate the flow: pending_cands accumulates candidates returned by completed search tasks, and _future_n_pts maps each in-flight batch-eval future to the number of points it carries. Both structures are required for accurate budget accounting: the reserved count is len(y_) + n_in_flight + n_active_searches + len(pending_cands), ensuring the total never exceeds effective_max_iter regardless of concurrency. Each loop iteration performs four actions in order — batch flush, slot filling, second flush, and future wait — before routing completed results by their type tag.
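A minimal end-to-end check of the loop, assuming as in the earlier examples that max_iter is the total evaluation budget; the best value printed depends on the seed and the surrogate:

from spotoptim import SpotOptim
from spotoptim.function import sphere

# Budget of 20 total evaluations on two workers; exercises Phase 1,
# the steady-state loop, and the final result assembly.
opt = SpotOptim(fun=sphere, bounds=[(-5, 5), (-5, 5)],
                n_initial=10, max_iter=20, seed=0, n_jobs=2)
result = opt.optimize()
assert result.nfev == 20
print(f"evaluations : {result.nfev}")
print(f"best : {result.fun:.6f}")
print("steady-state loop check passed.")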
evaluations : 20
best : 0.000000
steady-state loop check passed.
Step 10 — Batch Readiness (_batch_ready())
def _batch_ready() -> bool:
    if not pending_cands:
        return False
    if len(pending_cands) >= self.eval_batch_size:
        return True
    return not any(t == "search" for t in futures.values())
_batch_ready() determines whether the accumulated candidates in pending_cands should be dispatched as a batch evaluation. The primary condition is that the number of pending candidates meets or exceeds eval_batch_size. A secondary starvation-guard condition also triggers a flush when no search tasks remain in flight: without this guard, pending candidates would block indefinitely if the budget is nearly exhausted and no further search tasks can be submitted. With eval_batch_size=1 (the default), _batch_ready() returns True whenever any candidate is pending, preserving the one-point-at-a-time behaviour of the sequential path. Larger values of eval_batch_size amortise process-spawn and inter-process communication overhead across multiple points, improving throughput for expensive objectives on GIL builds.
_flush_batch() stacks all pending candidates into a single (n, d) array X_batch, clears pending_cands, and dispatches the batch to eval_pool as one future. On GIL builds remote_batch_eval_wrapper() is used: it unpickles the optimiser and batch in the worker process, calls evaluate_function(X_batch) once, and returns (X_batch, y_batch). Evaluating the whole batch in a single fun() call avoids repeated process-spawn overhead and allows vectorised objective implementations to exploit NumPy-level parallelism within the worker. On free-threaded builds _thread_batch_eval_task() performs the same operation directly in a thread without serialisation. The future is registered under the "batch_eval" tag and its point count is recorded in _future_n_pts so that budget accounting remains correct while the evaluation is in flight.
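A sketch of that dispatch under the names introduced above; the closure clears the shared list in place rather than rebinding it, so the main loop keeps observing the same object:

def _flush_batch() -> None:
    X_batch = np.vstack(pending_cands)      # stack candidates into (n, d)
    pending_cands.clear()
    if _no_gil:
        fut = eval_pool.submit(_thread_batch_eval_task, X_batch)
    else:
        pickled_args = dill.dumps((self, X_batch))
        fut = eval_pool.submit(remote_batch_eval_wrapper, pickled_args)
    futures[fut] = "batch_eval"
    _future_n_pts[fut] = X_batch.shape[0]   # keep the budget reserved in flight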
n_in_flight = sum(_future_n_pts.values())
n_searches = sum(1 for t in futures.values() if t == "search")
reserved = len(self.y_) + n_in_flight + n_searches + len(pending_cands)
if reserved < effective_max_iter:
    fut = search_pool.submit(_thread_search_task)
    futures[fut] = "search"
At each loop iteration, the main thread fills any open slots in the search pool up to n_jobs concurrent tasks. Before submitting a new search task, the budget guard checks that adding one more in-flight search would not push the reserved count over effective_max_iter: if it would, no further search tasks are submitted and the loop drains existing futures until the budget is consumed. _thread_search_task() is a nested closure that acquires _surrogate_lock before calling suggest_next_infill_point() and releases it on return, so that a concurrent surrogate refit on the main thread cannot corrupt the model while a search thread reads it. Because search tasks are always submitted to a ThreadPoolExecutor, they share the main process heap and require no serialisation regardless of the GIL state.
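The closure itself is small; a sketch consistent with the locking discipline just described:

def _thread_search_task():
    # Hold the lock across the whole acquisition run so a concurrent
    # refit on the main thread cannot swap the surrogate mid-read.
    with _surrogate_lock:
        return self.suggest_next_infill_point()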
suggest_next_infill_point() runs the acquisition function to identify the most promising candidate for the next objective evaluation. The acquisition strategy is controlled by acquisition (default "y", minimising the surrogate prediction directly). The acquisition optimiser (default "differential_evolution") searches the transformed, reduced search space. When the optimiser fails, the fallback strategy selected by acquisition_failure_strategy applies; "random" (the default) draws a uniform random point. In the parallel path this method is always called inside _thread_search_task(), which holds _surrogate_lock for the duration of the call, preventing simultaneous surrogate access by multiple search threads or by the main-thread refit.
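A configuration check, under the assumption that the constructor accepts acquisition="ei" and stores it as an acquisition attribute (the output below suggests both); the best value is seed-dependent:

from spotoptim import SpotOptim
from spotoptim.function import sphere

# Expected-improvement acquisition instead of the default "y".
opt = SpotOptim(fun=sphere, bounds=[(-5, 5), (-5, 5)],
                n_initial=6, max_iter=10, seed=2, n_jobs=2,
                acquisition="ei")
result = opt.optimize()
print(f"acquisition : {opt.acquisition}")
print(f"best : {result.fun:.6f}")
print("suggest_next_infill_point check passed.")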
acquisition : ei
best : 1.083099
suggest_next_infill_point check passed.
Step 14 — Future Completion and Routing
done, _ = wait(futures.keys(), return_when=FIRST_COMPLETED)
for fut in done:
    ftype = futures.pop(fut)
    res = fut.result()
    if ftype == "search":
        x_cand = res
        pending_cands.append(x_cand)
        if _batch_ready():
            _flush_batch()
    elif ftype == "batch_eval":
        _future_n_pts.pop(fut, None)
        X_done, y_done = res
        # process batch...
concurrent.futures.wait() with return_when=FIRST_COMPLETED blocks until at least one future finishes, then returns all futures that are done. Each completed future is popped from futures and routed by its type tag. For a "search" future the returned candidate is appended to pending_cands; if this pushes the list over the batch threshold, _flush_batch() is called immediately to avoid an unnecessary extra loop iteration. For a "batch_eval" future the corresponding entry in _future_n_pts is removed so that budget accounting is unblocked before the storage update begins. If a result is an Exception — indicating a remote failure — the error is optionally printed, the budget entry is removed, and the loop continues; failed search slots are refilled in the next iteration, and failed eval points are lost without charging the budget counter.
for xi, yi in zip(X_done, y_done):
    self.update_success_rate(np.array([yi]))
    self._update_storage_steady(xi, yi)
    self.n_iter_ += 1
with _surrogate_lock:
    self.fit_scheduler()
When a batch evaluation completes successfully, the main thread processes every point in the returned (X_done, y_done) pair before refitting the surrogate. For each point, update_success_rate() records whether the new value improves on best_y_, maintaining the rolling success_rate attribute. _update_storage_steady() appends the point to X_ and y_, updates best_x_ and best_y_ if an improvement is found, and synchronises min_y and min_X. n_iter_ is incremented once per point so that it reflects the total number of post-initial-design evaluations. After all points in the batch have been stored, fit_scheduler() is called once under _surrogate_lock, refitting the surrogate on the updated training window. Batching the refit in this way — one call per batch rather than one call per point — improves efficiency when eval_batch_size > 1 and ensures that in-flight search threads always read a self-consistent model.
optimize_steady_state() exits the while loop when len(y_) >= effective_max_iter or the wall-clock limit is exceeded. It always returns "FINISHED" — there is no restart mechanism in the parallel path. The OptimizeResult object carries the best solution (x, fun), total evaluation count (nfev), iteration count (nit), the full evaluation history (X, y), and a fixed termination message that identifies the result as coming from the steady-state engine. The success flag is always True because partial results are still valid whenever at least one evaluation completed successfully; the guard at the end of Phase 1 ensures the method does not return an empty result.
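A closing check of the returned object, using the field names listed above and assuming, as in the earlier examples, that max_iter is the total evaluation budget:

from spotoptim import SpotOptim
from spotoptim.function import sphere

opt = SpotOptim(fun=sphere, bounds=[(-5, 5), (-5, 5)],
                n_initial=6, max_iter=12, seed=3, n_jobs=2)
result = opt.optimize()
assert result.success                          # partial results still succeed
assert result.nfev == len(result.y) == 12
print(f"x : {result.x}")
print(f"fun : {result.fun:.6f}")
print(f"message : {result.message}")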