3 Simulation and Surrogate Modeling
- We will consider the interplay between
- mathematical models,
- numerical approximation,
- simulation,
- computer experiments, and
- field data
- Experimental design will play a key role in our developments, but not in the classical regression and response surface methodology sense
- Challenging real-data/real-simulation examples benefiting from modern surrogate modeling methodology
- We will consider the classical, response surface methodology (RSM) approach, and then move on to more modern approaches
- All approaches are based on surrogates
3.1 Surrogates
- Gathering data is expensive, and sometimes getting exactly the data you want is impossible or unethical
- Surrogate: substitute for the real thing
- In statistics, draws from predictive equations derived from a fitted model can act as a surrogate for the data-generating mechanism (a sketch of this idea follows the list)
- Benefits of the surrogate approach:
- Surrogate could represent a cheaper way to explore relationships, and entertain “what ifs?”
- Surrogates favor faithful yet pragmatic reproduction of dynamics over other common modeling goals, such as:
- interpretation,
- establishing causality, or
- identification
- Many numerical simulators are deterministic, whereas field observations are noisy or have measurement error
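As a concrete illustration of the surrogate idea, here is a minimal sketch in which a Gaussian process fit to a handful of runs of a toy "expensive" simulator stands in for further evaluations. The simulator, design size, and scikit-learn modeling choices are all illustrative assumptions, not part of any particular application.

```python
# Minimal sketch of the surrogate idea: fit a cheap statistical model to a
# handful of expensive simulator runs, then use its predictions (and draws
# from its predictive distribution) in lieu of further simulation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_simulator(x):          # stand-in for a costly computer code
    return np.sin(2 * np.pi * x) + x

X = np.linspace(0, 1, 8).reshape(-1, 1)   # a small simulation campaign
y = expensive_simulator(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-8)
gp.fit(X, y)                               # train the surrogate

Xnew = np.linspace(0, 1, 200).reshape(-1, 1)
mean, sd = gp.predict(Xnew, return_std=True)   # cheap predictions with uncertainty
draws = gp.sample_y(Xnew, n_samples=5)         # draws act as a data-generating surrogate
```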
3.1.1 Costs of Simulation
- Computer simulations are generally cheaper (but not always!) than physical observation
- Some computer simulations can be just as expensive as field experimentation, but computer modeling is regarded as easier because:
- the experimental apparatus is better understood
- more aspects may be controlled.
3.1.2 Mathematical Models and Meta-Models
- Use of mathematical models leveraging numerical solvers has been commonplace for some time
- Mathematical models became more complex, requiring more resources to simulate/solve numerically
- Practitioners increasingly relied on meta-models built from limited simulation campaigns
3.1.3 Surrogates = Trained Meta-models
- Data collected via expensive computer evaluations were used to tune flexible functional forms that could be used in lieu of further simulation to
- save money or computational resources;
- cope with an inability to perform future runs (expired licenses, off-line or over-impacted supercomputers)
- Trained meta-models became known as surrogates
3.1.4 Computer Experiments
- Computer experiment: the design and running of computer simulations, and the fitting of meta-models to their output.
- Like an ordinary statistical experiment, except the data are generated by computer codes rather than physical or field observations, or surveys
- Surrogate modeling is statistical modeling of computer experiments
3.1.5 Limits of Mathematical Modeling
- Mathematical biologists, economists and others had reached the limit of equilibrium-based mathematical modeling with cute closed-form solutions
- Stochastic simulations replace deterministic solvers based on FEM, Navier–Stokes or Euler methods
- Agent-based simulation models are used to explore predator-prey (Lotka–Volterra) dynamics, the spread of disease, and the management of inventory or patients in health insurance markets (a toy predator-prey sketch follows below)
- Consequence: the distinction between surrogate and statistical model is all but gone
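To give a flavor of the kind of stochastic simulation meant here, below is a toy discrete-time (tau-leaping style) predator-prey simulation. The rates, initial populations, and step size are invented for illustration and not calibrated to any real system.

```python
# A toy stochastic predator-prey (Lotka-Volterra) simulation of the kind that
# replaces closed-form equilibrium analysis; repeated runs give different
# noisy realizations rather than a single deterministic solution.
import numpy as np

rng = np.random.default_rng(1)
prey, pred = 50, 20
birth, predation, death, dt = 1.0, 0.02, 0.8, 0.05
traj = [(prey, pred)]

for _ in range(2000):
    # Poisson ("tau-leaping") approximation of the three reaction channels
    births = rng.poisson(birth * prey * dt)
    eaten = rng.poisson(predation * prey * pred * dt)
    deaths = rng.poisson(death * pred * dt)
    prey = max(prey + births - eaten, 0)
    pred = max(pred + eaten - deaths, 0)
    traj.append((prey, pred))

traj = np.array(traj)   # one stochastic trajectory of (prey, predator) counts
```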
3.1.6 Why Computer Simulations are Necessary
- You can’t seed a real community with Ebola and watch what happens
- If there’s (real) field data, say on a historical epidemic, further experimentation may be almost entirely limited to the mathematical and computer modeling side
- Classical statistical methods offer little guidance
3.1.7 Simulation Requirements
- Simulation should
- enable rich diagnostics that help criticize the model,
- support understanding of its sensitivity to inputs and other configurations,
- provide the ability to optimize, and
- allow refinement both automatically and with expert intervention
- And it has to do all that while remaining computationally tractable
- One perspective is so-called response surface methods (RSMs):
- a poster child from industrial statistics’ heyday, well before information technology became a dominant industry
3.2 Applications of Surrogate Models
The four most common usages of surrogate models are:
- Augmenting Expensive Simulations: Surrogate models act as a ‘curve fit’ to approximate the results of expensive simulation codes, enabling predictions without rerunning the primary source. This provides significant speed improvements while maintaining useful accuracy.
- Calibration of Predictive Codes: Surrogates bridge the gap between simpler, faster but less accurate models and more accurate, slower models. This multi-fidelity approach allows for improved accuracy without the full computational expense.
- Handling Noisy or Missing Data: Surrogates smooth out random or systematic errors in experimental or computational data, filling gaps and revealing overall trends while filtering out extraneous details (sketched in code after this list).
- Data Mining and Insight Generation: Surrogates help identify functional relationships between variables and their impact on results. They enable engineers to focus on critical variables and visualize data trends effectively.
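As a small illustration of the noisy/missing-data usage, the sketch below fits a surrogate with an explicit noise term to scattered synthetic observations, smoothing them and predicting at unobserved inputs. The data, kernel, and scikit-learn choices are assumptions made purely for the example.

```python
# Sketch of the "noisy/missing data" usage: a surrogate with a noise term
# smooths scattered observations and fills the gaps between design points.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 25)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 25)   # noisy observations

kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=0.05)  # noise level learned from data
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

Xgap = np.linspace(0, 1, 100).reshape(-1, 1)      # includes unobserved inputs
trend, sd = gp.predict(Xgap, return_std=True)     # smoothed trend plus uncertainty
```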
3.3 DACE and RSM
Mathematical models implemented in computer codes are used to circumvent the need for expensive field data collection. These models are particularly useful when dealing with highly nonlinear response surfaces, high signal-to-noise ratios (which often involve deterministic evaluations), and a global scope. As a result, a new approach is required in comparison to Response Surface Methodology (RSM), which is discussed in Section 6.1.
With the improvement in computing power and simulation fidelity, researchers gain higher confidence and a better understanding of the dynamics in physical, biological, and social systems. However, the expansion of configuration spaces and increasing input dimensions necessitate more extensive designs. High-performance computing (HPC) allows for thousands of runs, whereas previously only tens were possible. This shift towards larger models and training data presents new computational challenges.
Research questions for DACE (Design and Analysis of Computer Experiments) include how to design computer experiments that make efficient use of computation and how to meta-model computer codes to save on simulation effort. The choice of surrogate model for computer codes significantly impacts the optimal experiment design, and the preferred model-design pairs can vary depending on the specific goal.
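On the design side, a common starting point in DACE is a space-filling design such as a Latin hypercube. A minimal sketch using scipy's quasi-Monte Carlo module follows; the input dimension, run budget, and bounds are placeholders, not recommendations.

```python
# Space-filling Latin hypercube design for a two-input computer experiment.
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=42)     # two inputs, e.g. temperature and pressure
unit_design = sampler.random(n=50)             # 50 runs spread over the unit square
design = qmc.scale(unit_design, l_bounds=[300, 1.0], u_bounds=[400, 5.0])
# each row is one input setting at which the computer code would be run
```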
The combination of computer simulation, design, and modeling with field data from similar real-world experiments introduces a new category of computer model tuning problems. The ultimate goal is to automate these processes to the greatest extent possible, allowing for the deployment of HPC with minimal human intervention.
One of the remaining differences between RSM and DACE lies in how they handle noise. DACE employs replication, a technique that would not be used in a deterministic setting, to separate signal from noise. Traditional RSM is best suited for situations where a substantial proportion of the variability in the data is due to noise, and where the acquisition of data values can be severely limited. Consequently, RSM is better suited for a different class of problems, aligning with its intended purposes.
Two very good texts on computer experiments and surrogate modeling are Santner, Williams, and Notz (2003) and Forrester, Sóbester, and Keane (2008). The former is the canonical reference in the statistics literature and the latter is perhaps more popular in engineering.
Example 3.1 (Example: DACE and RSM) Imagine you are a chemical engineer tasked with optimizing a chemical process to maximize yield. You can control temperature and pressure, but repeated experiments show variability in yield due to inconsistencies in raw materials.
Using RSM: You would use RSM to design a series of experiments varying temperature and pressure. You would then fit a response surface (a mathematical model) to the data, helping you understand how changes in temperature and pressure affect yield. Using this model, you can identify optimal conditions for maximizing yield despite the noise.
Using DACE: If instead you use a computational model to simulate the chemical process and want to account for numerical noise or uncertainty in model parameters, you might use DACE. You would run simulations at different conditions, possibly repeating them to assess variability and build a surrogate model that accurately predicts yields, which can be optimized to find the best conditions.
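A rough sketch of how the two analyses in this example might look in code is given below. The yield function, noise level, and design are invented for illustration, with a second-order polynomial standing in for the RSM response surface and a Gaussian process for the DACE surrogate.

```python
# Toy version of Example 3.1: RSM-style quadratic fit to noisy runs vs.
# DACE-style GP interpolation of deterministic simulator output.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)

def yield_process(temp, pres, noisy=True):
    y = -(temp - 0.6)**2 - (pres - 0.4)**2 + 1.0        # coded units in [0, 1]
    return y + (rng.normal(0, 0.05) if noisy else 0.0)  # raw-material noise

X = rng.uniform(0, 1, size=(30, 2))                      # (temperature, pressure) settings

# RSM: fit a second-order polynomial response surface to noisy physical runs
y_noisy = np.array([yield_process(t, p) for t, p in X])
t, p = X[:, 0], X[:, 1]
A = np.column_stack([np.ones_like(t), t, p, t * p, t**2, p**2])
beta, *_ = np.linalg.lstsq(A, y_noisy, rcond=None)       # quadratic surface coefficients

# DACE: interpolate deterministic simulator output with a GP surrogate
y_det = np.array([yield_process(t, p, noisy=False) for t, p in X])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-8).fit(X, y_det)
# either fit can then be optimized over (temp, pres) to locate the best operating conditions
```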
3.3.1 Noise Handling in RSM and DACE
Noise in RSM: In experimental settings, noise often arises from variability in experimental conditions, measurement error, or other uncontrollable factors, and it can significantly affect the response variable \(Y\). Replication is the standard procedure for handling this noise in RSM.
Noise in DACE: In computer experiments, noise may not be present in the traditional sense, since simulations can be deterministic; variability instead arises from uncertainty in input parameters or model inaccuracies. DACE therefore relies predominantly on interpolation, typically via Gaussian process (kriging) models, to fit deterministic data accurately, adding an explicit statistical noise model when needed.
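To make the replication idea concrete, the toy sketch below estimates the signal from per-setting means and the noise level from the pooled replicate variance. The design points, replicate counts, and noise level are synthetic stand-ins.

```python
# Replication in the RSM spirit: repeated runs at the same input settings
# let signal (means) and noise (replicate variance) be estimated separately.
import numpy as np

rng = np.random.default_rng(7)
settings = np.array([0.2, 0.5, 0.8])      # three design points
reps = 5                                   # replicates per design point
obs = np.array([[np.sin(2 * np.pi * x) + rng.normal(0, 0.3) for _ in range(reps)]
                for x in settings])

signal_estimate = obs.mean(axis=1)                  # per-setting means estimate the response
noise_estimate = obs.var(axis=1, ddof=1).mean()     # pooled replicate variance estimates noise
```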
3.4 Updating a Surrogate Model
A surrogate model is updated by incorporating new data points, known as infill points, into the model to improve its accuracy and predictive capabilities. This process is iterative and involves the following steps (a minimal code sketch follows the list):
- Identify Regions of Interest: The surrogate model is analyzed to determine areas where it is inaccurate or where further exploration is needed. This could be regions with high uncertainty or areas where the model predicts promising results (e.g., potential optima).
- Select Infill Points: Infill points are new data points chosen based on specific criteria, such as:
- Exploitation: Sampling near predicted optima to refine the solution.
- Exploration: Sampling in regions of high uncertainty to improve the model globally.
- Balanced Approach: Combining exploitation and exploration to ensure both local and global improvements.
- Evaluate the True Function: The true function (e.g., a simulation or experiment) is evaluated at the selected infill points to obtain their corresponding outputs.
- Update the Surrogate Model: The surrogate model is retrained or updated using the new data, including the infill points, to improve its accuracy.
- Repeat: The process is repeated until the model meets predefined accuracy criteria or the computational budget is exhausted.
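Here is a minimal sketch of this loop, using a simple "sample where predictive uncertainty is highest" infill criterion and a toy stand-in for the true function. The budget, kernel, and candidate grid are illustrative assumptions.

```python
# Iterative infill loop: fit surrogate, pick the most uncertain candidate,
# evaluate the true function there, update, and repeat until the budget ends.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def true_function(x):                        # stand-in for a simulation or experiment
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, 5).reshape(-1, 1)      # small initial design
y = true_function(X).ravel()
candidates = np.linspace(0, 2, 200).reshape(-1, 1)

for _ in range(10):                          # computational budget
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-8).fit(X, y)
    _, sd = gp.predict(candidates, return_std=True)
    x_new = candidates[[np.argmax(sd)]]      # infill point: highest predictive uncertainty
    X = np.vstack([X, x_new])                # evaluate the true function and update the model
    y = np.append(y, true_function(x_new).ravel())
```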
Definition 3.1 (Infill Points) Infill points are strategically chosen new data points added to the surrogate model. They are selected to:
- Reduce uncertainty in the model.
- Improve predictions in regions of interest.
- Enhance the model’s ability to identify optima or trends.
The selection of infill points is often guided by infill criteria, such as:
- Expected Improvement (EI): Maximizing the expected improvement over the current best solution (see the sketch at the end of this section).
- Uncertainty Reduction: Sampling where the model’s predictions have high variance.
- Probability of Improvement (PI): Sampling where the probability of improving the current best solution is highest.
The iterative infill-points updating process ensures that the surrogate model becomes increasingly accurate and useful for optimization or decision-making tasks.
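As one concrete example of an infill criterion, the sketch below computes Expected Improvement for a minimization problem from a surrogate's predictive mean and standard deviation. The predictive summaries shown are toy stand-ins rather than output from a fitted model.

```python
# Expected Improvement (EI) for minimization, computed from a surrogate's
# predictive mean and standard deviation at candidate inputs.
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, sd, best, xi=0.0):
    """EI for minimization: E[max(best - Y - xi, 0)] with Y ~ N(mean, sd^2)."""
    sd = np.maximum(sd, 1e-12)                  # guard against zero predictive variance
    z = (best - mean - xi) / sd
    return (best - mean - xi) * norm.cdf(z) + sd * norm.pdf(z)

# toy predictive summaries at three candidate points (stand-ins for a GP's output)
mean = np.array([0.30, 0.10, 0.25])
sd = np.array([0.05, 0.20, 0.01])
best = 0.15                                     # best (smallest) observed value so far
print(expected_improvement(mean, sd, best))     # the largest EI marks the next infill point
```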