6  Kriging (Gaussian Process Regression)

6.1 DACE and RSM

Mathematical models implemented in computer codes are used to circumvent the need for expensive field data collection. These models are particularly useful when dealing with highly nonlinear response surfaces, high signal-to-noise ratios (which often involve deterministic evaluations), and a global scope. As a result, a new approach is required in comparison to Response Surface Methodology (RSM), which was discussed in Section 5.1.

With the improvement in computing power and simulation fidelity, researchers gain higher confidence and a better understanding of the dynamics in physical, biological, and social systems. However, the expansion of configuration spaces and increasing input dimensions necessitates more extensive designs. High-performance computing (HPC) allows for thousands of runs, whereas previously only tens were possible. This shift towards larger models and training data presents new computational challenges.

Research questions for DACE (Design and Analysis of Computer Experiments) include how to design computer experiments that make efficient use of computation and how to meta-model computer codes to save on simulation effort. The choice of surrogate model for computer codes significantly impacts the optimal experiment design, and the preferred model-design pairs can vary depending on the specific goal.

The combination of computer simulation, design, and modeling with field data from similar real-world experiments introduces a new category of computer model tuning problems. The ultimate goal is to automate these processes to the greatest extent possible, allowing for the deployment of HPC with minimal human intervention.

One of the remaining differences between RSM and DACE lies in how they handle noise. DACE employs replication, a technique that would not be used in a deterministic setting, to separate signal from noise. Traditional RSM is best suited for situations where a substantial proportion of the variability in the data is due to noise, and where the acquisition of data values can be severely limited. Consequently, RSM is better suited for a different class of problems, aligning with its intended purposes.

Two very good texts on computer experiments and surrogate modeling are Santner, Williams, and Notz (2003) and Forrester, Sóbester, and Keane (2008). The former is the canonical reference in the statistics literature and the latter is perhaps more popular in engineering.

Example 6.1 (Example: DACE and RSM) Imagine you are a chemical engineer tasked with optimizing a chemical process to maximize yield. You can control temperature and pressure, but repeated experiments show variability in yield due to inconsistencies in raw materials.

  • Using RSM: You would use RSM to design a series of experiments varying temperature and pressure. You would then fit a response surface (a mathematical model) to the data, helping you understand how changes in temperature and pressure affect yield. Using this model, you can identify optimal conditions for maximizing yield despite the noise.

  • Using DACE: If instead you use a computational model to simulate the chemical process and want to account for numerical noise or uncertainty in model parameters, you might use DACE. You would run simulations at different conditions, possibly repeating them to assess variability and build a surrogate model that accurately predicts yields, which can be optimized to find the best conditions.

6.1.1 Noise Handling in RSM and DACE

Noise in RSM: In experimental settings, noise often arises due to variability in experimental conditions, measurement errors, or other uncontrollable factors. This noise can significantly affect the response variable, \(Y\). Replication is a standard procedure for handling noise in RSM. In the context of computer experiments, noise might not be present in the traditional sense since simulations can be deterministic. However, variability can arise from uncertainty in input parameters or model inaccuracies. DACE predominantly utilizes advanced interpolation to construct accurate models of deterministic data, sometimes considering statistical noise modeling if needed.

6.2 Data Types and Precision in Python

The float16 data type in numpy represents a half-precision floating point number. It uses 16 bits of memory, which gives it a precision of about 3 decimal digits.

The float32 data type in numpy represents a single-precision floating point number. It uses 32 bits of memory, which gives it a precision of about 7 decimal digits. On the other hand, float64 represents a double-precision floating point number. It uses 64 bits of memory, which gives it a precision of about 15 decimal digits.

The reason float16 and float32 show fewer digits is because it has less precision due to using less memory. The bits of memory are used to store the sign, exponent, and fraction parts of the floating point number, and with fewer bits, you can represent fewer digits accurately.

Example 6.2 (16 versus 32 versus 64 bit)  

import numpy as np

# Define a number
num = 0.123456789123456789

num_float16 = np.float16(num)
num_float32 = np.float32(num)
num_float64 = np.float64(num)

print("float16: ", num_float16) 
print("float32: ", num_float32)
print("float64: ", num_float64)
float16:  0.1235
float32:  0.12345679
float64:  0.12345678912345678

6.3 Cholesky Decomposition and Positive Definite Matrices

We consider the definiteness of a matrix, before discussing the Cholesky decomposition.

Definition 6.1 (Positive Definite Matrix) A symmetric matrix \(A\) is positive definite if all its eigenvalues are positive.

Example 6.3 (Positive Definite Matrix) Given a symmetric matrix \(A = \begin{pmatrix} 9 & 4 \\ 4 & 9 \end{pmatrix}\), the eigenvalues of \(A\) are \(\lambda_1 = 13\) and \(\lambda_2 = 5\). Since both eigenvalues are positive, the matrix \(A\) is positive definite.

Definition 6.2 (Negative Definite, Positive Semidefinite, and Negative Semidefinite Matrices) Similarily, a symmetric matrix \(A\) is negative definite if all its eigenvalues are negative. It is positive semidefinite if all its eigenvalues are non-negative, and negative semidefinite if all its eigenvalues are non-positive.

The covariance matrix must be positive definite for a multivariate normal distribution for a couple of reasons:

  • Semidefinite vs Definite: A covariance matrix is always symmetric and positive semidefinite. However, for a multivariate normal distribution, it must be positive definite, not just semidefinite. This is because a positive semidefinite matrix can have zero eigenvalues, which would imply that some dimensions in the distribution have zero variance, collapsing the distribution in those dimensions. A positive definite matrix has all positive eigenvalues, ensuring that the distribution has positive variance in all dimensions.
  • Invertibility: The multivariate normal distribution’s probability density function involves the inverse of the covariance matrix. If the covariance matrix is not positive definite, it may not be invertible, and the density function would be undefined.

In summary, the covariance matrix being positive definite ensures that the multivariate normal distribution is well-defined and has positive variance in all dimensions.

The definiteness of a matrix can be checked by examining the eigenvalues of the matrix. If all eigenvalues are positive, the matrix is positive definite.

import numpy as np

def is_positive_definite(matrix):
    return np.all(np.linalg.eigvals(matrix) > 0)

matrix = np.array([[9, 4], [4, 9]])
print(is_positive_definite(matrix))  # Outputs: True
True

However, a more efficient way to check the definiteness of a matrix is through the Cholesky decomposition.

Definition 6.3 (Cholesky Decomposition) For a given symmetric positive-definite matrix \(A \in \mathbb{R}^{n \times n}\), there exists a unique lower triangular matrix \(L \in \mathbb{R}^{n \times n}\) with positive diagonal elements such that:

\[ A = L L^T. \]

Here, \(L^T\) denotes the transpose of \(L\).

Example 6.4 (Cholesky decomposition using numpy) linalg.cholesky computes the Cholesky decomposition of a matrix, i.e., it computes a lower triangular matrix \(L\) such that \(LL^T = A\). If the matrix is not positive definite, an error (LinAlgError) is raised.

import numpy as np

# Define a Hermitian, positive-definite matrix
A = np.array([[9, 4], [4, 9]]) 

# Compute the Cholesky decomposition
L = np.linalg.cholesky(A)

print("L = \n", L)
print("L*LT = \n", np.dot(L, L.T))
L = 
 [[3.         0.        ]
 [1.33333333 2.68741925]]
L*LT = 
 [[9. 4.]
 [4. 9.]]

Example 6.5 (Cholesky Decomposition) Given a symmetric positive-definite matrix \(A = \begin{pmatrix} 9 & 4 \\ 4 & 9 \end{pmatrix}\), the Cholesky decomposition computes the lower triangular matrix \(L\) such that \(A = L L^T\). The matrix \(L\) is computed as: \[ L = \begin{pmatrix} 3 & 0 \\ 4/3 & 2 \end{pmatrix}, \] so that \[ L L^T = \begin{pmatrix} 3 & 0 \\ 4/3 & \sqrt{65}/3 \end{pmatrix} \begin{pmatrix} 3 & 4/3 \\ 0 & \sqrt{65}/3 \end{pmatrix} = \begin{pmatrix} 9 & 4 \\ 4 & 9 \end{pmatrix} = A. \]

An efficient implementation of the definiteness-check based on Cholesky is already available in the numpy library. It provides the np.linalg.cholesky function to compute the Cholesky decomposition of a matrix. This more efficient numpy-approach can be used as follows:

import numpy as np

def is_pd(K):
    try:
        np.linalg.cholesky(K)
        return True
    except np.linalg.linalg.LinAlgError as err:
        if 'Matrix is not positive definite' in err.message:
            return False
        else:
            raise
matrix = np.array([[9, 4], [4, 9]])
print(is_pd(matrix))  # Outputs: True
True

6.4 Constructing a Surrogate

Note

This section is based on chapter 2 in Forrester, Sóbester, and Keane (2008).

Definition 6.4 (Black Box Problem) We are trying to learn a mapping that converts the vector \(\mathbf{x}\) into a scalar output \(y\), i.e., we are trying to learn a function \[ y = f(x). \] If function is hidden (“lives in a black box”), so that the physics of the problem is not known, the problem is called a black box problem.

This black box could take the form of either a physical or computer experiment, for example, a finite element code, which calculates the maximum stress (\(\sigma\)) for given product dimensions (\(\mathbf{x}\)).

Definition 6.5 (Generic Solution) The generic solution method is to collect the output values \(y^{(1)}\), \(y^{(2)}\), , \(y^{(n)}\) that result from a set of inputs \(\mathbf{x}^{(1)}\), \(\mathbf{x}^{(2)}\), , \(\mathbf{x}^{(n)}\) and find a best guess \(\hat{f}(\mathbf{x})\) for the black box mapping \(f\), based on these known observations.

6.4.1 Stage One: Preparing the Data and Choosing a Modelling Approach

The first step is the identification, through a small number of observations, of the inputs that have a significant impact on \(f\); that is the determination of the shortest design variable vector \(\mathbf{x} = \{x_1, x_2, \ldots, x_k\}^T\) that, by sweeping the ranges of all of its variables, can still elicit most of the behavior the black box is capable of. The ranges of the various design variables also have to be established at this stage.

The second step is to recruit \(n\) of these \(k\)-vectors into a list \[ \mathbf{X} = \{ \mathbf{x}^{(1)},\mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(n)} \}^T, \] where each \(\mathbf{x}^{(i)}\) is a \(k\)-vector. The corresponding responses are collected in a vector such that this represents the design space as thoroughly as possible.

In the surrogate modeling process, the number of samples \(n\) is often limited, as it is constrained by the computational cost (money and/or time) associated with obtaining each observation.

It is advisable to scale \(\mathbf{x}\) at this stage into the unit cube \([0, 1]^k\), a step that can simplify the subsequent mathematics and prevent multidimensional scaling issues.

We now focus on the attempt to learn \(f\) through data pairs \[ \{ (\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)}) \}. \]

This supervised learning process essentially involves searching across the space of possible functions \(\hat{f}\) that would replicate observations of \(f\). This space of functions is infinite. Any number of hypersurfaces could be drawn to pass through or near the known observations, accounting for experimental error. However, most of these would generalize poorly; they would be practically useless at predicting responses at new sites, which is the ultimate goal.

Example 6.6 (The Needle(s) in the Haystack Function) An extreme example is the ‘needle(s) in the haystack’ function:

\[ f(x) = \begin{cases} y^{(1)}, & \text{if } x = \mathbf{x}^{(1)} \\ y^{(2)}, & \text{if } x = \mathbf{x}^{(2)} \\ \vdots & \\ y^{(n)}, & \text{if } x = \mathbf{x}^{(n)} \\ 0, & \text{otherwise.} \end{cases} \]

While this predictor reproduces all training data, it seems counter-intuitive and unsettling to predict 0 everywhere else for most engineering functions. Although there is a small chance that the function genuinely resembles the equation above and we sampled exactly where the needles are, it is highly unlikely.

There are countless other configurations, perhaps less contrived, that still generalize poorly. This suggests a need for systematic means to filter out nonsensical predictors. In our approach, we embed the structure of \(f\) into the model selection algorithm and search over its parameters to fine-tune the approximation to observations. For instance, consider one of the simplest models, \[ f(x, \mathbf{w}) = \mathbf{w}^T\mathbf{x} + v. \tag{6.1}\] Learning \(f\) with this model implies that its structure—a hyperplane—is predetermined, and the fitting process involves finding the \(k + 1\) parameters (the slope vector \(\mathbf{w}\) and the intercept \(v\)) that best fit the data. This will be accomplished in Stage Two.

Complicating this further is the noise present in observed responses (we assume design vectors \(\mathbf{x}\) are not corrupted). Here, we focus on learning from such data, which sometimes risks overfitting.

Definition 6.6 (Overfitting) Overfitting occurs when the model becomes too flexible and captures not only the underlying trend but also the noise in the data.

In the surrogate modeling process, the second stage as described in Section 6.4.2, addresses this issue of complexity control by estimating the parameters of the fixed structure model. However, foresight is necessary even at the model type selection stage.

Model selection often involves physics-based considerations, where the modeling technique is chosen based on expected underlying responses.

Example 6.7 (Model Selection) Modeling stress in an elastically deformed solid due to small strains may justify using a simple linear approximation. Without insights into the physics, and if one fails to account for the simplicity of the data, a more complex and excessively flexible model may be incorrectly chosen. Although parameter estimation might still adjust the approximation to become linear, an opportunity to develop a simpler and robust model may be lost.

  • Simple linear (or polynomial) models, despite their lack of flexibility, have advantages like applicability in further symbolic computations.
  • Conversely, if we incorrectly assume a quadratic process when multiple peaks and troughs exist, the parameter estimation stage will not compensate for an unsuitable model choice. A quadratic model is too rigid to fit a multimodal function, regardless of parameter adjustments.

6.4.2 Stage Two: Parameter Estimation and Training

Assuming that Stage One helped identify the \(k\) critical design variables, acquire the learning data set, and select a generic model structure \(f(\mathbf{x}, \mathbf{w})\), the task now is to estimate parameters \(\mathbf{w}\) to ensure the model fits the data optimally. Among several estimation criteria, we will discuss two methods here.

Definition 6.7 (Maximum Likelihood Estimation) Given a set of parameters \(\mathbf{w}\), the model \(f(\mathbf{x}, \mathbf{w})\) allows computation of the probability of the data set \[ \{(\mathbf{x}^{(1)}, y^{(1)} \pm \epsilon), (\mathbf{x}^{(2)}, y^{(2)} \pm \epsilon), \ldots, (\mathbf{x}^{(n)}, y^{(n)} \pm \epsilon)\} \] resulting from \(f\) (where \(\epsilon\) is a small error margin around each data point).

Taking Equation 18.5 and assuming errors \(\epsilon\) are independently and normally distributed with standard deviation \(\sigma\), the probability of the data set is given by:

\[ P = \frac{1}{(2\pi \sigma^2)^{n/2}} \exp \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}, \mathbf{w}) \right)^2 \epsilon \right]. \]

Intuitively, this is equivalent to the likelihood of the parameters given the data. Accepting this intuitive relationship as a mathematical one aids in model parameter estimation. This is achieved by maximizing the likelihood or, more conveniently, minimizing the negative of its natural logarithm:

\[ \min_{\mathbf{w}} \sum_{i=1}^{n} \frac{[y^{(i)} - f(\mathbf{x}^{(i)}, \mathbf{w})]^2}{2\sigma^2} + \frac{n}{2} \ln \epsilon . \tag{6.2}\]

If we assume \(\sigma\) and \(\epsilon\) are constants, Equation 6.2 simplifies to the well-known least squares criterion:

\[ \min_{\mathbf{w}} \sum_{i=1}^{n} [y^{(i)} - f(\mathbf{x}^{(i)}, \mathbf{w})]^2 . \]

Cross-validation is another method used to estimate model performance.

Definition 6.8 (Cross-Validation) Cross-validation splits the data randomly into \(q\) roughly equal subsets, and then cyclically removing each subset and fitting the model to the remaining \(q - 1\) subsets. A loss function \(L\) is then computed to measure the error between the predictor and the withheld subset for each iteration, with contributions summed over all \(q\) iterations. More formally, if a mapping \(\theta: \{1, \ldots, n\} \to \{1, \ldots, q\}\) describes the allocation of the \(n\) training points to one of the \(q\) subsets and \(f^{(-\theta(i))}(\mathbf{x})\) is the predicted value by removing the subset \(\theta(i)\) (i.e., the subset where observation \(i\) belongs), the cross-validation measure, used as an estimate of prediction error, is:

\[ CV = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f^{(-\theta(i))}(\mathbf{x}^{(i)})) . \tag{6.3}\]

Introducing the squared error as the loss function and considering our generic model \(f\) still dependent on undetermined parameters, we write Equation 6.3 as:

\[ CV = \frac{1}{n} \sum_{i=1}^{n} [y^{(i)} - f^{(-\theta(i))}(\mathbf{x}^{(i)})]^2 . \tag{6.4}\]

The extent to which Equation 6.4 is an unbiased estimator of true risk depends on \(q\). It is shown that if \(q = n\), the leave-one-out cross-validation (LOOCV) measure is almost unbiased. However, LOOCV can have high variance because subsets are very similar. Hastie, Tibshirani, and Friedman (2017)) suggest using compromise values like \(q = 5\) or \(q = 10\). Using fewer subsets also reduces the computational cost of the cross-validation process, see also Arlot, Celisse, et al. (2010) and Kohavi (1995).

6.4.3 Stage Three: Model Testing

If there is a sufficient amount of observational data, a random subset should be set aside initially for model testing. Hastie, Tibshirani, and Friedman (2017) recommend setting aside approximately \(0.25n\) of \(\mathbf{x} \rightarrow y\) pairs for testing purposes. These observations must remain untouched during Stages One and Two, as their sole purpose is to evaluate the testing error—the difference between true and approximated function values at the test sites—once the model has been built. Interestingly, if the main goal is to construct an initial surrogate for seeding a global refinement criterion-based strategy (as discussed in Section 3.2 in Forrester, Sóbester, and Keane (2008)), the model testing phase might be skipped.

It is noted that, ideally, parameter estimation (Stage Two) should also rely on a separate subset. However, observational data is rarely abundant enough to afford this luxury (if the function is cheap to evaluate and evaluation sites are selectable, a surrogate model might not be necessary).

When data are available for model testing and the primary objective is a globally accurate model, using either a root mean square error (RMSE) metric or the correlation coefficient (\(r^2\)) is recommended. To test the model, a test data set of size \(n_t\) is used alongside predictions at the corresponding locations to calculate these metrics.

The RMSE is defined as follows:

Definition 6.9 (Root Mean Square Error (RMSE)) \[ \text{RMSE} = \sqrt{\frac{1}{n_t} \sum_{i=1}^{n_t} (y^{(i)} - \hat{y}^{(i)})^2}, \]

Ideally, the RMSE should be minimized, acknowledging its limitation by errors in the objective function \(f\) calculation. If the error level is known, like a standard deviation, the aim might be to achieve an RMSE within this value. Often, the target is an RMSE within a specific percentage of the observed data’s objective value range.

The squared correlation coefficient \(r\), see Equation 18.3, between the observed \(y\) and predicted \(\hat{y}\) values can be computed as:

\[ r^2 = \left( \frac{\text{cov}(y, \hat{y})}{\sqrt{\text{var}(y)\text{var}(\hat{y})}} \right)^2, \tag{6.5}\]

Equation 6.5 and can be expanded as:

\[ r^2 = \left( \frac{n_t \sum_{i=1}^{n_t} y^{(i)} \hat{y}^{(i)} - \sum_{i=1}^{n_t} y^{(i)} \sum_{i=1}^{n_t} \hat{y}^{(i)}}{ \sqrt{\left( n_t \sum_{i=1}^{n_t} (y^{(i)})^2 - \left(\sum_{i=1}^{n_t} y^{(i)}\right)^2 \right) \left( n_t \sum_{i=1}^{n_t} (\hat{y}^{(i)})^2 - \left(\sum_{i=1}^{n_t} \hat{y}^{(i)}\right)^2 \right)}} \right)^2. \]

The correlation coefficient \(r^2\) does not require scaling the data sets and only compares landscape shapes, not values. An \(r^2 > 0.8\) typically indicates a surrogate with good predictive capability.

The methods outlined provide quantitative assessments of model accuracy, yet visual evaluations can also be insightful. In general, the RMSE will not reach zero but will stabilize around a low value. At this point, the surrogate model is saturated with data, and further additions do not enhance the model globally (though local improvements can occur at newly added points if using an interpolating model).

Example 6.8 (The Tea and Sugar Analogy) Forrester, Sóbester, and Keane (2008) illustrates this saturation point using a comparision with a cup of tea and sugar. The tea represents the surrogate model, and sugar represents data. Initially, the tea is unsweetened, and adding sugar increases its sweetness. Eventually, a saturation point is reached where no more sugar dissolves, and the tea cannot get any sweeter. Similarly, a more flexible model, like one with additional parameters or employing interpolation rather than regression, can increase the saturation point—akin to making a hotter cup of tea for dissolving more sugar.

6.5 Sampling Plans

Definition 6.10 (Sampling Plan) In the context of computer experiments, the term “sampling plan” refers to the set of input values at which the computer code is evaluated.

The goal of a sampling plan is to efficiently explore the input space to understand the behavior of the computer code and build a surrogate model that accurately represents the code’s behavior. Traditionally, Response Surface Methodology (RSM) has been used to design sampling plans for computer experiments. These sampling plans are based on procedures that generate points by means of a rectangiular grid or a factorial design. However, more recently, Design and Analysis of Computer Experiments (DACE) has emerged as a more flexible and powerful approach for designing sampling plans. spotpython uses a class for generating space-filling designs using Latin Hypercube Sampling (LHS) and maximin distance criteria. It is based on scipy’s LatinHypercube class. The following example demonstrates how to generate a Latin Hypercube Sampling design using spotpython. The result is shown in Figure 6.1. As can seen in the figure, a Latin hypercube sample generates \(n\) points in \([0,1)^{d}\). Each univariate marginal distribution is stratified, placing exactly one point in \([j/n, (j+1)/n)\) for \(j=0,1,...,n-1\).

import matplotlib.pyplot as plt
import numpy as np
from spotpython.design.spacefilling import SpaceFilling
lhd = SpaceFilling(k=2, seed=123)
X = lhd.scipy_lhd(n=10, repeats=1, lower=np.array([0, 0]), upper=np.array([10, 10]))
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel('x1')
plt.ylabel('x2')
plt.grid()
Figure 6.1: Latin Hypercube Sampling design (sampling plan)

6.6 Kriging

6.6.1 The Kriging Idea in a Nutshell

Kriging can be applied to planned experiments, where the design is based on a sampling plan as shown in Figure 6.1, as well as to computer experiments, where the design is based on the computer code’s input space, as shown in Figure 6.2.

Figure 6.2: Eight measurements of an unknown function. No sampling plan was used.

Kriging can be explained using the concept of radial basis functions.

Radial basis functions (RBFs) are a class of functions used in various types of interpolation and approximation tasks. An RBF is a real-valued function whose value depends only on the distance from a certain point, called the center, usually in a multidimensional space. This distance is typically measured using the Euclidean distance.

6.6.2 Radial Basis Function (RBF)

Mathematically, a radial basis function \(\phi\) can be expressed as:

\[ \phi(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{c}\|), \] where \(\mathbf{x}\) is the input vector, \(\mathbf{c}\) is the center of the function, and \(\|\mathbf{x} - \mathbf{c}\|\) denotes the Euclidean distance between \(\mathbf{x}\) and \(\mathbf{c}\).

Common types of radial basis functions include:

  • Linear: \(\phi(r) = r\).
  • Cubic: \(\phi(r) = r^3\),
  • Gaussian: \(\phi(r) = e^{-(\epsilon r)^2}\),

where \(r\) is the distance and \(\epsilon\) is a parameter that determines the width of the Gaussian function.

Radial basis functions are widely used in interpolation for scattered data points in multiple dimensions, in machine learning models such as radial basis function networks, and in numerical solutions to partial differential equations due to their smooth approximation properties.

In general, we onsider observed data of an unknown function \(f\) at \(n\) points \(x_1, \ldots, x_n\). These measurements a considered as realizations of MVN random variables \(Y_1, \ldots, Y_n\) with mean \(\mu\) and covariance matrix \(\Sigma_n\) as shown in Figure 6.8, Figure 6.9 or Figure 6.10.

In Kriging, a more general covariance matrix (or equivalently, a correlation matrix \(\Psi\)) is used, see Equation 6.6. Using a maximum likelihood approach, we can estimate the unknown parameters \(\mu\) and \(\Sigma_n\) from the data so that the likelihood function is maximized.

Definition 6.11 (The Kriging Basis Functions) Kriging uses \(k\)-dimensional basis functions of the form \[ \psi(\vec{x}^{(i)}, \vec{x}^{(j)}) = \exp \left( - \sum_{l=1}^k \theta_l | x_{l}^{(i)} - x_{l}^{(j)} | ^{p_l} \right), \tag{6.6}\] where \(\vec{x}^{(i)}\) denotes the \(k\)-dim vector \(\vec{x}^{(i)}= (x_1^{(i)}, \ldots, x_k^{(i)})^T\).

6.6.3 The Kriging Model

Consider sample data \(\vec{X}\) and \(\vec{y}\) from \(n\) locations that are available in matrix form: \(\vec{X}\) is a \((n \times k)\) matrix, where \(k\) denotes the problem dimension and \(\vec{y}\) is a \((n\times 1)\) vector. We want to find an expression for a predicted values at a new point \(\vec{x}\), denoted as \(\hat{y}\).

We start with an abstract, not really intuitive concept: The observed responses \(\vec{y}\) are considered as if they are from a stochastic process, which will be denoted as \[ \begin{pmatrix} \vec{Y}(\vec{x}^{(1)})\\ \vdots\\ \vec{Y}(\vec{x}^{(n)})\\ \end{pmatrix}. \tag{6.7}\]

The set of random vectors from Equation 6.7 (also referred to as a random field) has a mean of \(\vec{1} \mu\), which is a \((n\times 1)\) vector. The random vectors are correlated with each other using the basis function expression from Equation 6.6: \[ \text{cor} \left(\vec{Y}(\vec{x}^{(i)}),\vec{Y}(\vec{x}^{(l)}) \right) = \exp\left\{ - \sum_{j=1}^k \theta_j |x_j^{(i)} - x_j^{(l)} |^{p_j}\right\}. \tag{6.8}\] Using Equation 6.8, we can compute the \((n \times n)\) correlation matrix \(\vec{\Psi}\) of the observed sample data as shown in Equation 6.9,

\[ \vec{\Psi} = \begin{pmatrix} \text{cor}\left( \vec{Y}(\vec{x}^{(1)}), \vec{Y}(\vec{x}^{(1)}) \right) & \ldots & \text{cor}\left( \vec{Y}(\vec{x}^{(1)}), \vec{Y}(\vec{x}^{(n)}) \right)\\ \vdots & \vdots & \vdots\\ \text{cor}\left( \vec{Y}(\vec{x}^{(n)}), \vec{Y}(\vec{x}^{(1)}) \right)& \ldots & \text{cor}\left( \vec{Y}(\vec{x}^{(n)}), \vec{Y}(\vec{x}^{(n)}) \right) \end{pmatrix}, \tag{6.9}\]

and a covariance matrix as shown in Equation 6.10,

\[ \text{Cov}(\mathbf{Y}, \mathbf{Y} ) = \sigma^2 \mathbf{\Psi}. \tag{6.10}\]

This assumed correlation between the sample data reflects our expectation that an engineering function will behave in a certain way and it will be smoothly and continuous.

We now have a set of \(n\) random variables (\(\mathbf{Y}\)) that are correlated with each other as described in the \((n \times n)\) correlation matrix \(\mathbf{\Psi}\), see Equation 6.9. The correlations depend on the absolute distances in dimension \(j\) between the \(i\)-th and the \(l\)-th sample point \(|x_j^{(i)} - x_j^{(l)}|\) and the corresponding parameters \(p_j\) and \(\theta_j\) for dimension \(j\). The correlation is intuitive, because when

  • two points move close together, then \(|x_j^{(i)} - x_j| \to 0\) and \(\exp \left(-|x_j^{(i)} - x_j|^{p_j} \right) \to 1\) (these points show very close correlation and \(Y(x_j^{(i)}) = Y(x_j)\)).
  • two points move far apart, then \(|x_j^{(i)} - x_j| \to \infty\) and \(\exp \left(-|x_j^{(i)} - x_j|^{p_j} \right) \to 0\) (these points show very low correlation).

Example 6.9 (Correlations for Different \(p_j\)) Three different correlations are shown in Figure 6.3: \(p_j= 0.1, 1, 2\). The smoothness parameter \(p_j\) affects the correlation:

  • With \(p_j=0.1\), there is basicaly no immediate correlation between the points and there is a near discontinuity between the points \(Y(\vec{x}_j^{(i)})\) and \(Y(\vec{x}_j)\).
  • With \(p_j=2\), the correlation is more smooth and we have a continuous gradient through \(x_j^{(i)} - x_j\).

Reducing \(p_j\) increases the rate at which the correlation initially drops with distance. This is shown in Figure 6.3.

Figure 6.3: Correlations with varying \(\theta\). \(\theta\) set to 1/10, 1, and 10.

Example 6.10 (Correlations for Different \(\theta\)) Figure 6.4 visualizes the correlation between two points \(Y(\vec{x}_j^{(i)})\) and \(Y(\vec{x}_j)\) for different values of \(\theta\). The parameter \(\theta\) can be seen as a width parameter:

  • low \(\theta_j\) means that all points will have a high correlation, with \(Y(x_j)\) being similar across the sample.
  • high \(\theta_j\) means that there is a significant difference between the \(Y(x_j)\)’s.
  • \(\theta_j\) is a measure of how active the function we are approximating is.
  • High \(\theta_j\) indicate important parameters, see Figure 6.4.
Figure 6.4: Correlations with varying \(\theta\). \(\theta\) set to 1/10, 1, and 10.

Considering the activity parameter \(\theta\) is useful in high-dimensional problems where it is difficult to visualize the design landscape and the effect of the variable is unknown. By examining the elements of the vector \(\vec{\theta}\), we can identify the most important variables and focus on them. This is a crucial step in the optimization process, as it allows us to reduce the dimensionality of the problem and focus on the most important variables.

Example 6.11 (Example: The Correlation Matrix (Detailed Computation)) Let \(n=4\) and \(k=3\). The sample plan is represented by the following matrix \(X\): \[ X = \begin{pmatrix} x_{11} & x_{12} & x_{13}\\ x_{21} & x_{22} & x_{23}\\ x_{31} & x_{32} & x_{33}\\ x_{41} & x_{42} & x_{43}\\ \end{pmatrix} \]

To compute the elements of the matrix \(\Psi\), the following \(k\) (one for each of the \(k\) dimensions) \((n,n)\)-matrices have to be computed: \[ D_1 = \begin{pmatrix} x_{11} - x_{11} & x_{11} - x_{21} & x_{11} -x_{31} & x_{11} - x_{41} \\ x_{21} - x_{11} & x_{21} - x_{21} & x_{21} -x_{31} & x_{21} - x_{41} \\ x_{31} - x_{11} & x_{31} - x_{21} & x_{31} -x_{31} & x_{31} - x_{41} \\ x_{41} - x_{11} & x_{41} - x_{21} & x_{41} -x_{31} & x_{41} - x_{41} \\ \end{pmatrix} \]

\[ D_2 = \begin{pmatrix} x_{12} - x_{12} & x_{12} - x_{22} & x_{12} -x_{32} & x_{12} - x_{42} \\ x_{22} - x_{12} & x_{22} - x_{22} & x_{22} -x_{32} & x_{22} - x_{42} \\ x_{32} - x_{12} & x_{32} - x_{22} & x_{32} -x_{32} & x_{32} - x_{42} \\ x_{42} - x_{12} & x_{42} - x_{22} & x_{42} -x_{32} & x_{42} - x_{42} \\ \end{pmatrix} \]

\[ D_3 = \begin{pmatrix} x_{13} - x_{13} & x_{13} - x_{23} & x_{13} -x_{33} & x_{13} - x_{43} \\ x_{23} - x_{13} & x_{23} - x_{23} & x_{23} -x_{33} & x_{23} - x_{43} \\ x_{33} - x_{13} & x_{33} - x_{23} & x_{33} -x_{33} & x_{33} - x_{43} \\ x_{43} - x_{13} & x_{43} - x_{23} & x_{43} -x_{33} & x_{43} - x_{43} \\\end{pmatrix} \]

Since the matrices are symmetric and the main diagonals are zero, it is sufficient to compute the following matrices: \[ D_1 = \begin{pmatrix} 0 & x_{11} - x_{21} & x_{11} -x_{31} & x_{11} - x_{41} \\ 0 & 0 & x_{21} -x_{31} & x_{21} - x_{41} \\ 0 & 0 & 0 & x_{31} - x_{41} \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \] \[ D_2 = \begin{pmatrix} 0 & x_{12} - x_{22} & x_{12} -x_{32} & x_{12} - x_{42} \\ 0 & 0 & x_{22} -x_{32} & x_{22} - x_{42} \\ 0 & 0 & 0 & x_{32} - x_{42} \\ 0 & 0 & 0 & 0 \\ \end{pmatrix} \]

\[ D_3 = \begin{pmatrix} 0 & x_{13} - x_{23} & x_{13} -x_{33} & x_{13} - x_{43} \\ 0 & 0 & x_{23} -x_{33} & x_{23} - x_{43} \\ 0 & 0 & 0 & x_{33} - x_{43} \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

We will consider \(p_l=2\). The differences will be squared and multiplied by \(\theta_i\), i.e.:

\[ D_1 = \theta_1 \begin{pmatrix} 0 & (x_{11} - x_{21})^2 & (x_{11} -x_{31})^2 & (x_{11} - x_{41})^2 \\ 0 & 0 & (x_{21} -x_{31})^2 & (x_{21} - x_{41})^2 \\ 0 & 0 & 0 & (x_{31} - x_{41})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

\[ D_2 = \theta_2 \begin{pmatrix} 0 & (x_{12} - x_{22})^2 & (x_{12} -x_{32})^2 & (x_{12} - x_{42})^2 \\ 0 & 0 & (x_{22} -x_{32})^2 & (x_{22} - x_{42})^2 \\ 0 & 0 & 0 & (x_{32} - x_{42})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

\[ D_3 = \theta_3 \begin{pmatrix} 0 & (x_{13} - x_{23})^2 & (x_{13} -x_{33})^2 & (x_{13} - x_{43})^2 \\ 0 & 0 & (x_{23} -x_{33})^2 & (x_{23} - x_{43})^2 \\ 0 & 0 & 0 & (x_{33} - x_{43})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

The sum of the three matrices \(D=D_1+ D_2 + D_3\) will be calculated next:

\[ \begin{pmatrix} 0 & \theta_1 (x_{11} - x_{21})^2 + \theta_2 (x_{12} - x_{22})^2 + \theta_3 (x_{13} - x_{23})^2 & \theta_1 (x_{11} -x_{31})^2 + \theta_2 (x_{12} -x_{32})^2 + \theta_3 (x_{13} -x_{33})^2 & \theta_1 (x_{11} - x_{41})^2 + \theta_2 (x_{12} - x_{42})^2 + \theta_3 (x_{13} - x_{43})^2 \\ 0 & 0 & \theta_1 (x_{21} -x_{31})^2 + \theta_2 (x_{22} -x_{32})^2 + \theta_3 (x_{23} -x_{33})^2 & \theta_1 x_{21} - x_{41})^2 + \theta_2 (x_{22} - x_{42})^2 + \theta_3 (x_{23} - x_{43})^2 \\ 0 & 0 & 0 & \theta_1 (x_{31} - x_{41})^2 + \theta_2 (x_{32} - x_{42})^2 + \theta_3 (x_{33} - x_{43})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

Finally, \[ \Psi = \exp(-D)\] is computed.

Next, we will demonstrate how this computation can be implemented in Python. We will consider four points in three dimensions and compute the correlation matrix \(\Psi\) using the basis function from Equation 6.6. These points are placed at the origin, at the unit vectors, and at the points \((100, 100, 100)\) and \((101, 100, 100)\). So, they form two clusters: one at the origin and one at \((100, 100, 100)\).

from numpy import (array, zeros, power, ones, exp, multiply,
                    eye, linspace, spacing, sqrt, arange,
                    append, ravel)
from numpy.linalg import cholesky, solve
theta = np.array([1,2,3])
X = np.array([ [1,0,0], [0,1,0], [100, 100, 100], [101, 100, 100]])
X
array([[  1,   0,   0],
       [  0,   1,   0],
       [100, 100, 100],
       [101, 100, 100]])
def build_Psi(X, theta):
    n = X.shape[0]
    k = X.shape[1]
    D = zeros((k, n, n))
    for l in range(k):
        for i in range(n):
            for j in range(i, n):
                D[l, i, j] = theta[l]*(X[i,l] - X[j,l])**2
    D = sum(D)
    D = D + D.T
    return exp(-D)  
Psi = build_Psi(X, theta)
Psi
array([[1.        , 0.04978707, 0.        , 0.        ],
       [0.04978707, 1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.36787944],
       [0.        , 0.        , 0.36787944, 1.        ]])

Example 6.12 (Example: The Correlation Matrix (Using Existing Functions)) The same result as computed in the previous example can be obtained with existing python functions, e.g., from the package scipy.

from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist

def build_Psi(X, theta, eps=sqrt(spacing(1))):
    return exp(- squareform(pdist(X,
                            metric='sqeuclidean',
                            out=None,
                            w=theta))) +  multiply(eye(X.shape[0]),
                                                   eps)

Psi = build_Psi(X, theta, eps=.0)
Psi
array([[1.        , 0.04978707, 0.        , 0.        ],
       [0.04978707, 1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.36787944],
       [0.        , 0.        , 0.36787944, 1.        ]])

6.6.4 The Condition Number

A small value, eps, can be passed to the function build_Psi to improve the condition number. For example, eps=sqrt(spacing(1)) can be used. The numpy function spacing() returns the distance between a number and its nearest adjacent number.

The condition number of a matrix is a measure of its sensitivity to small changes in its elements. It is used to estimate how much the output of a function will change if the input is slightly altered.

A matrix with a low condition number is well-conditioned, which means its behavior is relatively stable, while a matrix with a high condition number is ill-conditioned, meaning its behavior is unstable with respect to numerical precision.

import numpy as np

# Define a well-conditioned matrix (low condition number)
A = np.array([[1, 0.1], [0.1, 1]])
print("Condition number of A: ", np.linalg.cond(A))

# Define an ill-conditioned matrix (high condition number)
B = np.array([[1, 0.99999999], [0.99999999, 1]])
print("Condition number of B: ", np.linalg.cond(B))
Condition number of A:  1.2222222222222225
Condition number of B:  200000000.57495335
np.linalg.cond(Psi)
np.float64(2.163953413738652)

6.6.5 MLE to estimate \(\theta\) and \(p\)

Until now, the observed data \(\vec{y}\) was not used. We know what the correlations mean, but how do we estimate the values of \(\theta_j\) and where does our observed data \(y\) come in? To estimate the values of \(\vec{\theta}\) and \(\vec{p}\), they are chosen to maximize the likelihood of \(\vec{y}\), which can be expressed in terms of the sample data \[L\left(\vec{Y}(\vec{x}^{(1)}), \ldots, \vec{Y}(\vec{x}^{(n)}) | \mu, \sigma \right) = \frac{1}{(2\pi \sigma)^{n/2} |\vec{\Psi}|^{1/2}} \exp\left\{ \frac{-(\vec{y} - \vec{1}\mu)^T \vec{\Psi}^{-1}(\vec{y} - \vec{1}\mu) }{2 \sigma^2}\right\},\] and formulated as the log-likelihood: \[ \ln(L) = - \frac{n}{2} \ln(2\pi \sigma) - \frac{1}{2} \ln |\vec{\Psi}| \frac{-(\vec{y} - \vec{1}\mu)^T \vec{\Psi}^{-1}(\vec{y} - \vec{1}\mu) }{2 \sigma^2}. \tag{6.11}\]

Optimization of the log-likelihood by taking derivatives with respect to \(\mu\) and \(\sigma\) results in \[ \hat{\mu} = \frac{\vec{1}^T \vec{\Psi}^{-1} \vec{y}^T}{\vec{1}^T \vec{\Psi}^{-1} \vec{1}^T} \tag{6.12}\] and \[ \hat{\sigma}^2 = \frac{(\vec{y} - \vec{1}\mu)^T \vec{\Psi}^{-1}(\vec{y} - \vec{1}\mu)}{n}. \tag{6.13}\]

Combining the equations, i.e., substituting Equation 6.12 and Equation 6.13 inti Equation 6.11 leads to the concentrated log-likelihood function: \[ \ln(L) = - \frac{n}{2} \ln(\hat{\sigma}) - \frac{1}{2} \ln |\vec{\Psi}|. \tag{6.14}\]

Remark 6.1 (The Concentrated Log-Likelihood).

  • The first term in Equation 6.14 requires information about the measured point (observations) \(y_i\).
  • To maximize \(\ln(L)\), optimal values of \(\vec{\theta}\) and \(\vec{p}\) are determined numerically, because the function (Equation 6.14) is not differentiable.

The concentrated log-likelihood function is very quick to compute. We do not need a statistical model, because we are only interested in the maximum likelihood estimate (MLE) of \(\theta\) and \(p\). Optimizers such as Nelder-Mead, Conjugate Gradient, or Simulated Annealing can be used to determine optimal values for \(\theta\) and \(p\). After the optimization, the correlation matrix \(\Psi\) is build with the optimized \(\theta\) and \(p\) values. This is best (most likely) Kriging model for the given data \(y\).

Observing Figure 6.4, there’s significant change between \(\theta = 0.1\) and \(\theta = 1\), just as there is between \(\theta = 1\) and \(\theta = 10\). Hence, it is sensible to search for \(\theta\) on a logarithmic scale. Suitable search bounds typically range from \(10^{-3}\) to \(10^2\), although this is not a stringent requirement. Importantly, the scaling of the observed data does not affect the values of \(\hat{\theta}\), but the scaling of the design space does. Therefore, it is advisable to consistently scale variable ranges between zero and one to ensure consistency in the degree of activity \(\hat{\theta}_j\) represents across different problems.

Optimizing \(\hat{\phi}\) can enhance prediction accuracy across various problems.

6.6.6 Implementing an MLE of the Model Parameters

The matrix algebra necessary for calculating the likelihood is the most computationally intensive aspect of the Kriging process. It’s crucial to ensure that the code implementation is as efficient as possible.

Given that \(\Psi\) (our correlation matrix) is symmetric, only half of the matrix needs to be computed before adding it to its transpose. When calculating the log-likelihood, several matrix inversions are required. The fastest approach is to conduct one Cholesky factorization and then apply backward and forward substitution for each inverse.

The Cholesky factorization is applicable only to positive-definite matrices, which \(\Psi\) generally is. However, if \(\Psi\) becomes nearly singular, such as when the \(\mathbf{x}^{(i)}\)’s are densely packed, the Cholesky factorization might fail. In these cases, one could employ an LU-decomposition, though the result might be unreliable. When \(\Psi\) is near singular, the best course of action is to either use regression techniques or, as we do here, assign a poor likelihood value to parameters generating the near singular matrix, thus diverting the MLE search towards better-conditioned \(\Psi\) matrices.

Another consideration in calculating the concentrated log-likelihood is that \(\det(\Psi) \rightarrow 0\) for poorly conditioned matrices, so it is advisable to use twice the sum of the logarithms of the diagonal of the Cholesky factorization when calculating \(\ln(\lvert\Psi\rvert)\) in Equation 6.14.

6.6.7 Kriging Prediction

We will use the Kriging correlation \(\Psi\) to predict new values based on the observed data. The matrix algebra involved for calculating the likelihood is the most computationally intensive part of the Kriging process. Care must be taken that the computer code is as efficient as possible.

Basic elements of the Kriging based surrogate optimization such as interpolation, expected improvement, and regression are presented. The presentation follows the approach described in Forrester, Sóbester, and Keane (2008) and Bartz et al. (2022).

Main idea for prediction is that the new \(Y(\vec{x})\) should be consistent with the old sample data \(X\). For a new prediction \(\hat{y}\) at \(\vec{x}\), the value of \(\hat{y}\) is chosen so that it maximizes the likelihood of the sample data \(\vec{X}\) and the prediction, given the (optimized) correlation parameter \(\vec{\theta}\) and \(\vec{p}\) from above. The observed data \(\vec{y}\) is augmented with the new prediction \(\hat{y}\) which results in the augmented vector \(\vec{\tilde{y}} = ( \vec{y}^T, \hat{y})^T\). A vector of correlations between the observed data and the new prediction is defined as

\[ \vec{\psi} = \begin{pmatrix} \text{cor}\left( \vec{Y}(\vec{x}^{(1)}), \vec{Y}(\vec{x}) \right) \\ \vdots \\ \text{cor}\left( \vec{Y}(\vec{x}^{(n)}), \vec{Y}(\vec{x}) \right) \end{pmatrix} = \begin{pmatrix} \vec{\psi}^{(1)}\\ \vdots\\ \vec{\psi}^{(n)} \end{pmatrix}. \]

Definition 6.12 (The Augmented Correlation Matrix) The augmented correlation matrix is constructed as \[ \tilde{\vec{\Psi}} = \begin{pmatrix} \vec{\Psi} & \vec{\psi} \\ \vec{\psi}^T & 1 \end{pmatrix}. \]

The log-likelihood of the augmented data is \[ \ln(L) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\hat{\sigma}^2) - \frac{1}{2} \ln |\vec{\hat{\Psi}}| - \frac{(\vec{\tilde{y}} - \vec{1}\hat{\mu})^T \vec{\tilde{\Psi}}^{-1}(\vec{\tilde{y}} - \vec{1}\hat{\mu})}{2 \hat{\sigma}^2}, \tag{6.15}\]

where \(\vec{1}\) is a vector of ones and \(\hat{\mu}\) and \(\hat{\sigma}^2\) are the MLEs from Equation 6.12 and Equation 6.13. Only the last term in Equation 6.15 depends on \(\hat{y}\), so we need only consider this term in the maximization. Details cen be found in Forrester, Sóbester, and Keane (2008). Finally, the MLE for \(\hat{y}\) can be calculated as \[ \hat{y}(\vec{x}) = \hat{\mu} + \vec{\psi}^T \vec{\tilde{\Psi}}^{-1} (\vec{y} - \vec{1}\hat{\mu}). \tag{6.16}\]

Equation 6.16 reveals two important properties of the Kriging predictor:

  • Basis functions: The basis function impacts the vector \(\vec{\psi}\), which contains the \(n\) correlations between the new point \(\vec{x}\) and the observed locations. Values from the \(n\) basis functions are added to a mean base term \(\mu\) with weightings \[ \vec{w} = \vec{\tilde{\Psi}}^{(-1)} (\vec{y} - \vec{1}\hat{\mu}). \]
  • Interpolation: The predictions interpolate the sample data. When calculating the prediction at the \(i\)th sample point, \(\vec{x}^{(i)}\), the \(i\)th column of \(\vec{\Psi}^{-1}\) is \(\vec{\psi}\), and \(\vec{\psi} \vec{\Psi}^{-1}\) is the \(i\)th unit vector. Hence,

\[ \hat{y}(\vec{x}^{(i)}) = y^{(i)}. \]

6.7 Kriging Example: Sinusoid Function

Toy example in 1d where the response is a simple sinusoid measured at eight equally spaced \(x\)-locations in the span of a single period of oscillation.

6.7.1 Calculating the Correlation Matrix \(\Psi\)

The correlation matrix \(\Psi\) is based on the pairwise squared distances between the input locations. Here we will use \(n=8\) sample locations and \(\theta\) is set to 1.0.

n = 8
X = np.linspace(0, 2*np.pi, n, endpoint=False).reshape(-1,1)
print(np.round(X, 2))
[[0.  ]
 [0.79]
 [1.57]
 [2.36]
 [3.14]
 [3.93]
 [4.71]
 [5.5 ]]

Evaluate at sample points

y = np.sin(X)
print(np.round(y, 2))
[[ 0.  ]
 [ 0.71]
 [ 1.  ]
 [ 0.71]
 [ 0.  ]
 [-0.71]
 [-1.  ]
 [-0.71]]

We have the data points shown in Table 6.1.

Table 6.1: Data points for the sinusoid function
\(x\) \(y\)
0.0 0.0
0.79 0.71
1.57 1.0
2.36 0.71
3.14 0.0
3.93 -0.71
4.71 -1.0
5.5 -0.71

The data points are visualized in Figure 6.5.

import matplotlib.pyplot as plt
plt.plot(X, y, "bo")
plt.title(f"Sin(x) evaluated at {n} points")
plt.grid()
plt.show()
Figure 6.5: Sin(x) evaluated at 8 points.

6.7.2 Computing the \(\Psi\) Matrix

# theta should be an array (of one value, for the moment, will be changed later)
theta = np.array([1.0])
Psi = build_Psi(X, theta)
print(np.round(Psi, 2))
[[1.   0.54 0.08 0.   0.   0.   0.   0.  ]
 [0.54 1.   0.54 0.08 0.   0.   0.   0.  ]
 [0.08 0.54 1.   0.54 0.08 0.   0.   0.  ]
 [0.   0.08 0.54 1.   0.54 0.08 0.   0.  ]
 [0.   0.   0.08 0.54 1.   0.54 0.08 0.  ]
 [0.   0.   0.   0.08 0.54 1.   0.54 0.08]
 [0.   0.   0.   0.   0.08 0.54 1.   0.54]
 [0.   0.   0.   0.   0.   0.08 0.54 1.  ]]

6.7.3 Selecting the New Locations

We would like to predict at \(m = 100\) new locations (or testign locations) in the interval \([0, 2\pi]\). The new locations are stored in the variable x.

m = 100
x = np.linspace(0, 2*np.pi, m, endpoint=False).reshape(-1,1)

6.7.4 Computing the \(\psi\) Vector

Distances between testing locations \(x\) and training data locations \(X\).

from scipy.spatial.distance import cdist

def build_psi(X, x, theta, eps=sqrt(spacing(1))):
    n = X.shape[0]
    k = X.shape[1]
    m = x.shape[0]
    psi = zeros((n, m))
    theta = theta * ones(k)
    D = zeros((n, m))
    D = cdist(x.reshape(-1, k),
              X.reshape(-1, k),
              metric='sqeuclidean',
              out=None,
              w=theta)    
    psi = exp(-D)
    # return psi transpose to be consistent with the literature
    print(f"Dimensions of psi: {psi.T.shape}")
    return(psi.T)

psi = build_psi(X, x, theta)
Dimensions of psi: (8, 100)

6.7.5 Predicting at New Locations

Computation of the predictive equations.

U = cholesky(Psi).T
one = np.ones(n).reshape(-1,1)
mu = (one.T.dot(solve(U, solve(U.T, y)))) / one.T.dot(solve(U, solve(U.T, one)))
f = mu * ones(m).reshape(-1,1) + psi.T.dot(solve(U, solve(U.T, y - one * mu)))
print(f"Dimensions of f: {f.shape}")
Dimensions of f: (100, 1)

To compute \(f\), Equation 6.16 is used.

6.7.6 Visualization

import matplotlib.pyplot as plt
plt.plot(x, f, color = "orange", label="Fitted")
plt.plot(x, np.sin(x), color = "grey", label="Original")
plt.plot(X, y, "bo", label="Measurements")
plt.title("Kriging prediction of sin(x) with {} points.\n theta: {}".format(n, theta[0]))
plt.legend(loc='upper right')
plt.show()

6.8 Cholesky Decomposition

6.8.1 Example of Cholesky Decomposition

We consider dimension \(k=1\) and \(n=2\) sample points. The sample points are located at \(x_1=1\) and \(x_2=5\). The response values are \(y_1=2\) and \(y_2=10\). The correlation parameter is \(\theta=1\) and \(p\) is set to \(1\). Using Equation 6.6, we can compute the correlation matrix \(\Psi\):

\[ \Psi = \begin{pmatrix} 1 & e^{-1}\\ e^{-1} & 1 \end{pmatrix}. \]

To determine MLE as in Equation 6.16, we need to compute \(\Psi^{-1}\):

\[ \Psi^{-1} = \frac{e}{e^2 -1} \begin{pmatrix} e & -1\\ -1 & e \end{pmatrix}. \]

Cholesky-decomposition of \(\Psi\) is recommended to compute \(\Psi^{-1}\). Cholesky decomposition is a decomposition of a positive definite symmetric matrix into the product of a lower triangular matrix \(L\), a diagonal matrix \(D\) and the transpose of \(L\), which is denoted as \(L^T\). Consider the following example:

\[ LDL^T= \begin{pmatrix} 1 & 0 \\ l_{21} & 1 \end{pmatrix} \begin{pmatrix} d_{11} & 0 \\ 0 & d_{22} \end{pmatrix} \begin{pmatrix} 1 & l_{21} \\ 0 & 1 \end{pmatrix}= \]

\[ \begin{pmatrix} d_{11} & 0 \\ d_{11} l_{21} & d_{22} \end{pmatrix} \begin{pmatrix} 1 & l_{21} \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} d_{11} & d_{11} l_{21} \\ d_{11} l_{21} & d_{11} l_{21}^2 + d_{22} \end{pmatrix}. \tag{6.17}\]

Using Equation 6.17, we can compute the Cholesky decomposition of \(\Psi\):

  1. \(d_{11} = 1\),
  2. \(l_{21}d_{11} = e^{-1} \Rightarrow l_{21} = e^{-1}\), and
  3. \(d_{11} l_{21}^2 + d_{22} = 1 \Rightarrow d_{22} = 1 - e^{-2}\).

The Cholesky decomposition of \(\Psi\) is \[ \Psi = \begin{pmatrix} 1 & 0\\ e^{-1} & 1\\ \end{pmatrix} \begin{pmatrix} 1 & 0\\ 0 & 1 - e^{-2}\\ \end{pmatrix} \begin{pmatrix} 1 & e^{-1}\\ 0 & 1\\ \end{pmatrix} = LDL^T\]

Some programs use \(U\) instead of \(L\). The Cholesky decomposition of \(\Psi\) is \[ \Psi = LDL^T = U^TDU. \]

Using \[ \sqrt{D} =\begin{pmatrix} 1 & 0\\ 0 & \sqrt{1 - e^{-2}}\\ \end{pmatrix}, \] we can write the Cholesky decomposition of \(\Psi\) without a diagonal matrix \(D\) as \[ \Psi = \begin{pmatrix} 1 & 0\\ e^{-1} & \sqrt{1 - e^{-2}}\\ \end{pmatrix} \begin{pmatrix} 1 & e^{-1}\\ 0 & \sqrt{1 - e^{-2}}\\ \end{pmatrix} = U^TU. \]

6.8.2 Inverse Matrix Using Cholesky Decomposition

To compute the inverse of a matrix using the Cholesky decomposition, you can follow these steps:

  1. Decompose the matrix \(A\) into \(L\) and \(L^T\), where \(L\) is a lower triangular matrix and \(L^T\) is the transpose of \(L\).
  2. Compute \(L^{-1}\), the inverse of \(L\).
  3. The inverse of \(A\) is then \((L^{-1})^T L^-1\).

Please note that this method only applies to symmetric, positive-definite matrices.

The inverse of the matrix \(\Psi\) from above is:

\[ \Psi^{-1} = \frac{e}{e^2 -1} \begin{pmatrix} e & -1\\ -1 & e \end{pmatrix}. \]

Here’s an example of how to compute the inverse of a matrix using Cholesky decomposition in Python:

import numpy as np
from scipy.linalg import cholesky, inv
E = np.exp(1)

# Psi is a symmetric, positive-definite matrix 
Psi = np.array([[1, 1/E], [1/E, 1]])
L = cholesky(Psi, lower=True)
L_inv = inv(L)
# The inverse of A is (L^-1)^T * L^-1
Psi_inv = np.dot(L_inv.T, L_inv)

print("Psi:\n", Psi)
print("Psi Inverse:\n", Psi_inv)
Psi:
 [[1.         0.36787944]
 [0.36787944 1.        ]]
Psi Inverse:
 [[ 1.15651764 -0.42545906]
 [-0.42545906  1.15651764]]

6.9 Gaussian Processes—Some Background Information

The concept of GP (Gaussian Process) regression can be understood as a simple extension of linear modeling. It is worth noting that this approach goes by various names and acronyms, including “kriging,” a term derived from geostatistics, as introduced by Matheron in 1963. Additionally, it is referred to as Gaussian spatial modeling or a Gaussian stochastic process, and machine learning (ML) researchers often use the term Gaussian process regression (GPR). In all of these instances, the central focus is on regression. This involves training on both inputs and outputs, with the ultimate objective of making predictions and quantifying uncertainty (referred to as uncertainty quantification or UQ).

However, it’s important to emphasize that GPs are not a universal solution for every problem. Specialized tools may outperform GPs in specific, non-generic contexts, and GPs have their own set of limitations that need to be considered.

6.9.1 Gaussian Process Prior

In the context of GP, any finite collection of realizations, which is represented by \(n\) observations, is modeled as having a multivariate normal (MVN) distribution. The characteristics of these realizations can be fully described by two key parameters:

  1. Their mean, denoted as an \(n\)-vector \(\mu\).
  2. The covariance matrix, denoted as an \(n \times n\) matrix \(\Sigma\). This covariance matrix encapsulates the relationships and variability between the individual realizations within the collection.

6.9.2 Covariance Function

The covariance function is defined by inverse exponentiated squared Euclidean distance: \[ \Sigma(\vec{x}, \vec{x}') = \exp\{ - || \vec{x} - \vec{x}'||^2 \}, \] where \(\vec{x}\) and \(\vec{x}'\) are two points in the \(k\)-dimensional input space and \(\| \cdot \|\) denotes the Euclidean distance, i.e., \[ || \vec{x} - \vec{x}'||^2 = \sum_{i=1}^k (x_i - x_i')^2. \]

An 1-d example is shown in Figure 6.6.

Figure 6.6: One-dim inverse exponentiated squared Euclidean distance

The covariance function is also referred to as the kernel function. The Gaussian kernel uses an additional parameter, \(\sigma^2\), to control the rate of decay. This parameter is referred to as the length scale or the characteristic length scale. The covariance function is then defined as

\[ \Sigma(\vec{x}, \vec{x}') = \exp\{ - || \vec{x} - \vec{x}'||^2 / (2 \sigma^2) \}. \tag{6.18}\]

The covariance decays exponentially fast as \(\vec{x}\) and \(\vec{x}'\) become farther apart. Observe that

\[ \Sigma(\vec{x},\vec{x}) = 1 \] and

\[ \Sigma(\vec{x}, \vec{x}') < 1 \] for \(\vec{x} \neq \vec{x}'\). The function \(\Sigma(\vec{x},\vec{x}')\) must be positive definite.

Remark 6.2 (Kriging and Gaussian Basis Functions). The Kriging basis function (Equation 6.6) is related to the 1-dim Gaussian basis function (Equation 6.18), which is defined as \[ \Sigma(\vec{x}^{(i)}, \vec{x}^{(j)}) = \exp\{ - || \vec{x}^{(i)} - \vec{x}^{(j)}||^2 / (2\sigma^2) \}. \tag{6.19}\]

There are some differences between Gaussian basis functions and Kriging basis functions:

  • Where the Gaussian basis function has \(1/(2\sigma^2)\), the Kriging basis has a vector \(\theta = [\theta_1, \theta_2, \ldots, \theta_k]^T\).
  • The \(\theta\) vector allows the width of the basis function to vary from dimension to dimension.
  • In the Gaussian basis function, the exponent is fixed at 2, Kriging allows this exponent \(p_l\) to vary (typically from 1 to 2).

6.9.2.1 Positive Definiteness

Positive definiteness in the context of the covariance matrix \(\Sigma_n\) is a fundamental requirement. It is determined by evaluating \(\Sigma(x_i, x_j)\) at pairs of \(n\) \(\vec{x}\)-values, denoted as \(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\). The condition for positive definiteness is that for all \(\vec{x}\) vectors that are not equal to zero, the expression \(\vec{x}^\top \Sigma_n \vec{x}\) must be greater than zero. This property is essential when intending to use \(\Sigma_n\) as a covariance matrix in multivariate normal (MVN) analysis. It is analogous to the requirement in univariate Gaussian distributions where the variance parameter, \(\sigma^2\), must be positive.

Gaussian Processes (GPs) can be effectively utilized to generate random data that follows a smooth functional relationship. The process involves the following steps:

  1. Select a set of \(\vec{x}\)-values, denoted as \(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\).
  2. Define the covariance matrix \(\Sigma_n\) by evaluating \(\Sigma_n^{ij} = \Sigma(\vec{x}_i, \vec{x}_j)\) for \(i, j = 1, 2, \ldots, n\).
  3. Generate an \(n\)-variate realization \(Y\) that follows a multivariate normal distribution with a mean of zero and a covariance matrix \(\Sigma_n\), expressed as \(Y \sim \mathcal{N}_n(0, \Sigma_n)\).
  4. Visualize the result by plotting it in the \(x\)-\(y\) plane.

6.9.3 Construction of the Covariance Matrix

Here is an one-dimensional example. The process begins by creating an input grid using \(\vec{x}\)-values. This grid consists of 100 elements, providing the basis for further analysis and visualization.

import numpy as np
n = 100
X = np.linspace(0, 10, n, endpoint=False).reshape(-1,1)

In the context of this discussion, the construction of the covariance matrix, denoted as \(\Sigma_n\), relies on the concept of inverse exponentiated squared Euclidean distances. However, it’s important to note that a modification is introduced later in the process. Specifically, the diagonal of the covariance matrix is augmented with a small value, represented as “eps” or \(\epsilon\).

The reason for this augmentation is that while inverse exponentiated distances theoretically ensure the covariance matrix’s positive definiteness, in practical applications, the matrix can sometimes become numerically ill-conditioned. By adding a small value to the diagonal, such as \(\epsilon\), this ill-conditioning issue is mitigated. In this context, \(\epsilon\) is often referred to as “jitter.”

import numpy as np
from numpy import array, zeros, power, ones, exp, multiply, eye, linspace, spacing, sqrt, arange, append, ravel
from numpy.linalg import cholesky, solve
from numpy.random import multivariate_normal
def build_Sigma(X, sigma2):
    n = X.shape[0]
    k = X.shape[1]
    D = zeros((k, n, n))
    for l in range(k):
        for i in range(n):
            for j in range(i, n):
                D[l, i, j] = 1/(2*sigma2[l])*(X[i,l] - X[j,l])**2
    D = sum(D)
    D = D + D.T
    return exp(-D)  
sigma2 = np.array([1.0])
Sigma = build_Sigma(X, sigma2)
np.round(Sigma[:3,:], 3)
array([[1.   , 0.995, 0.98 , 0.956, 0.923, 0.882, 0.835, 0.783, 0.726,
        0.667, 0.607, 0.546, 0.487, 0.43 , 0.375, 0.325, 0.278, 0.236,
        0.198, 0.164, 0.135, 0.11 , 0.089, 0.071, 0.056, 0.044, 0.034,
        0.026, 0.02 , 0.015, 0.011, 0.008, 0.006, 0.004, 0.003, 0.002,
        0.002, 0.001, 0.001, 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   ],
       [0.995, 1.   , 0.995, 0.98 , 0.956, 0.923, 0.882, 0.835, 0.783,
        0.726, 0.667, 0.607, 0.546, 0.487, 0.43 , 0.375, 0.325, 0.278,
        0.236, 0.198, 0.164, 0.135, 0.11 , 0.089, 0.071, 0.056, 0.044,
        0.034, 0.026, 0.02 , 0.015, 0.011, 0.008, 0.006, 0.004, 0.003,
        0.002, 0.002, 0.001, 0.001, 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   ],
       [0.98 , 0.995, 1.   , 0.995, 0.98 , 0.956, 0.923, 0.882, 0.835,
        0.783, 0.726, 0.667, 0.607, 0.546, 0.487, 0.43 , 0.375, 0.325,
        0.278, 0.236, 0.198, 0.164, 0.135, 0.11 , 0.089, 0.071, 0.056,
        0.044, 0.034, 0.026, 0.02 , 0.015, 0.011, 0.008, 0.006, 0.004,
        0.003, 0.002, 0.002, 0.001, 0.001, 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   ]])
import matplotlib.pyplot as plt
plt.imshow(Sigma, cmap='hot', interpolation='nearest')
plt.colorbar()
plt.show()

6.9.4 Generation of Random Samples and Plotting the Realizations of the Random Function

In the context of the multivariate normal distribution, the next step is to utilize the previously constructed covariance matrix denoted as Sigma. It is used as an essential component in generating random samples from the multivariate normal distribution.

The function multivariate_normal is employed for this purpose. It serves as a random number generator specifically designed for the multivariate normal distribution. In this case, the mean of the distribution is set equal to mean, and the covariance matrix is provided as Psi. The argument size specifies the number of realizations, which, in this specific scenario, is set to one.

By default, the mean vector is initialized to zero. To match the number of samples, which is equivalent to the number of rows in the X and Sigma matrices, the argument zeros(n) is used, where n represents the number of samples (here taken from the size of the matrix, e.g.,: Sigma.shape[0]).

rng = np.random.default_rng(seed=12345)
Y = rng.multivariate_normal(zeros(Sigma.shape[0]), Sigma, size = 1, check_valid="raise").reshape(-1,1)
Y.shape
(100, 1)

Now we can plot the results, i.e., a finite realization of the random function \(Y()\) under a GP prior with a particular covariance structure. We will plot those X and Y pairs as connected points on an \(x\)-\(y\) plane.

import matplotlib.pyplot as plt
plt.plot(X, Y)
plt.title("Realization of Random Functions under a GP prior.\n sigma2: {}".format(sigma2[0]))
plt.show()
Figure 6.7: Realization of one random function under a GP prior. sigma2: 1.0
rng = np.random.default_rng(seed=12345)
Y = rng.multivariate_normal(zeros(Sigma.shape[0]), Sigma, size = 3, check_valid="raise")
plt.plot(X, Y.T)
plt.title("Realization of Three Random Functions under a GP prior.\n sigma2: {}".format(sigma2[0]))
plt.show()
Figure 6.8: Realization of three random functions under a GP prior. sigma2: 1.0

6.9.5 Properties of the 1d Example

6.9.5.1 Several Bumps:

In this analysis, we observe several bumps in the \(x\)-range of \([0,10]\). These bumps in the function occur because shorter distances exhibit high correlation, while longer distances tend to be essentially uncorrelated. This leads to variations in the function’s behavior:

  • When \(x\) and \(x'\) are one \(\sigma\) unit apart, the correlation is \(\exp\left(-\sigma^2 / (2\sigma^2)\right) = \exp(-1/2) \approx 0.61\), i.e., a relative high correlation.
  • \(2\sigma\) apart means correlation \(\exp(− 2^2 /2) \approx 0.14\), i.e., only small correlation.
  • \(4\sigma\) apart means correlation \(\exp(− 4^2 /2) \approx 0.0003\), i.e., nearly no correlation—variables are considered independent for almost all practical application.

6.9.5.2 Smoothness:

The function plotted in Figure 6.7 represents only a finite realization, which means that we have data for a limited number of pairs, specifically 100 points. These points appear smooth in a tactile sense because they are closely spaced, and the plot function connects the dots with lines to create the appearance of smoothness. The complete surface, which can be conceptually extended to an infinite realization over a compact domain, is exceptionally smooth in a calculus sense due to the covariance function’s property of being infinitely differentiable.

6.9.5.3 Scale of Two:

Regarding the scale of the \(Y\) values, they have a range of approximately \([-2,2]\), with a 95% probability of falling within this range. In standard statistical terms, 95% of the data points typically fall within two standard deviations of the mean, which is a common measure of the spread or range of data.

import numpy as np
from numpy import array, zeros, power, ones, exp, multiply, eye, linspace, spacing, sqrt, arange, append, ravel
from numpy.random import multivariate_normal

def build_Sigma(X, sigma2):
    n = X.shape[0]
    k = X.shape[1]
    D = zeros((k, n, n))
    for l in range(k):
        for i in range(n):
            for j in range(i, n):
                D[l, i, j] = 1/(2*sigma2[l])*(X[i,l] - X[j,l])**2
    D = sum(D)
    D = D + D.T
    return exp(-D)

def plot_mvn( a=0, b=10, sigma2=1.0, size=1, n=100, show=True):    
    X = np.linspace(a, b, n, endpoint=False).reshape(-1,1)
    sigma2 = np.array([sigma2])
    Sigma = build_Sigma(X, sigma2)
    rng = np.random.default_rng(seed=12345)
    Y = rng.multivariate_normal(zeros(Sigma.shape[0]), Sigma, size = size, check_valid="raise")
    plt.plot(X, Y.T)
    plt.title("Realization of Random Functions under a GP prior.\n sigma2: {}".format(sigma2[0]))
    if show:
        plt.show()
plot_mvn(a=0, b=10, sigma2=10.0, size=3, n=250)
Figure 6.9: Realization of Random Functions under a GP prior. sigma2: 10
plot_mvn(a=0, b=10, sigma2=0.1, size=3, n=250)
Figure 6.10: Realization of Random Functions under a GP prior. sigma2: 0.1

6.10 Jupyter Notebook

Note