6  Kriging (Gaussian Process Regression)

6.1 DACE and RSM

Mathematical models implemented in computer codes are used to circumvent the need for expensive field data collection. These models are particularly useful when dealing with highly nonlinear response surfaces, high signal-to-noise ratios (which often involve deterministic evaluations), and a global scope. As a result, a new approach is required in comparison to Response Surface Methodology (RSM), which was discussed in Section 5.1.

With the improvement in computing power and simulation fidelity, researchers gain higher confidence and a better understanding of the dynamics in physical, biological, and social systems. However, the expansion of configuration spaces and increasing input dimensions necessitates more extensive designs. High-performance computing (HPC) allows for thousands of runs, whereas previously only tens were possible. This shift towards larger models and training data presents new computational challenges.

Research questions for DACE (Design and Analysis of Computer Experiments) include how to design computer experiments that make efficient use of computation and how to meta-model computer codes to save on simulation effort. The choice of surrogate model for computer codes significantly impacts the optimal experiment design, and the preferred model-design pairs can vary depending on the specific goal.

The combination of computer simulation, design, and modeling with field data from similar real-world experiments introduces a new category of computer model tuning problems. The ultimate goal is to automate these processes to the greatest extent possible, allowing for the deployment of HPC with minimal human intervention.

One of the remaining differences between RSM and DACE lies in how they handle noise. DACE employs replication, a technique that would not be used in a deterministic setting, to separate signal from noise. Traditional RSM is best suited for situations where a substantial proportion of the variability in the data is due to noise, and where the acquisition of data values can be severely limited. Consequently, RSM is better suited for a different class of problems, aligning with its intended purposes.

Two very good texts on computer experiments and surrogate modeling are Santner, Williams, and Notz (2003) and Forrester, Sóbester, and Keane (2008). The former is the canonical reference in the statistics literature and the latter is perhaps more popular in engineering.

Example 6.1 (Example: DACE and RSM) Imagine you are a chemical engineer tasked with optimizing a chemical process to maximize yield. You can control temperature and pressure, but repeated experiments show variability in yield due to inconsistencies in raw materials.

  • Using RSM: You would use RSM to design a series of experiments varying temperature and pressure. You would then fit a response surface (a mathematical model) to the data, helping you understand how changes in temperature and pressure affect yield. Using this model, you can identify optimal conditions for maximizing yield despite the noise.

  • Using DACE: If instead you use a computational model to simulate the chemical process and want to account for numerical noise or uncertainty in model parameters, you might use DACE. You would run simulations at different conditions, possibly repeating them to assess variability and build a surrogate model that accurately predicts yields, which can be optimized to find the best conditions.

6.1.1 Noise Handling in RSM and DACE

Noise in RSM: In experimental settings, noise often arises due to variability in experimental conditions, measurement errors, or other uncontrollable factors. This noise can significantly affect the response variable, \(Y\). Replication is a standard procedure for handling noise in RSM. In the context of computer experiments, noise might not be present in the traditional sense since simulations can be deterministic. However, variability can arise from uncertainty in input parameters or model inaccuracies. DACE predominantly utilizes advanced interpolation to construct accurate models of deterministic data, sometimes considering statistical noise modeling if needed.

6.2 Background: Expectation, Mean, Standard Deviation

The distribution of a random vector is characterized by some indexes. These are the expectation, the mean, and the standard deviation. The expectation is a measure of the central tendency of a random variable, while the standard deviation quantifies the spread of the distribution. These indexes are essential for understanding the behavior of random variables and making predictions based on them.

Definition 6.1 (Random Variable) A random variable \(X\) is a mapping from the sample space of a random experiment to the real numbers. It assigns a numerical value to each outcome of the experiment. Random variables can be either:

  • Discrete: If \(X\) takes on a countable number of distinct values.
  • Continuous: If \(X\) takes on an uncountable number of values.

Mathematically, a random variable is a function \(X: \Omega \rightarrow \mathbb{R}\), where \(\Omega\) is the sample space.

Definition 6.2 (Probability Distribution) A probability distribution describes how the values of a random variable are distributed. It is characterized for a discrete random variable \(X\) by the probability mass function (PMF) \(p_X(x)\) and for a continuous random variable \(X\) by the probability density function (PDF) \(f_X(x)\).

Definition 6.3 (Probability Mass Function (PMF)) \(p_X(x) = P(X = x)\) gives the probability that \(X\) takes the value \(x\).

Definition 6.4 (Probability Density Function (PDF):) \(f_X(x)\) is a function such that for any interval \([a, b]\), the probability that \(X\) falls within this interval is given by the integral \(\int_a^b f_X(x) \mathrm{d}x\).

The distribution function must satisfy: \[ \sum_{x \in D_X} p_X(x) = 1 \] for discrete random variables, where \(D_X\) is the domain of \(X\) and \[ \int_{-\infty}^{\infty} f_X(x) \mathrm{d}x = 1 \] for continuous random variables.

With these definitions in place, we can now introduce the definition of the expectation, which is a fundamental measure of the central tendency of a random variable.

Definition 6.5 (Expectation) The expectation or expected value of a random variable \(X\), denoted \(E[X]\), is defined as follows:

For a discrete random variable \(X\): \[ E[X] = \sum_{x \in D_X} x p_X(x) \quad \text{if $X$ is discrete}. \]

For a continuous random variable \(X\): \[ E[X] = \int_{x \in D_X} x f_X(x) \mathrm{d}x \quad \text{if $X$ is continuous.} \]

The mean, \(\mu\), of a probability distribution is a measure of its central tendency or location. That is, \(E(X)\) is defined as the average of all possible values of \(X\), weighted by their probabilities.

Example 6.2 (Expectation) Let \(X\) denote the number produced by rolling a fair die. Then \[ E(X) = 1 \times 1/6 + 2 \times 1/6 + 3 \times 1/6 + 4 \times 1/6 + 5 \times 1/6 + 6\times 1/6 = 3.5. \]

Definition 6.6 (Sample Mean) The sample mean is an important estimate of the population mean. The sample mean of a sample \(\{x_i\}\) (\(i=1,2,\ldots,n\)) is defined as \[ \overline{x} = \frac{1}{n} \sum_i x_i. \]

While both the expectation of a random variable and the sample mean provide measures of central tendency, they differ in their context, calculation, and interpretation.

  • The expectation is a theoretical measure that characterizes the average value of a random variable over an infinite number of repetitions of an experiment. The expectation is calculated using a probability distribution and provides a parameter of the entire population or distribution. It reflects the long-term average or central value of the outcomes generated by the random process.
  • The sample mean is a statistic. It provides an estimate of the population mean based on a finite sample of data. It is computed directly from the data sample, and its value can vary between different samples from the same population. It serves as an approximation or estimate of the population mean. It is used in statistical inference to make conclusions about the population mean based on sample data.

If we are trying to predict the value of a random variable \(X\) by its mean \(\mu = E(X)\), the error will be \(X-\mu\). In many situations it is useful to have an idea how large this deviation or error is. Since \(E(X-\mu) = E(X) -\mu = 0\), it is necessary to use the absolute value or the square of (\(X-\mu\)). The squared error is the first choice, because the derivatives are easier to calculate. These considerations motivate the definition of the variance:

Definition 6.7 (Variance) The variance of a random variable \(X\) is the mean squared deviation of \(X\) from its expected value \(\mu = E(X)\). \[\begin{equation} Var(X) = E[ (X-\mu)^2]. \end{equation}\]

The variance is a measure of the spread of a distribution. It quantifies how much the values of a random variable differ from the mean. A high variance indicates that the values are spread out over a wide range, while a low variance indicates that the values are clustered closely around the mean.

Definition 6.8 (Standard Deviation) Taking the square root of the variance to get back to the same scale of units as \(X\) gives the standard deviation. The standard deviation of \(X\) is the square root of the variance of \(X\). \[\begin{equation} sd(X) = \sqrt{Var(X)}. \end{equation}\]

6.2.1 Calculation of the Standard Deviation with Python

The function numpy.std returns the standard deviation, a measure of the spread of a distribution, of the array elements. The argument ddof specifies the Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero, i.e., std uses the formula \[ \sqrt{ \frac{1}{N} \sum_i \left( x_i - \bar{x} \right)^2 } \qquad \text{with } \quad \bar{x} = \sum_{i=1}^N x_i /N. \]

Example 6.3 (Standard Deviation with Python) Consider the array \([1,2,3]\): Since \(\bar{x} = 2\), the following value is computed: \[ \sqrt{1/3 \times \left( (1-2)^2 + (2-2)^2 + (3-2)^2 \right)} = \sqrt{2/3}.\]

import numpy as np
a = np.array([[1, 2, 3]])
np.std(a)
0.816496580927726

The empirical standard deviation (which uses \(N-1\)), \(\sqrt{1/2 \times \left( (1-2)^2 + (2-2)^2 + (3-2)^2 \right)} = \sqrt{2/2}\), can be calculated in Python as follows:

np.std(a, ddof=1)
1.0

6.2.2 The Argument “axis”

When you compute np.std with axis=0, it calculates the standard deviation along the vertical axis, meaning it computes the standard deviation for each column of the array. On the other hand, when you compute np.std with axis=1, it calculates the standard deviation along the horizontal axis, meaning it computes the standard deviation for each row of the array. If the axis parameter is not specified, np.std computes the standard deviation of the flattened array, i.e., it calculates the standard deviation of all the elements in the array.

Example 6.4 (Axes along which the standard deviation is computed)  

A = np.array([[1, 2], [3, 4]])
A
array([[1, 2],
       [3, 4]])

First, we calculate the standard deviation of all elements in the array:

np.std(A)
1.118033988749895

Setting axis=0 calculates the standard deviation along the vertical axis (column-wise):

np.std(A, axis=0)
array([1., 1.])

Finally, setting axis=1 calculates the standard deviation along the horizontal axis (row-wise):

np.std(A, axis=1)
array([0.5, 0.5])

6.3 Data Types and Precision in Python

The float16 data type in numpy represents a half-precision floating point number. It uses 16 bits of memory, which gives it a precision of about 3 decimal digits.

The float32 data type in numpy represents a single-precision floating point number. It uses 32 bits of memory, which gives it a precision of about 7 decimal digits. On the other hand, float64 represents a double-precision floating point number. It uses 64 bits of memory, which gives it a precision of about 15 decimal digits.

The reason float16 and float32 show fewer digits is because it has less precision due to using less memory. The bits of memory are used to store the sign, exponent, and fraction parts of the floating point number, and with fewer bits, you can represent fewer digits accurately.

Example 6.5 (16 versus 32 versus 64 bit)  

import numpy as np

# Define a number
num = 0.123456789123456789

num_float16 = np.float16(num)
num_float32 = np.float32(num)
num_float64 = np.float64(num)

print("float16: ", num_float16) 
print("float32: ", num_float32)
print("float64: ", num_float64)
float16:  0.1235
float32:  0.12345679
float64:  0.12345678912345678

6.4 Distributions and Random Numbers in Python

Results from computers are deterministic, so it sounds like a contradiction in terms to generate random numbers on a computer. Standard computers generate pseudo-randomnumbers, i.e., numbers that behave as if they were drawn randomly.

Deterministic Random Numbers
  • Idea: Generate deterministically numbers that look (behave) as if they were drawn randomly.

6.4.1 The Uniform Distribution

Definition 6.9 (The Uniform Distribution) The probability density function of the uniform distribution is defined as: \[ f_X(x) = \frac{1}{b-a} \qquad \text{for $x \in [a,b]$}. \]

Generate 10 random numbers from a uniform distribution between \(a=0\) and \(b=1\):

import numpy as np
# Initialize the random number generator
rng = np.random.default_rng(seed=123456789)
n = 10
x = rng.uniform(low=0.0, high=1.0, size=n)
x
array([0.02771274, 0.90670006, 0.88139355, 0.62489728, 0.79071481,
       0.82590801, 0.84170584, 0.47172795, 0.95722878, 0.94659153])

Generate 10,000 random numbers from a uniform distribution between 0 and 10 and plot a histogram of the numbers:

import numpy as np
import matplotlib.pyplot as plt

# Initialize the random number generator
rng = np.random.default_rng(seed=123456789)

# Generate random numbers from a uniform distribution
x = rng.uniform(low=0, high=10, size=10000)

# Plot a histogram of the numbers
plt.hist(x, bins=50, density=True, edgecolor='black')
plt.title('Uniform Distribution [0,10]')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

6.4.2 The Normal Distribution

A normally distributed random variable is a random variable whose associated probability distribution is the normal (or Gaussian) distribution. The normal distribution is a continuous probability distribution characterized by a symmetric bell-shaped curve.

The distribution is defined by two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). The mean indicates the center of the distribution, while the standard deviation measures the spread or dispersion of the distribution.

This distribution is widely used in statistics and the natural and social sciences as a simple model for random variables with unknown distributions.

Definition 6.10 (The Normal Distribution) The probability density function of the normal distribution is defined as: \[ f_X(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2\right), \tag{6.1}\] where: \(\mu\) is the mean; \(\sigma\) is the standard deviation.

To generate ten random numbers from a normal distribution, the following command can be used.

import numpy as np
rng = np.random.default_rng()
n = 10
mu, sigma = 2, 0.1
x = rng.normal(mu, sigma, n)
x
array([2.09115428, 1.92718096, 1.92709163, 2.11163316, 1.96280613,
       2.11871574, 1.81134168, 2.09231401, 1.91625661, 1.975059  ])

Verify the mean:

abs(mu - np.mean(x))
0.00664468035341903

Note: To verify the standard deviation, we use ddof = 1 (empirical standard deviation):

abs(sigma - np.std(x, ddof=1))
0.0044421406874282365
plot_normal_distribution(mu=0, sigma=1, num_samples=10000)

6.4.3 Visualization of the Standard Deviation

The standard deviation of normal distributed can be visualized in terms of the histogram of \(X\):

  • about 68% of the values will lie in the interval within one standard deviation of the mean
  • 95% lie within two standard deviation of the mean
  • and 99.9% lie within 3 standard deviations of the mean.

6.4.4 Standardization of Random Variables

To compare statistical properties of random variables which use different units, it is a common practice to transform these random variables into standardized variables.

Definition 6.11 (Standard Units) If a random variable \(X\) has expectation \(E(X) = \mu\) and standard deviation \(sd(X) = \sigma >0\), the random variable \[ X^{\ast} = (X-\mu)/\sigma \] is called \(X\) in standard units. It has \(E(X^{\ast}) = 0\) and \(sd(X^{\ast}) =1\).

6.4.5 Realizations of a Normal Distribution

Realizations of a normal distribution refers to the actual values that you get when you draw samples from a normal distribution. Each sample drawn from the distribution is a realization of that distribution.

Example 6.6 (Realizations of a Normal Distribution) If you have a normal distribution with a mean of 0 and a standard deviation of 1, each number you draw from that distribution is a realization. Here is a Python example that generates 10 realizations of a normal distribution with a mean of 0 and a standard deviation of 1:

import numpy as np
mu = 0
sigma = 1
realizations = np.random.normal(mu, sigma, 10)
print(realizations)
[ 0.48951662  0.23879586 -0.44811181 -0.610795   -2.02994507  0.60794659
 -0.35410888  0.15258149  0.50127485 -0.78640277]

In this code, np.random.normal generates ten realizations of a normal distribution with a mean of 0 and a standard deviation of 1. The realizations array contains the actual values drawn from the distribution.

6.4.6 The Multivariate Normal Distribution

The multivariate normal, multinormal, or Gaussian distribution serves as a generalization of the one-dimensional normal distribution to higher dimensions. We will consider \(k\)-dimensional random vectors \(X = (X_1, X_2, \ldots, X_k)\). When drawing samples from this distribution, it results in a set of values represented as \(\{x_1, x_2, \ldots, x_k\}\). To fully define this distribution, it is necessary to specify its mean \(\mu\) and covariance matrix \(\Sigma\). These parameters are analogous to the mean, which represents the central location, and the variance (squared standard deviation) of the one-dimensional normal distribution introduced in Equation 6.1.

Definition 6.12 (The Multivariate Normal Distribution) The probability density function (PDF) of the multivariate normal distribution is defined as: \[ f_X(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\left(-\frac{1}{2} (x-\mu)^T\Sigma^{-1} (x-\mu)\right), \] where: \(\mu\) is the \(k \times 1\) mean vector; \(\Sigma\) is the \(k \times k\) covariance matrix. The covariance matrix \(\Sigma\) is assumed to be positive definite, so that its determinant is strictly positive.

In the context of the multivariate normal distribution, the mean takes the form of a coordinate within an \(k\)-dimensional space. This coordinate represents the location where samples are most likely to be generated, akin to the peak of the bell curve in a one-dimensional or univariate normal distribution.

Definition 6.13 (Covariance of two random variables) For two random variables \(X\) and \(Y\), the covariance is defined as the expected value (or mean) of the product of their deviations from their individual expected values: \[ \operatorname{cov}(X, Y) = \operatorname{E}{\big[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])\big]} \]

For discrete random variables, covariance can be written as: \[ \operatorname{cov} (X,Y) = \frac{1}{n}\sum_{i=1}^n (x_i-E(X)) (y_i-E(Y)). \]

The covariance within the multivariate normal distribution denotes the extent to which two variables vary together. The elements of the covariance matrix, such as \(\Sigma_{ij}\), represent the covariances between the variables \(x_i\) and \(x_j\). These covariances describe how the different variables in the distribution are related to each other in terms of their variability.

Example 6.7 (The Bivariate Normal Distribution with Positive Covariances) Figure 6.1 shows draws from a bivariate normal distribution with \(\mu = \begin{pmatrix}0 \\ 0\end{pmatrix}\) and \(\Sigma=\begin{pmatrix} 9 & 4 \\ 4 & 9 \end{pmatrix}\).

Figure 6.1: Bivariate Normal. Mean zero and covariance \(\Sigma=\begin{pmatrix} 9 & 4 \\ 4 & 9\end{pmatrix}\)

The covariance matrix of a bivariate normal distribution determines the shape, orientation, and spread of the distribution in the two-dimensional space.

The diagonal elements of the covariance matrix (\(\sigma_1^2\), \(\sigma_2^2\)) are the variances of the individual variables. They determine the spread of the distribution along each axis. A larger variance corresponds to a greater spread along that axis.

The off-diagonal elements of the covariance matrix (\(\sigma_{12}, \sigma_{21}\)) are the covariances between the variables. They determine the orientation and shape of the distribution. If the covariance is positive, the distribution is stretched along the line \(y=x\), indicating that the variables tend to increase together. If the covariance is negative, the distribution is stretched along the line \(y=-x\), indicating that one variable tends to decrease as the other increases. If the covariance is zero, the variables are uncorrelated and the distribution is axis-aligned.

In Figure 6.1, the variances are identical and the variables are correlated (covariance is 4), so the distribution is stretched along the line \(y=x\).

Figure 6.2: Bivariate Normal. Mean zero and covariance \(\Sigma=\begin{pmatrix} 9 & 4 \\ 4 & 9\end{pmatrix}\).

Example 6.8 (The Bivariate Normal Distribution with Mean Zero and Zero Covariances) The Bivariate Normal Distribution with Mean Zero and Zero Covariances \(\sigma_{12} = \sigma_{21} = 0\).

\(\Sigma=\begin{pmatrix} 9 & 0 \\ 0 & 9\end{pmatrix}\)

Figure 6.3: Bivariate Normal. Mean zero and covariance \(\Sigma=\begin{pmatrix} 9 & 0 \\ 0 & 9\end{pmatrix}\)

Example 6.9 (The Bivariate Normal Distribution with Mean Zero and Negative Covariances) The Bivariate Normal Distribution with Mean Zero and Negative Covariances \(\sigma_{12} = \sigma_{21} = -4\).

\(\Sigma=\begin{pmatrix} 9 & -4 \\ -4 & 9\end{pmatrix}\)

Figure 6.4: Bivariate Normal. Mean zero and covariance \(\Sigma=\begin{pmatrix} 9 & -4 \\ -4 & 9\end{pmatrix}\)

6.5 Covariance

In statistics, understanding the relationship between random variables is crucial for making inferences and predictions. Two common measures of such relationships are covariance and correlation. Covariance is a measure of how much two random variables change together. If the variables tend to show similar behavior (i.e., when one increases, the other tends to increase), the covariance is positive. Conversely, if they tend to move in opposite directions, the covariance is negative.

Definition 6.14 (Covariance) Covariance is calculated as:

\[ \text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] \]

Here, \(E[X]\) and \(E[Y]\) are the expected values (means) of \(X\) and \(Y\), respectively. Covariance has units that are the product of the units of \(X\) and \(Y\).

For a vector of random variables \(\mathbf{Y} = \begin{pmatrix} Y^{(1)}, \ldots, Y^{(n)} \end{pmatrix}^T\), the covariance matrix \(\Sigma\) encapsulates the covariances between each pair of variables:

\[ \Sigma = \text{Cov}(\mathbf{Y}, \mathbf{Y}) = \begin{pmatrix} \text{Var}(Y^{(1)}) & \text{Cov}(Y^{(1)}, Y^{(2)}) & \ldots \\ \text{Cov}(Y^{(2)}, Y^{(1)}) & \text{Var}(Y^{(2)}) & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} \]

The diagonal elements represent the variances, while the off-diagonal elements are the covariances.

6.6 Correlation

6.6.1 Definitions

Definition 6.15 ((Pearson) Correlation Coefficient) The Pearson correlation coefficient, often denoted by \(\rho\) for the population or \(r\) for a sample, is calculated by dividing the covariance of two variables by the product of their standard deviations.

\[ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}, \tag{6.2}\]

where \(\text{Cov}(X, Y)\) is the covariance between variables \(X\) and \(Y\), and \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively.

Correlation, specifically the correlation coefficient, is a normalized measure of the linear relationship between two variables. It provides a value ranging from \(-1\) to \(1\), which is scale-free, making it easier to interpret:

  • \(-1\): Perfect negative correlation, indicating that as one variable increases, the other decreases.
  • \(0\): No correlation, indicating no linear relationship between the variables.
  • \(1\): Perfect positive correlation, indicating that both variables increase together.

The correlation matrix \(\Psi\) provides a way to quantify the linear relationship between multiple variables, extending the notion of the correlation coefficient beyond just pairs of variables. It is derived from the covariance matrix \(\Sigma\) by normalizing each element with respect to the variances of the relevant variables.

Definition 6.16 (The Correlation Matrix \(\Psi\)) Given a set of random variables \(X_1, X_2, \ldots, X_n\), the covariance matrix \(\Sigma\) is:

\[ \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix}, \]

where \(\sigma_{ij} = \text{cov}(X_i, X_j)\) is the covariance between the \(i^{\text{th}}\) and \(j^{\text{th}}\) variables. The correlation matrix \(\Psi\) is then defined as:

\[ \Psi = \begin{pmatrix} \rho_{ij} \end{pmatrix} = \begin{pmatrix} \frac{\sigma_{ij}}{\sqrt{\sigma_{ii} \sigma_{jj}}} \end{pmatrix}, \]

where:

  • \(\rho_{ij}\) is the correlation coefficient between the \(i^{\text{th}}\) and \(j^{\text{th}}\) variables.
  • \(\sigma_{ii}\) is the variance of the \(i^{\text{th}}\) variable, i.e., \(\sigma_i^2\).
  • \(\sqrt{\sigma_{ii}}\) is the standard deviation of the \(i^{\text{th}}\) variable, denoted as \(\sigma_i\).

Thus, \(\Psi\) can also be expressed as:

\[ \Psi = \begin{pmatrix} 1 & \frac{\sigma_{12}}{\sigma_1 \sigma_2} & \cdots & \frac{\sigma_{1n}}{\sigma_1 \sigma_n} \\ \frac{\sigma_{21}}{\sigma_2 \sigma_1} & 1 & \cdots & \frac{\sigma_{2n}}{\sigma_2 \sigma_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\sigma_{n1}}{\sigma_n \sigma_1} & \frac{\sigma_{n2}}{\sigma_n \sigma_2} & \cdots & 1 \end{pmatrix} \]

The correlation matrix \(\Psi\) has the following properties:

  • The matrix \(\Psi\) is symmetric, meaning \(\rho_{ij} = \rho_{ji}\).
  • The diagonal elements are all 1, as \(\rho_{ii} = \frac{\sigma_{ii}}{\sigma_i \sigma_i} = 1\).
  • Each off-diagonal element is constrained between -1 and 1, indicating the strength and direction of the linear relationship between pairs of variables.

6.6.2 Computations

Example 6.10 (Computing a Correlation Matrix) Suppose you have a dataset consisting of three variables: \(X\), \(Y\), and \(Z\). You can compute the correlation matrix as follows:

  1. Calculate the covariance matrix \(\Sigma\), which contains covariances between all pairs of variables.
  2. Extract the standard deviations for each variable from the diagonal elements of \(\Sigma\).
  3. Use the standard deviations to compute the correlation matrix \(\Psi\).

Suppose we have two sets of data points:

  • \(X = [1, 2, 3]\)
  • \(Y = [4, 5, 6]\)

We want to compute the correlation matrix \(\Psi\) for these variables. First, calculate the mean of each variable.

\[ \bar{X} = \frac{1 + 2 + 3}{3} = 2 \]

\[ \bar{Y} = \frac{4 + 5 + 6}{3} = 5 \]

Second, compute the covariance between \(X\) and \(Y\). The covariance is calculated as:

\[ \text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \]

For our data:

\[ \text{Cov}(X, Y) = \frac{1}{3-1} \left[(1 - 2)(4 - 5) + (2 - 2)(5 - 5) + (3 - 2)(6 - 5)\right] \]

\[ = \frac{1}{2} \left[1 + 0 + 1\right] = 1 \]

Third, calculate the variances of \(X\) and \(Y\). Variance is calculated as:

\[ \text{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

\[ = \frac{1}{2} \left[(1-2)^2 + (2-2)^2 + (3-2)^2\right] = \frac{1}{2} (1 + 0 + 1) = 1 \]

Similarly,

\[ \text{Var}(Y) = \frac{1}{2} \left[(4-5)^2 + (5-5)^2 + (6-5)^2\right] = \frac{1}{2} (1 + 0 + 1) = 1 \]

Then, compute the correlation coefficient. The correlation coefficient \(\rho_{XY}\) is:

\[ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)} \cdot \sqrt{\text{Var}(Y)}} \]

\[ = \frac{1}{\sqrt{1} \cdot \sqrt{1}} = 1 \]

Finally, construct the correlation matrix. The correlation matrix \(\Psi\) is given as:

\[ \Psi = \begin{pmatrix} 1 & \rho_{XY} \\ \rho_{XY} & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \]

Thus, for these two variables, the correlation matrix indicates a perfect positive linear relationship (correlation coefficient of 1) between \(X\) and \(Y\).

6.6.3 The Outer-product and the np.outer Function

The function np.outer from the NumPy library computes the outer product of two vectors. The outer product of two vectors results in a matrix, where each element is the product of an element from the first vector and an element from the second vector.

Definition 6.17 (Outer Product) For two vectors \(\mathbf{a}\) and \(\mathbf{b}\), the outer product is defined in terms of their elements as:

\[ \text{outer}(\mathbf{a}, \mathbf{b}) = \begin{pmatrix} a_1 \cdot b_1 & a_1 \cdot b_2 & \cdots & a_1 \cdot b_n \\ a_2 \cdot b_1 & a_2 \cdot b_2 & \cdots & a_2 \cdot b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_m \cdot b_1 & a_m \cdot b_2 & \cdots & a_m \cdot b_n \end{pmatrix}, \]

where \(\mathbf{a}\) is a vector of length \(m\) and \(\mathbf{b}\) is a vector of length \(n\).

Example 6.11 (Computing the Outer Product) We will consider two vectors, \(\mathbf{a}\) and \(\mathbf{b}\):

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5])
outer_product = np.outer(a, b)
print("Vector a:", a)
print("Vector b:", b)
print("Outer Product:\n", outer_product)
Vector a: [1 2 3]
Vector b: [4 5]
Outer Product:
 [[ 4  5]
 [ 8 10]
 [12 15]]

For the vectors defined:

\[ \mathbf{a} = [1, 2, 3], \quad \mathbf{b} = [4, 5] \]

The outer product will be:

\[ \begin{pmatrix} 1 \cdot 4 & 1 \cdot 5 \\ 2 \cdot 4 & 2 \cdot 5 \\ 3 \cdot 4 & 3 \cdot 5 \end{pmatrix} = \begin{pmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{pmatrix}. \]

Thus, np.outer creates a matrix with dimensions \(m \times n\), where \(m\) is the length of the first vector and \(n\) is the length of the second vector. The function is particularly useful in various mathematical and scientific computations where matrix representations of vector relationships are needed.

Example 6.12 (Computing the Covariance and the Correlation Matrix) The following Python code computes the covariance and correlation matrices using the NumPy library.

import numpy as np

def calculate_cov_corr_matrices(data, rowvar=False)->(np.array, np.array):
    """
    Calculate the covariance and correlation matrices of the input data.

    Args:
        data (np.array):
            Input data array.
        rowvar (bool):
            Whether the data is row-wise or column-wise.
            Default is False (column-wise).

    Returns:
        np.array: Covariance matrix.
        np.array: Correlation matrix.

    Examples:
        >>> data = np.array([[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9]])
        >>> calculate_cov_corr_matrices(data)
    """   
    cov_matrix = np.cov(data, rowvar=rowvar)
    std_devs = np.sqrt(np.diag(cov_matrix))
    # check whether the standard deviations are zero
    # and throw an error if they are
    if np.any(std_devs == 0):
        raise ValueError("Correlation matrix cannot be computed,"+
                         " because one or more variables have zero variance.")
    corr_matrix = cov_matrix / np.outer(std_devs, std_devs)
    return cov_matrix, corr_matrix
A = np.array([[0,1],
                 [1,0]])
print(f"Input matrix:\n {A}")
Sigma, Psi = calculate_cov_corr_matrices(A)
print(f"Covariance matrix:\n {Sigma}")
print(f"Correlation matrix:\n {Psi}")
Input matrix:
 [[0 1]
 [1 0]]
Covariance matrix:
 [[ 0.5 -0.5]
 [-0.5  0.5]]
Correlation matrix:
 [[ 1. -1.]
 [-1.  1.]]

6.6.4 Correlation and Independence

Definition 6.18 (Statistical Independence (Independence of Random Vectors)) Two random vectors are statistically independent if the joint distribution of the vectors is equal to the product of their marginal distributions.

This means that knowing the realization of one vector gives you no information about the realization of the other vector. This independence is a probabilistic concept used in statistics and probability theory to denote that two sets of random variables do not affect each other. Independence implies that all pairwise covariances between the components of the two vectors are zero, but zero covariance does not imply independence unless certain conditions are met (e.g., normality). Statistical independence is a stronger condition than zero covariance. Statistical independence is not related to the linear independence of vectors in linear algebra.

Example 6.13 (Covariance of Independent Variables) Consider a covariance matrix where variables are independent:

A = np.array([[1,-1],
[2,0],
[3,1],
[4,0],
[5,-1]])
print(f"Input matrix:\n {A}")
Sigma, Psi = calculate_cov_corr_matrices(A)
print(f"Covariance matrix:\n {Sigma}")
print(f"Correlation matrix:\n {Psi}")
Input matrix:
 [[ 1 -1]
 [ 2  0]
 [ 3  1]
 [ 4  0]
 [ 5 -1]]
Covariance matrix:
 [[2.5 0. ]
 [0.  0.7]]
Correlation matrix:
 [[1. 0.]
 [0. 1.]]

Here, since the off-diagonal elements are 0, the variables are uncorrelated. \(X\) increases linearly, while \(Y\) alternates in a simple pattern with no trend that is linearly related to \(Y\).

Example 6.14 (Strong Correlation) For a covariance matrix with strong positive correlation:

A = np.array([[10,-1],
[20,0],
[30,1],
[40,2],
[50,3]])
print(f"Input matrix:\n {A}")
Sigma, Psi = calculate_cov_corr_matrices(A)
print(f"Covariance matrix:\n {Sigma}")
print(f"Correlation matrix:\n {Psi}") 
Input matrix:
 [[10 -1]
 [20  0]
 [30  1]
 [40  2]
 [50  3]]
Covariance matrix:
 [[250.   25. ]
 [ 25.    2.5]]
Correlation matrix:
 [[1. 1.]
 [1. 1.]]

A value close to 1 suggests a strong positive relationship between the variables.

Example 6.15 (Strong Negative Correlation)  

A = np.array([[10,1],
[20,0],
[30,-1],
[40,-2],
[50,-3]])
print(f"Input matrix:\n {A}")
Sigma, Psi = calculate_cov_corr_matrices(A)
print(f"Covariance matrix:\n {Sigma}")
print(f"Correlation matrix:\n {Psi}") 
Input matrix:
 [[10  1]
 [20  0]
 [30 -1]
 [40 -2]
 [50 -3]]
Covariance matrix:
 [[250.  -25. ]
 [-25.    2.5]]
Correlation matrix:
 [[ 1. -1.]
 [-1.  1.]]

This matrix indicates a perfect negative correlation where one variable increases as the other decreases.

6.7 R-Squared in Simple Linear Regression

In simple linear regression, the relationship between the independent variable \(X\) and the dependent variable \(Y\) is modeled using the equation:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Here, \(\beta_0\) is the intercept, \(\beta_1\) is the slope or regression coefficient, and \(\epsilon\) is the error term.

Definition 6.19 (R-Squared (\(R^2\))) \(R^2\) is a measure of how well the regression model explains the variance in the dependent variable. It is calculated as the square of the correlation coefficient (\(r\)) between the actual values \(Y\) and the predicted values \(\hat{Y}\) from the regression model. It ranges from 0 to 1, where:

  • 1 indicates that the regression predictions perfectly fit the data.
  • 0 indicates that the model does not explain any of the variability in the target data around its mean.

In simple linear regression, where there is one independent variable \(X\) and one dependent variable \(Y\), the R-squared (\(R^2\)) is the square of the Pearson correlation coefficient (\(r\)) between the observed values of the dependent variable and the values predicted by the regression model. That is, in simple linear regression, we have
\[ R^2 = r^2. \]

This equivalence holds specifically for simple linear regression due to the direct relationship between the linear fit and the correlation of two variables. In multiple linear regression, while \(R^2\) still represents the proportion of variance explained by the model, it is not simply the square of a single correlation coefficient as it involves multiple predictors.

6.8 Cholesky Decomposition and Positive Definite Matrices

We consider the definiteness of a matrix, before discussing the Cholesky decomposition.

Definition 6.20 (Positive Definite Matrix) A symmetric matrix \(A\) is positive definite if all its eigenvalues are positive.

Example 6.16 (Positive Definite Matrix) Given a symmetric matrix \(A = \begin{pmatrix} 9 & 4 \\ 4 & 9 \end{pmatrix}\), the eigenvalues of \(A\) are \(\lambda_1 = 13\) and \(\lambda_2 = 5\). Since both eigenvalues are positive, the matrix \(A\) is positive definite.

Definition 6.21 (Negative Definite, Positive Semidefinite, and Negative Semidefinite Matrices) Similarily, a symmetric matrix \(A\) is negative definite if all its eigenvalues are negative. It is positive semidefinite if all its eigenvalues are non-negative, and negative semidefinite if all its eigenvalues are non-positive.

The covariance matrix must be positive definite for a multivariate normal distribution for a couple of reasons:

  • Semidefinite vs Definite: A covariance matrix is always symmetric and positive semidefinite. However, for a multivariate normal distribution, it must be positive definite, not just semidefinite. This is because a positive semidefinite matrix can have zero eigenvalues, which would imply that some dimensions in the distribution have zero variance, collapsing the distribution in those dimensions. A positive definite matrix has all positive eigenvalues, ensuring that the distribution has positive variance in all dimensions.
  • Invertibility: The multivariate normal distribution’s probability density function involves the inverse of the covariance matrix. If the covariance matrix is not positive definite, it may not be invertible, and the density function would be undefined.

In summary, the covariance matrix being positive definite ensures that the multivariate normal distribution is well-defined and has positive variance in all dimensions.

The definiteness of a matrix can be checked by examining the eigenvalues of the matrix. If all eigenvalues are positive, the matrix is positive definite.

import numpy as np

def is_positive_definite(matrix):
    return np.all(np.linalg.eigvals(matrix) > 0)

matrix = np.array([[9, 4], [4, 9]])
print(is_positive_definite(matrix))  # Outputs: True
True

However, a more efficient way to check the definiteness of a matrix is through the Cholesky decomposition.

Definition 6.22 (Cholesky Decomposition) For a given symmetric positive-definite matrix \(A \in \mathbb{R}^{n \times n}\), there exists a unique lower triangular matrix \(L \in \mathbb{R}^{n \times n}\) with positive diagonal elements such that:

\[ A = L L^T. \]

Here, \(L^T\) denotes the transpose of \(L\).

Example 6.17 (Cholesky decomposition using numpy) linalg.cholesky computes the Cholesky decomposition of a matrix, i.e., it computes a lower triangular matrix \(L\) such that \(LL^T = A\). If the matrix is not positive definite, an error (LinAlgError) is raised.

import numpy as np

# Define a Hermitian, positive-definite matrix
A = np.array([[9, 4], [4, 9]]) 

# Compute the Cholesky decomposition
L = np.linalg.cholesky(A)

print("L = \n", L)
print("L*LT = \n", np.dot(L, L.T))
L = 
 [[3.         0.        ]
 [1.33333333 2.68741925]]
L*LT = 
 [[9. 4.]
 [4. 9.]]

Example 6.18 (Cholesky Decomposition) Given a symmetric positive-definite matrix \(A = \begin{pmatrix} 9 & 4 \\ 4 & 9 \end{pmatrix}\), the Cholesky decomposition computes the lower triangular matrix \(L\) such that \(A = L L^T\). The matrix \(L\) is computed as: \[ L = \begin{pmatrix} 3 & 0 \\ 4/3 & 2 \end{pmatrix}, \] so that \[ L L^T = \begin{pmatrix} 3 & 0 \\ 4/3 & \sqrt{65}/3 \end{pmatrix} \begin{pmatrix} 3 & 4/3 \\ 0 & \sqrt{65}/3 \end{pmatrix} = \begin{pmatrix} 9 & 4 \\ 4 & 9 \end{pmatrix} = A. \]

An efficient implementation of the definiteness-check based on Cholesky is already available in the numpy library. It provides the np.linalg.cholesky function to compute the Cholesky decomposition of a matrix. This more efficient numpy-approach can be used as follows:

import numpy as np

def is_pd(K):
    try:
        np.linalg.cholesky(K)
        return True
    except np.linalg.linalg.LinAlgError as err:
        if 'Matrix is not positive definite' in err.message:
            return False
        else:
            raise
matrix = np.array([[9, 4], [4, 9]])
print(is_pd(matrix))  # Outputs: True
True

6.9 Maximum Likelihood Estimation: Multivariate Normal Distribution

6.9.1 The Joint Probability Density Function of the Multivariate Normal Distribution

Consider the first \(n\) terms of an identically and independently distributed (i.i..d.) sequence \({X^{(j)}}\) of \(k\)-dimensional multivariate normal random vectors, i.e., \[ X^{(j)} \sim N(\mu, \Sigma), j=1,2,\ldots. \tag{6.3}\]

The joint probability density function of the \(j\)-th term of the sequence is \[ f_X(x_j) = \frac{1}{\sqrt{(2\pi)^k \det(\Sigma)}} \exp\left(-\frac{1}{2} (x_j-\mu)^T\Sigma^{-1} (x_j-\mu)\right), \]

where: \(\mu\) is the \(k \times 1\) mean vector; \(\Sigma\) is the \(k \times k\) covariance matrix. The covariance matrix \(\Sigma\) is assumed to be positive definite, so that its determinant is strictly positive. We use \(x_1, \ldots x_n\), i.e., the realizations of the first \(n\) random vectors in the sequence, to estimate the two unknown parameters \(\mu\) and \(\Sigma\).

6.9.2 The Log-Likelihood Function

Definition 6.23 (Likelihood Function) The likelihood function is defined as the joint probability density function of the observed data, viewed as a function of the unknown parameters.

Since the terms in the sequence Equation 6.3 are independent, their joint density is equal to the product of their marginal densities. As a consequence, the likelihood function can be written as the product of the individual densities:

\[ L(\mu, \Sigma) = \prod_{j=1}^n f_X(x_j) = \prod_{j=1}^n \frac{1}{\sqrt{(2\pi)^k \det(\Sigma)}} \exp\left(-\frac{1}{2} (x_j-\mu)^T\Sigma^{-1} (x_j-\mu)\right) \] \[ = \frac{1}{(2\pi)^{nk/2} \det(\Sigma)^{n/2}} \exp\left(-\frac{1}{2} \sum_{j=1}^n (x_j-\mu)^T\Sigma^{-1} (x_j-\mu)\right). \tag{6.4}\]

Taking the natural logarithm of the likelihood function, we obtain the log-likelihood function:

Example 6.19 (Log-Likelihood Function of the Multivariate Normal Distribution) The log-likelihood function of the multivariate normal distribution is given by \[ \ell(\mu, \Sigma) = -\frac{nk}{2} \ln(2\pi) - \frac{n}{2} \ln(\det(\Sigma)) - \frac{1}{2} \sum_{j=1}^n (x_j-\mu)^T\Sigma^{-1} (x_j-\mu). \]

The likelihood function is well-defined only if \(\det(\Sigma)>0\).

6.10 Constructing a Surrogate

Note

This section is based on chapter 2 in Forrester, Sóbester, and Keane (2008).

Definition 6.24 (Black Box Problem) We are trying to learn a mapping that converts the vector \(\mathbf{x}\) into a scalar output \(y\), i.e., we are trying to learn a function \[ y = f(x). \] If function is hidden (“lives in a black box”), so that the physics of the problem is not known, the problem is called a black box problem.

This black box could take the form of either a physical or computer experiment, for example, a finite element code, which calculates the maximum stress (\(\sigma\)) for given product dimensions (\(\mathbf{x}\)).

Definition 6.25 (Generic Solution) The generic solution method is to collect the output values \(y^{(1)}\), \(y^{(2)}\), , \(y^{(n)}\) that result from a set of inputs \(\mathbf{x}^{(1)}\), \(\mathbf{x}^{(2)}\), , \(\mathbf{x}^{(n)}\) and find a best guess \(\hat{f}(\mathbf{x})\) for the black box mapping \(f\), based on these known observations.

6.10.1 Stage One: Preparing the Data and Choosing a Modelling Approach

The first step is the identification, through a small number of observations, of the inputs that have a significant impact on \(f\); that is the determination of the shortest design variable vector \(\mathbf{x} = \{x_1, x_2, \ldots, x_k\}^T\) that, by sweeping the ranges of all of its variables, can still elicit most of the behavior the black box is capable of. The ranges of the various design variables also have to be established at this stage.

The second step is to recruit \(n\) of these \(k\)-vectors into a list \[ \mathbf{X} = \{ \mathbf{x}^{(1)},\mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(n)} \}^T, \] where each \(\mathbf{x}^{(i)}\) is a \(k\)-vector. The corresponding responses are collected in a vector such that this represents the design space as thoroughly as possible.

In the surrogate modeling process, the number of samples \(n\) is often limited, as it is constrained by the computational cost (money and/or time) associated with obtaining each observation.

It is advisable to scale \(\mathbf{x}\) at this stage into the unit cube \([0, 1]^k\), a step that can simplify the subsequent mathematics and prevent multidimensional scaling issues.

We now focus on the attempt to learn \(f\) through data pairs \[ \{ (\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)}) \}. \]

This supervised learning process essentially involves searching across the space of possible functions \(\hat{f}\) that would replicate observations of \(f\). This space of functions is infinite. Any number of hypersurfaces could be drawn to pass through or near the known observations, accounting for experimental error. However, most of these would generalize poorly; they would be practically useless at predicting responses at new sites, which is the ultimate goal.

Example 6.20 (The Needle(s) in the Haystack Function) An extreme example is the ‘needle(s) in the haystack’ function:

\[ f(x) = \begin{cases} y^{(1)}, & \text{if } x = \mathbf{x}^{(1)} \\ y^{(2)}, & \text{if } x = \mathbf{x}^{(2)} \\ \vdots & \\ y^{(n)}, & \text{if } x = \mathbf{x}^{(n)} \\ 0, & \text{otherwise.} \end{cases} \]

While this predictor reproduces all training data, it seems counter-intuitive and unsettling to predict 0 everywhere else for most engineering functions. Although there is a small chance that the function genuinely resembles the equation above and we sampled exactly where the needles are, it is highly unlikely.

There are countless other configurations, perhaps less contrived, that still generalize poorly. This suggests a need for systematic means to filter out nonsensical predictors. In our approach, we embed the structure of \(f\) into the model selection algorithm and search over its parameters to fine-tune the approximation to observations. For instance, consider one of the simplest models, \[ f(x, \mathbf{w}) = \mathbf{w}^T\mathbf{x} + v. \tag{6.5}\] Learning \(f\) with this model implies that its structure—a hyperplane—is predetermined, and the fitting process involves finding the \(k + 1\) parameters (the slope vector \(\mathbf{w}\) and the intercept \(v\)) that best fit the data. This will be accomplished in Stage Two.

Complicating this further is the noise present in observed responses (we assume design vectors \(\mathbf{x}\) are not corrupted). Here, we focus on learning from such data, which sometimes risks overfitting.

Definition 6.26 (Overfitting) Overfitting occurs when the model becomes too flexible and captures not only the underlying trend but also the noise in the data.

In the surrogate modeling process, the second stage as described in Section 6.10.2, addresses this issue of complexity control by estimating the parameters of the fixed structure model. However, foresight is necessary even at the model type selection stage.

Model selection often involves physics-based considerations, where the modeling technique is chosen based on expected underlying responses.

Example 6.21 (Model Selection) Modeling stress in an elastically deformed solid due to small strains may justify using a simple linear approximation. Without insights into the physics, and if one fails to account for the simplicity of the data, a more complex and excessively flexible model may be incorrectly chosen. Although parameter estimation might still adjust the approximation to become linear, an opportunity to develop a simpler and robust model may be lost.

  • Simple linear (or polynomial) models, despite their lack of flexibility, have advantages like applicability in further symbolic computations.
  • Conversely, if we incorrectly assume a quadratic process when multiple peaks and troughs exist, the parameter estimation stage will not compensate for an unsuitable model choice. A quadratic model is too rigid to fit a multimodal function, regardless of parameter adjustments.

6.10.2 Stage Two: Parameter Estimation and Training

Assuming that Stage One helped identify the \(k\) critical design variables, acquire the learning data set, and select a generic model structure \(f(\mathbf{x}, \mathbf{w})\), the task now is to estimate parameters \(\mathbf{w}\) to ensure the model fits the data optimally. Among several estimation criteria, we will discuss two methods here.

Definition 6.27 (Maximum Likelihood Estimation) Given a set of parameters \(\mathbf{w}\), the model \(f(\mathbf{x}, \mathbf{w})\) allows computation of the probability of the data set \[ \{(\mathbf{x}^{(1)}, y^{(1)} \pm \epsilon), (\mathbf{x}^{(2)}, y^{(2)} \pm \epsilon), \ldots, (\mathbf{x}^{(n)}, y^{(n)} \pm \epsilon)\} \] resulting from \(f\) (where \(\epsilon\) is a small error margin around each data point).

Taking Equation 6.4 and assuming errors \(\epsilon\) are independently and normally distributed with standard deviation \(\sigma\), the probability of the data set is given by:

\[ P = \frac{1}{(2\pi \sigma^2)^{n/2}} \exp \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}, \mathbf{w}) \right)^2 \epsilon \right]. \]

Intuitively, this is equivalent to the likelihood of the parameters given the data. Accepting this intuitive relationship as a mathematical one aids in model parameter estimation. This is achieved by maximizing the likelihood or, more conveniently, minimizing the negative of its natural logarithm:

\[ \min_{\mathbf{w}} \sum_{i=1}^{n} \frac{[y^{(i)} - f(\mathbf{x}^{(i)}, \mathbf{w})]^2}{2\sigma^2} + \frac{n}{2} \ln \epsilon . \tag{6.6}\]

If we assume \(\sigma\) and \(\epsilon\) are constants, Equation 6.6 simplifies to the well-known least squares criterion:

\[ \min_{\mathbf{w}} \sum_{i=1}^{n} [y^{(i)} - f(\mathbf{x}^{(i)}, \mathbf{w})]^2 . \]

Cross-validation is another method used to estimate model performance.

Definition 6.28 (Cross-Validation) Cross-validation splits the data randomly into \(q\) roughly equal subsets, and then cyclically removing each subset and fitting the model to the remaining \(q - 1\) subsets. A loss function \(L\) is then computed to measure the error between the predictor and the withheld subset for each iteration, with contributions summed over all \(q\) iterations. More formally, if a mapping \(\theta: \{1, \ldots, n\} \to \{1, \ldots, q\}\) describes the allocation of the \(n\) training points to one of the \(q\) subsets and \(f^{(-\theta(i))}(\mathbf{x})\) is the predicted value by removing the subset \(\theta(i)\) (i.e., the subset where observation \(i\) belongs), the cross-validation measure, used as an estimate of prediction error, is:

\[ CV = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f^{(-\theta(i))}(\mathbf{x}^{(i)})) . \tag{6.7}\]

Introducing the squared error as the loss function and considering our generic model \(f\) still dependent on undetermined parameters, we write Equation 6.7 as:

\[ CV = \frac{1}{n} \sum_{i=1}^{n} [y^{(i)} - f^{(-\theta(i))}(\mathbf{x}^{(i)})]^2 . \tag{6.8}\]

The extent to which Equation 6.8 is an unbiased estimator of true risk depends on \(q\). It is shown that if \(q = n\), the leave-one-out cross-validation (LOOCV) measure is almost unbiased. However, LOOCV can have high variance because subsets are very similar. Hastie, Tibshirani, and Friedman (2017)) suggest using compromise values like \(q = 5\) or \(q = 10\). Using fewer subsets also reduces the computational cost of the cross-validation process, see also Arlot, Celisse, et al. (2010) and Kohavi (1995).

6.10.3 Stage Three: Model Testing

If there is a sufficient amount of observational data, a random subset should be set aside initially for model testing. Hastie, Tibshirani, and Friedman (2017) recommend setting aside approximately \(0.25n\) of \(\mathbf{x} \rightarrow y\) pairs for testing purposes. These observations must remain untouched during Stages One and Two, as their sole purpose is to evaluate the testing error—the difference between true and approximated function values at the test sites—once the model has been built. Interestingly, if the main goal is to construct an initial surrogate for seeding a global refinement criterion-based strategy (as discussed in Section 3.2 in Forrester, Sóbester, and Keane (2008)), the model testing phase might be skipped.

It is noted that, ideally, parameter estimation (Stage Two) should also rely on a separate subset. However, observational data is rarely abundant enough to afford this luxury (if the function is cheap to evaluate and evaluation sites are selectable, a surrogate model might not be necessary).

When data are available for model testing and the primary objective is a globally accurate model, using either a root mean square error (RMSE) metric or the correlation coefficient (\(r^2\)) is recommended. To test the model, a test data set of size \(n_t\) is used alongside predictions at the corresponding locations to calculate these metrics.

The RMSE is defined as follows:

Definition 6.29 (Root Mean Square Error (RMSE)) \[ \text{RMSE} = \sqrt{\frac{1}{n_t} \sum_{i=1}^{n_t} (y^{(i)} - \hat{y}^{(i)})^2}, \]

Ideally, the RMSE should be minimized, acknowledging its limitation by errors in the objective function \(f\) calculation. If the error level is known, like a standard deviation, the aim might be to achieve an RMSE within this value. Often, the target is an RMSE within a specific percentage of the observed data’s objective value range.

The squared correlation coefficient \(r\), see Equation 6.2, between the observed \(y\) and predicted \(\hat{y}\) values can be computed as:

\[ r^2 = \left( \frac{\text{cov}(y, \hat{y})}{\sqrt{\text{var}(y)\text{var}(\hat{y})}} \right)^2, \tag{6.9}\]

Equation 6.9 and can be expanded as:

\[ r^2 = \left( \frac{n_t \sum_{i=1}^{n_t} y^{(i)} \hat{y}^{(i)} - \sum_{i=1}^{n_t} y^{(i)} \sum_{i=1}^{n_t} \hat{y}^{(i)}}{ \sqrt{\left( n_t \sum_{i=1}^{n_t} (y^{(i)})^2 - \left(\sum_{i=1}^{n_t} y^{(i)}\right)^2 \right) \left( n_t \sum_{i=1}^{n_t} (\hat{y}^{(i)})^2 - \left(\sum_{i=1}^{n_t} \hat{y}^{(i)}\right)^2 \right)}} \right)^2. \]

The correlation coefficient \(r^2\) does not require scaling the data sets and only compares landscape shapes, not values. An \(r^2 > 0.8\) typically indicates a surrogate with good predictive capability.

The methods outlined provide quantitative assessments of model accuracy, yet visual evaluations can also be insightful. In general, the RMSE will not reach zero but will stabilize around a low value. At this point, the surrogate model is saturated with data, and further additions do not enhance the model globally (though local improvements can occur at newly added points if using an interpolating model).

Example 6.22 (The Tea and Sugar Analogy) Forrester, Sóbester, and Keane (2008) illustrates this saturation point using a comparision with a cup of tea and sugar. The tea represents the surrogate model, and sugar represents data. Initially, the tea is unsweetened, and adding sugar increases its sweetness. Eventually, a saturation point is reached where no more sugar dissolves, and the tea cannot get any sweeter. Similarly, a more flexible model, like one with additional parameters or employing interpolation rather than regression, can increase the saturation point—akin to making a hotter cup of tea for dissolving more sugar.

6.11 Sampling Plans

Definition 6.30 (Sampling Plan) In the context of computer experiments, the term “sampling plan” refers to the set of input values at which the computer code is evaluated.

The goal of a sampling plan is to efficiently explore the input space to understand the behavior of the computer code and build a surrogate model that accurately represents the code’s behavior. Traditionally, Response Surface Methodology (RSM) has been used to design sampling plans for computer experiments. These sampling plans are based on procedures that generate points by means of a rectangiular grid or a factorial design. However, more recently, Design and Analysis of Computer Experiments (DACE) has emerged as a more flexible and powerful approach for designing sampling plans. spotpython uses a class for generating space-filling designs using Latin Hypercube Sampling (LHS) and maximin distance criteria. It is based on scipy’s LatinHypercube class. The following example demonstrates how to generate a Latin Hypercube Sampling design using spotpython. The result is shown in Figure 6.5. As can seen in the figure, a Latin hypercube sample generates \(n\) points in \([0,1)^{d}\). Each univariate marginal distribution is stratified, placing exactly one point in \([j/n, (j+1)/n)\) for \(j=0,1,...,n-1\).

import matplotlib.pyplot as plt
import numpy as np
from spotpython.design.spacefilling import SpaceFilling
lhd = SpaceFilling(k=2, seed=123)
X = lhd.scipy_lhd(n=10, repeats=1, lower=np.array([0, 0]), upper=np.array([10, 10]))
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel('x1')
plt.ylabel('x2')
plt.grid()
Figure 6.5: Latin Hypercube Sampling design (sampling plan)

6.12 Kriging

6.12.1 The Kriging Idea in a Nutshell

Kriging can be applied to planned experiments, where the design is based on a sampling plan as shown in Figure 6.5, as well as to computer experiments, where the design is based on the computer code’s input space, as shown in Figure 6.6.

Figure 6.6: Eight measurements of an unknown function. No sampling plan was used.

Kriging can be explained using the concept of radial basis functions.

Radial basis functions (RBFs) are a class of functions used in various types of interpolation and approximation tasks. An RBF is a real-valued function whose value depends only on the distance from a certain point, called the center, usually in a multidimensional space. This distance is typically measured using the Euclidean distance.

6.12.2 Radial Basis Function (RBF)

Mathematically, a radial basis function \(\phi\) can be expressed as:

\[ \phi(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{c}\|), \] where \(\mathbf{x}\) is the input vector, \(\mathbf{c}\) is the center of the function, and \(\|\mathbf{x} - \mathbf{c}\|\) denotes the Euclidean distance between \(\mathbf{x}\) and \(\mathbf{c}\).

Common types of radial basis functions include:

  • Linear: \(\phi(r) = r\).
  • Cubic: \(\phi(r) = r^3\),
  • Gaussian: \(\phi(r) = e^{-(\epsilon r)^2}\),

where \(r\) is the distance and \(\epsilon\) is a parameter that determines the width of the Gaussian function.

Radial basis functions are widely used in interpolation for scattered data points in multiple dimensions, in machine learning models such as radial basis function networks, and in numerical solutions to partial differential equations due to their smooth approximation properties.

In general, we onsider observed data of an unknown function \(f\) at \(n\) points \(x_1, \ldots, x_n\). These measurements a considered as realizations of MVN random variables \(Y_1, \ldots, Y_n\) with mean \(\mu\) and covariance matrix \(\Sigma_n\) as shown in Figure 6.12, Figure 6.13 or Figure 6.14.

In Kriging, a more general covariance matrix (or equivalently, a correlation matrix \(\Psi\)) is used, see Equation 6.10. Using a maximum likelihood approach, we can estimate the unknown parameters \(\mu\) and \(\Sigma_n\) from the data so that the likelihood function is maximized.

Definition 6.31 (The Kriging Basis Functions) Kriging uses \(k\)-dimensional basis functions of the form \[ \psi(\vec{x}^{(i)}, \vec{x}^{(j)}) = \exp \left( - \sum_{l=1}^k \theta_l | x_{l}^{(i)} - x_{l}^{(j)} | ^{p_l} \right), \tag{6.10}\] where \(\vec{x}^{(i)}\) denotes the \(k\)-dim vector \(\vec{x}^{(i)}= (x_1^{(i)}, \ldots, x_k^{(i)})^T\).

6.12.3 The Kriging Model

Consider sample data \(\vec{X}\) and \(\vec{y}\) from \(n\) locations that are available in matrix form: \(\vec{X}\) is a \((n \times k)\) matrix, where \(k\) denotes the problem dimension and \(\vec{y}\) is a \((n\times 1)\) vector. We want to find an expression for a predicted values at a new point \(\vec{x}\), denoted as \(\hat{y}\).

We start with an abstract, not really intuitive concept: The observed responses \(\vec{y}\) are considered as if they are from a stochastic process, which will be denoted as \[ \begin{pmatrix} \vec{Y}(\vec{x}^{(1)})\\ \vdots\\ \vec{Y}(\vec{x}^{(n)})\\ \end{pmatrix}. \tag{6.11}\]

The set of random vectors from Equation 6.11 (also referred to as a random field) has a mean of \(\vec{1} \mu\), which is a \((n\times 1)\) vector. The random vectors are correlated with each other using the basis function expression from Equation 6.10: \[ \text{cor} \left(\vec{Y}(\vec{x}^{(i)}),\vec{Y}(\vec{x}^{(l)}) \right) = \exp\left\{ - \sum_{j=1}^k \theta_j |x_j^{(i)} - x_j^{(l)} |^{p_j}\right\}. \tag{6.12}\] Using Equation 6.12, we can compute the \((n \times n)\) correlation matrix \(\vec{\Psi}\) of the observed sample data as shown in Equation 6.13,

\[ \vec{\Psi} = \begin{pmatrix} \text{cor}\left( \vec{Y}(\vec{x}^{(1)}), \vec{Y}(\vec{x}^{(1)}) \right) & \ldots & \text{cor}\left( \vec{Y}(\vec{x}^{(1)}), \vec{Y}(\vec{x}^{(n)}) \right)\\ \vdots & \vdots & \vdots\\ \text{cor}\left( \vec{Y}(\vec{x}^{(n)}), \vec{Y}(\vec{x}^{(1)}) \right)& \ldots & \text{cor}\left( \vec{Y}(\vec{x}^{(n)}), \vec{Y}(\vec{x}^{(n)}) \right) \end{pmatrix}, \tag{6.13}\]

and a covariance matrix as shown in Equation 6.14,

\[ \text{Cov}(\mathbf{Y}, \mathbf{Y} ) = \sigma^2 \mathbf{\Psi}. \tag{6.14}\]

This assumed correlation between the sample data reflects our expectation that an engineering function will behave in a certain way and it will be smoothly and continuous.

We now have a set of \(n\) random variables (\(\mathbf{Y}\)) that are correlated with each other as described in the \((n \times n)\) correlation matrix \(\mathbf{\Psi}\), see Equation 6.13. The correlations depend on the absolute distances in dimension \(j\) between the \(i\)-th and the \(l\)-th sample point \(|x_j^{(i)} - x_j^{(l)}|\) and the corresponding parameters \(p_j\) and \(\theta_j\) for dimension \(j\). The correlation is intuitive, because when

  • two points move close together, then \(|x_j^{(i)} - x_j| \to 0\) and \(\exp \left(-|x_j^{(i)} - x_j|^{p_j} \right) \to 1\) (these points show very close correlation and \(Y(x_j^{(i)}) = Y(x_j)\)).
  • two points move far apart, then \(|x_j^{(i)} - x_j| \to \infty\) and \(\exp \left(-|x_j^{(i)} - x_j|^{p_j} \right) \to 0\) (these points show very low correlation).

Example 6.23 (Correlations for Different \(p_j\)) Three different correlations are shown in Figure 6.7: \(p_j= 0.1, 1, 2\). The smoothness parameter \(p_j\) affects the correlation:

  • With \(p_j=0.1\), there is basicaly no immediate correlation between the points and there is a near discontinuity between the points \(Y(\vec{x}_j^{(i)})\) and \(Y(\vec{x}_j)\).
  • With \(p_j=2\), the correlation is more smooth and we have a continuous gradient through \(x_j^{(i)} - x_j\).

Reducing \(p_j\) increases the rate at which the correlation initially drops with distance. This is shown in Figure 6.7.

Figure 6.7: Correlations with varying \(\theta\). \(\theta\) set to 1/10, 1, and 10.

Example 6.24 (Correlations for Different \(\theta\)) Figure 6.8 visualizes the correlation between two points \(Y(\vec{x}_j^{(i)})\) and \(Y(\vec{x}_j)\) for different values of \(\theta\). The parameter \(\theta\) can be seen as a width parameter:

  • low \(\theta_j\) means that all points will have a high correlation, with \(Y(x_j)\) being similar across the sample.
  • high \(\theta_j\) means that there is a significant difference between the \(Y(x_j)\)’s.
  • \(\theta_j\) is a measure of how active the function we are approximating is.
  • High \(\theta_j\) indicate important parameters, see Figure 6.8.
Figure 6.8: Correlations with varying \(\theta\). \(\theta\) set to 1/10, 1, and 10.

Considering the activity parameter \(\theta\) is useful in high-dimensional problems where it is difficult to visualize the design landscape and the effect of the variable is unknown. By examining the elements of the vector \(\vec{\theta}\), we can identify the most important variables and focus on them. This is a crucial step in the optimization process, as it allows us to reduce the dimensionality of the problem and focus on the most important variables.

Example 6.25 (Example: The Correlation Matrix (Detailed Computation)) Let \(n=4\) and \(k=3\). The sample plan is represented by the following matrix \(X\): \[ X = \begin{pmatrix} x_{11} & x_{12} & x_{13}\\ x_{21} & x_{22} & x_{23}\\ x_{31} & x_{32} & x_{33}\\ x_{41} & x_{42} & x_{43}\\ \end{pmatrix} \]

To compute the elements of the matrix \(\Psi\), the following \(k\) (one for each of the \(k\) dimensions) \((n,n)\)-matrices have to be computed: \[ D_1 = \begin{pmatrix} x_{11} - x_{11} & x_{11} - x_{21} & x_{11} -x_{31} & x_{11} - x_{41} \\ x_{21} - x_{11} & x_{21} - x_{21} & x_{21} -x_{31} & x_{21} - x_{41} \\ x_{31} - x_{11} & x_{31} - x_{21} & x_{31} -x_{31} & x_{31} - x_{41} \\ x_{41} - x_{11} & x_{41} - x_{21} & x_{41} -x_{31} & x_{41} - x_{41} \\ \end{pmatrix} \]

\[ D_2 = \begin{pmatrix} x_{12} - x_{12} & x_{12} - x_{22} & x_{12} -x_{32} & x_{12} - x_{42} \\ x_{22} - x_{12} & x_{22} - x_{22} & x_{22} -x_{32} & x_{22} - x_{42} \\ x_{32} - x_{12} & x_{32} - x_{22} & x_{32} -x_{32} & x_{32} - x_{42} \\ x_{42} - x_{12} & x_{42} - x_{22} & x_{42} -x_{32} & x_{42} - x_{42} \\ \end{pmatrix} \]

\[ D_3 = \begin{pmatrix} x_{13} - x_{13} & x_{13} - x_{23} & x_{13} -x_{33} & x_{13} - x_{43} \\ x_{23} - x_{13} & x_{23} - x_{23} & x_{23} -x_{33} & x_{23} - x_{43} \\ x_{33} - x_{13} & x_{33} - x_{23} & x_{33} -x_{33} & x_{33} - x_{43} \\ x_{43} - x_{13} & x_{43} - x_{23} & x_{43} -x_{33} & x_{43} - x_{43} \\\end{pmatrix} \]

Since the matrices are symmetric and the main diagonals are zero, it is sufficient to compute the following matrices: \[ D_1 = \begin{pmatrix} 0 & x_{11} - x_{21} & x_{11} -x_{31} & x_{11} - x_{41} \\ 0 & 0 & x_{21} -x_{31} & x_{21} - x_{41} \\ 0 & 0 & 0 & x_{31} - x_{41} \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \] \[ D_2 = \begin{pmatrix} 0 & x_{12} - x_{22} & x_{12} -x_{32} & x_{12} - x_{42} \\ 0 & 0 & x_{22} -x_{32} & x_{22} - x_{42} \\ 0 & 0 & 0 & x_{32} - x_{42} \\ 0 & 0 & 0 & 0 \\ \end{pmatrix} \]

\[ D_3 = \begin{pmatrix} 0 & x_{13} - x_{23} & x_{13} -x_{33} & x_{13} - x_{43} \\ 0 & 0 & x_{23} -x_{33} & x_{23} - x_{43} \\ 0 & 0 & 0 & x_{33} - x_{43} \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

We will consider \(p_l=2\). The differences will be squared and multiplied by \(\theta_i\), i.e.:

\[ D_1 = \theta_1 \begin{pmatrix} 0 & (x_{11} - x_{21})^2 & (x_{11} -x_{31})^2 & (x_{11} - x_{41})^2 \\ 0 & 0 & (x_{21} -x_{31})^2 & (x_{21} - x_{41})^2 \\ 0 & 0 & 0 & (x_{31} - x_{41})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

\[ D_2 = \theta_2 \begin{pmatrix} 0 & (x_{12} - x_{22})^2 & (x_{12} -x_{32})^2 & (x_{12} - x_{42})^2 \\ 0 & 0 & (x_{22} -x_{32})^2 & (x_{22} - x_{42})^2 \\ 0 & 0 & 0 & (x_{32} - x_{42})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

\[ D_3 = \theta_3 \begin{pmatrix} 0 & (x_{13} - x_{23})^2 & (x_{13} -x_{33})^2 & (x_{13} - x_{43})^2 \\ 0 & 0 & (x_{23} -x_{33})^2 & (x_{23} - x_{43})^2 \\ 0 & 0 & 0 & (x_{33} - x_{43})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

The sum of the three matrices \(D=D_1+ D_2 + D_3\) will be calculated next:

\[ \begin{pmatrix} 0 & \theta_1 (x_{11} - x_{21})^2 + \theta_2 (x_{12} - x_{22})^2 + \theta_3 (x_{13} - x_{23})^2 & \theta_1 (x_{11} -x_{31})^2 + \theta_2 (x_{12} -x_{32})^2 + \theta_3 (x_{13} -x_{33})^2 & \theta_1 (x_{11} - x_{41})^2 + \theta_2 (x_{12} - x_{42})^2 + \theta_3 (x_{13} - x_{43})^2 \\ 0 & 0 & \theta_1 (x_{21} -x_{31})^2 + \theta_2 (x_{22} -x_{32})^2 + \theta_3 (x_{23} -x_{33})^2 & \theta_1 x_{21} - x_{41})^2 + \theta_2 (x_{22} - x_{42})^2 + \theta_3 (x_{23} - x_{43})^2 \\ 0 & 0 & 0 & \theta_1 (x_{31} - x_{41})^2 + \theta_2 (x_{32} - x_{42})^2 + \theta_3 (x_{33} - x_{43})^2 \\ 0 & 0 & 0 & 0 \\\end{pmatrix} \]

Finally, \[ \Psi = \exp(-D)\] is computed.

Next, we will demonstrate how this computation can be implemented in Python. We will consider four points in three dimensions and compute the correlation matrix \(\Psi\) using the basis function from Equation 6.10. These points are placed at the origin, at the unit vectors, and at the points \((100, 100, 100)\) and \((101, 100, 100)\). So, they form two clusters: one at the origin and one at \((100, 100, 100)\).

from numpy import (array, zeros, power, ones, exp, multiply,
                    eye, linspace, mat, spacing, sqrt, arange,
                    append, ravel)
from numpy.linalg import cholesky, solve
theta = np.array([1,2,3])
X = np.array([ [1,0,0], [0,1,0], [100, 100, 100], [101, 100, 100]])
X
array([[  1,   0,   0],
       [  0,   1,   0],
       [100, 100, 100],
       [101, 100, 100]])
def build_Psi(X, theta):
    n = X.shape[0]
    k = X.shape[1]
    D = zeros((k, n, n))
    for l in range(k):
        for i in range(n):
            for j in range(i, n):
                D[l, i, j] = theta[l]*(X[i,l] - X[j,l])**2
    D = sum(D)
    D = D + D.T
    return exp(-D)  
Psi = build_Psi(X, theta)
Psi
array([[1.        , 0.04978707, 0.        , 0.        ],
       [0.04978707, 1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.36787944],
       [0.        , 0.        , 0.36787944, 1.        ]])

Example 6.26 (Example: The Correlation Matrix (Using Existing Functions)) The same result as computed in the previous example can be obtained with existing python functions, e.g., from the package scipy.

from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist

def build_Psi(X, theta, eps=sqrt(spacing(1))):
    return exp(- squareform(pdist(X,
                            metric='sqeuclidean',
                            out=None,
                            w=theta))) +  multiply(eye(X.shape[0]),
                                                   eps)

Psi = build_Psi(X, theta, eps=.0)
Psi
array([[1.        , 0.04978707, 0.        , 0.        ],
       [0.04978707, 1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.36787944],
       [0.        , 0.        , 0.36787944, 1.        ]])

6.12.4 The Condition Number

A small value, eps, can be passed to the function build_Psi to improve the condition number. For example, eps=sqrt(spacing(1)) can be used. The numpy function spacing() returns the distance between a number and its nearest adjacent number.

The condition number of a matrix is a measure of its sensitivity to small changes in its elements. It is used to estimate how much the output of a function will change if the input is slightly altered.

A matrix with a low condition number is well-conditioned, which means its behavior is relatively stable, while a matrix with a high condition number is ill-conditioned, meaning its behavior is unstable with respect to numerical precision.

import numpy as np

# Define a well-conditioned matrix (low condition number)
A = np.array([[1, 0.1], [0.1, 1]])
print("Condition number of A: ", np.linalg.cond(A))

# Define an ill-conditioned matrix (high condition number)
B = np.array([[1, 0.99999999], [0.99999999, 1]])
print("Condition number of B: ", np.linalg.cond(B))
Condition number of A:  1.2222222222222225
Condition number of B:  200000000.53159264
np.linalg.cond(Psi)
2.163953413738652

6.12.5 MLE to estimate \(\theta\) and \(p\)

Until now, the observed data \(\vec{y}\) was not used. We know what the correlations mean, but how do we estimate the values of \(\theta_j\) and where does our observed data \(y\) come in? To estimate the values of \(\vec{\theta}\) and \(\vec{p}\), they are chosen to maximize the likelihood of \(\vec{y}\), which can be expressed in terms of the sample data \[L\left(\vec{Y}(\vec{x}^{(1)}), \ldots, \vec{Y}(\vec{x}^{(n)}) | \mu, \sigma \right) = \frac{1}{(2\pi \sigma)^{n/2} |\vec{\Psi}|^{1/2}} \exp\left\{ \frac{-(\vec{y} - \vec{1}\mu)^T \vec{\Psi}^{-1}(\vec{y} - \vec{1}\mu) }{2 \sigma^2}\right\},\] and formulated as the log-likelihood: \[ \ln(L) = - \frac{n}{2} \ln(2\pi \sigma) - \frac{1}{2} \ln |\vec{\Psi}| \frac{-(\vec{y} - \vec{1}\mu)^T \vec{\Psi}^{-1}(\vec{y} - \vec{1}\mu) }{2 \sigma^2}. \tag{6.15}\]

Optimization of the log-likelihood by taking derivatives with respect to \(\mu\) and \(\sigma\) results in \[ \hat{\mu} = \frac{\vec{1}^T \vec{\Psi}^{-1} \vec{y}^T}{\vec{1}^T \vec{\Psi}^{-1} \vec{1}^T} \tag{6.16}\] and \[ \hat{\sigma}^2 = \frac{(\vec{y} - \vec{1}\mu)^T \vec{\Psi}^{-1}(\vec{y} - \vec{1}\mu)}{n}. \tag{6.17}\]

Combining the equations, i.e., substituting Equation 6.16 and Equation 6.17 inti Equation 6.15 leads to the concentrated log-likelihood function: \[ \ln(L) = - \frac{n}{2} \ln(\hat{\sigma}) - \frac{1}{2} \ln |\vec{\Psi}|. \tag{6.18}\]

Remark 6.1 (The Concentrated Log-Likelihood).

  • The first term in Equation 6.18 requires information about the measured point (observations) \(y_i\).
  • To maximize \(\ln(L)\), optimal values of \(\vec{\theta}\) and \(\vec{p}\) are determined numerically, because the function (Equation 6.18) is not differentiable.

The concentrated log-likelihood function is very quick to compute. We do not need a statistical model, because we are only interested in the maximum likelihood estimate (MLE) of \(\theta\) and \(p\). Optimizers such as Nelder-Mead, Conjugate Gradient, or Simulated Annealing can be used to determine optimal values for \(\theta\) and \(p\). After the optimization, the correlation matrix \(\Psi\) is build with the optimized \(\theta\) and \(p\) values. This is best (most likely) Kriging model for the given data \(y\).

Observing Figure 6.8, there’s significant change between \(\theta = 0.1\) and \(\theta = 1\), just as there is between \(\theta = 1\) and \(\theta = 10\). Hence, it is sensible to search for \(\theta\) on a logarithmic scale. Suitable search bounds typically range from \(10^{-3}\) to \(10^2\), although this is not a stringent requirement. Importantly, the scaling of the observed data does not affect the values of \(\hat{\theta}\), but the scaling of the design space does. Therefore, it is advisable to consistently scale variable ranges between zero and one to ensure consistency in the degree of activity \(\hat{\theta}_j\) represents across different problems.

Optimizing \(\hat{\phi}\) can enhance prediction accuracy across various problems.

6.12.6 Implementing an MLE of the Model Parameters

The matrix algebra necessary for calculating the likelihood is the most computationally intensive aspect of the Kriging process. It’s crucial to ensure that the code implementation is as efficient as possible.

Given that \(\Psi\) (our correlation matrix) is symmetric, only half of the matrix needs to be computed before adding it to its transpose. When calculating the log-likelihood, several matrix inversions are required. The fastest approach is to conduct one Cholesky factorization and then apply backward and forward substitution for each inverse.

The Cholesky factorization is applicable only to positive-definite matrices, which \(\Psi\) generally is. However, if \(\Psi\) becomes nearly singular, such as when the \(\mathbf{x}^{(i)}\)’s are densely packed, the Cholesky factorization might fail. In these cases, one could employ an LU-decomposition, though the result might be unreliable. When \(\Psi\) is near singular, the best course of action is to either use regression techniques or, as we do here, assign a poor likelihood value to parameters generating the near singular matrix, thus diverting the MLE search towards better-conditioned \(\Psi\) matrices.

Another consideration in calculating the concentrated log-likelihood is that \(\det(\Psi) \rightarrow 0\) for poorly conditioned matrices, so it is advisable to use twice the sum of the logarithms of the diagonal of the Cholesky factorization when calculating \(\ln(\lvert\Psi\rvert)\) in Equation 6.18.

6.12.7 Kriging Prediction

We will use the Kriging correlation \(\Psi\) to predict new values based on the observed data. The matrix algebra involved for calculating the likelihood is the most computationally intensive part of the Kriging process. Care must be taken that the computer code is as efficient as possible.

Basic elements of the Kriging based surrogate optimization such as interpolation, expected improvement, and regression are presented. The presentation follows the approach described in Forrester, Sóbester, and Keane (2008) and Bartz et al. (2022).

Main idea for prediction is that the new \(Y(\vec{x})\) should be consistent with the old sample data \(X\). For a new prediction \(\hat{y}\) at \(\vec{x}\), the value of \(\hat{y}\) is chosen so that it maximizes the likelihood of the sample data \(\vec{X}\) and the prediction, given the (optimized) correlation parameter \(\vec{\theta}\) and \(\vec{p}\) from above. The observed data \(\vec{y}\) is augmented with the new prediction \(\hat{y}\) which results in the augmented vector \(\vec{\tilde{y}} = ( \vec{y}^T, \hat{y})^T\). A vector of correlations between the observed data and the new prediction is defined as

\[ \vec{\psi} = \begin{pmatrix} \text{cor}\left( \vec{Y}(\vec{x}^{(1)}), \vec{Y}(\vec{x}) \right) \\ \vdots \\ \text{cor}\left( \vec{Y}(\vec{x}^{(n)}), \vec{Y}(\vec{x}) \right) \end{pmatrix} = \begin{pmatrix} \vec{\psi}^{(1)}\\ \vdots\\ \vec{\psi}^{(n)} \end{pmatrix}. \]

Definition 6.32 (The Augmented Correlation Matrix) The augmented correlation matrix is constructed as \[ \tilde{\vec{\Psi}} = \begin{pmatrix} \vec{\Psi} & \vec{\psi} \\ \vec{\psi}^T & 1 \end{pmatrix}. \]

The log-likelihood of the augmented data is \[ \ln(L) = - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\hat{\sigma}^2) - \frac{1}{2} \ln |\vec{\hat{\Psi}}| - \frac{(\vec{\tilde{y}} - \vec{1}\hat{\mu})^T \vec{\tilde{\Psi}}^{-1}(\vec{\tilde{y}} - \vec{1}\hat{\mu})}{2 \hat{\sigma}^2}, \tag{6.19}\]

where \(\vec{1}\) is a vector of ones and \(\hat{\mu}\) and \(\hat{\sigma}^2\) are the MLEs from Equation 6.16 and Equation 6.17. Only the last term in Equation 6.19 depends on \(\hat{y}\), so we need only consider this term in the maximization. Details cen be found in Forrester, Sóbester, and Keane (2008). Finally, the MLE for \(\hat{y}\) can be calculated as \[ \hat{y}(\vec{x}) = \hat{\mu} + \vec{\psi}^T \vec{\tilde{\Psi}}^{-1} (\vec{y} - \vec{1}\hat{\mu}). \tag{6.20}\]

Equation 6.20 reveals two important properties of the Kriging predictor:

  • Basis functions: The basis function impacts the vector \(\vec{\psi}\), which contains the \(n\) correlations between the new point \(\vec{x}\) and the observed locations. Values from the \(n\) basis functions are added to a mean base term \(\mu\) with weightings \[ \vec{w} = \vec{\tilde{\Psi}}^{(-1)} (\vec{y} - \vec{1}\hat{\mu}). \]
  • Interpolation: The predictions interpolate the sample data. When calculating the prediction at the \(i\)th sample point, \(\vec{x}^{(i)}\), the \(i\)th column of \(\vec{\Psi}^{-1}\) is \(\vec{\psi}\), and \(\vec{\psi} \vec{\Psi}^{-1}\) is the \(i\)th unit vector. Hence,

\[ \hat{y}(\vec{x}^{(i)}) = y^{(i)}. \]

6.13 Kriging Example: Sinusoid Function

Toy example in 1d where the response is a simple sinusoid measured at eight equally spaced \(x\)-locations in the span of a single period of oscillation.

6.13.1 Calculating the Correlation Matrix \(\Psi\)

The correlation matrix \(\Psi\) is based on the pairwise squared distances between the input locations. Here we will use \(n=8\) sample locations and \(\theta\) is set to 1.0.

n = 8
X = np.linspace(0, 2*np.pi, n, endpoint=False).reshape(-1,1)
print(np.round(X, 2))
[[0.  ]
 [0.79]
 [1.57]
 [2.36]
 [3.14]
 [3.93]
 [4.71]
 [5.5 ]]

Evaluate at sample points

y = np.sin(X)
print(np.round(y, 2))
[[ 0.  ]
 [ 0.71]
 [ 1.  ]
 [ 0.71]
 [ 0.  ]
 [-0.71]
 [-1.  ]
 [-0.71]]

We have the data points shown in Table 6.1.

Table 6.1: Data points for the sinusoid function
\(x\) \(y\)
0.0 0.0
0.79 0.71
1.57 1.0
2.36 0.71
3.14 0.0
3.93 -0.71
4.71 -1.0
5.5 -0.71

The data points are visualized in Figure 6.9.

import matplotlib.pyplot as plt
plt.plot(X, y, "bo")
plt.title(f"Sin(x) evaluated at {n} points")
plt.grid()
plt.show()
Figure 6.9: Sin(x) evaluated at 8 points.

6.13.2 Computing the \(\Psi\) Matrix

# theta should be an array (of one value, for the moment, will be changed later)
theta = np.array([1.0])
Psi = build_Psi(X, theta)
print(np.round(Psi, 2))
[[1.   0.54 0.08 0.   0.   0.   0.   0.  ]
 [0.54 1.   0.54 0.08 0.   0.   0.   0.  ]
 [0.08 0.54 1.   0.54 0.08 0.   0.   0.  ]
 [0.   0.08 0.54 1.   0.54 0.08 0.   0.  ]
 [0.   0.   0.08 0.54 1.   0.54 0.08 0.  ]
 [0.   0.   0.   0.08 0.54 1.   0.54 0.08]
 [0.   0.   0.   0.   0.08 0.54 1.   0.54]
 [0.   0.   0.   0.   0.   0.08 0.54 1.  ]]

6.13.3 Selecting the New Locations

We would like to predict at \(m = 100\) new locations (or testign locations) in the interval \([0, 2\pi]\). The new locations are stored in the variable x.

m = 100
x = np.linspace(0, 2*np.pi, m, endpoint=False).reshape(-1,1)

6.13.4 Computing the \(\psi\) Vector

Distances between testing locations \(x\) and training data locations \(X\).

from scipy.spatial.distance import cdist

def build_psi(X, x, theta, eps=sqrt(spacing(1))):
    n = X.shape[0]
    k = X.shape[1]
    m = x.shape[0]
    psi = zeros((n, m))
    theta = theta * ones(k)
    D = zeros((n, m))
    D = cdist(x.reshape(-1, k),
              X.reshape(-1, k),
              metric='sqeuclidean',
              out=None,
              w=theta)    
    psi = exp(-D)
    # return psi transpose to be consistent with the literature
    print(f"Dimensions of psi: {psi.T.shape}")
    return(psi.T)

psi = build_psi(X, x, theta)
Dimensions of psi: (8, 100)

6.13.5 Predicting at New Locations

Computation of the predictive equations.

U = cholesky(Psi).T
one = np.ones(n).reshape(-1,1)
mu = (one.T.dot(solve(U, solve(U.T, y)))) / one.T.dot(solve(U, solve(U.T, one)))
f = mu * ones(m).reshape(-1,1) + psi.T.dot(solve(U, solve(U.T, y - one * mu)))
print(f"Dimensions of f: {f.shape}")
Dimensions of f: (100, 1)

To compute \(f\), Equation 6.20 is used.

6.13.6 Visualization

import matplotlib.pyplot as plt
plt.plot(x, f, color = "orange", label="Fitted")
plt.plot(x, np.sin(x), color = "grey", label="Original")
plt.plot(X, y, "bo", label="Measurements")
plt.title("Kriging prediction of sin(x) with {} points.\n theta: {}".format(n, theta[0]))
plt.legend(loc='upper right')
plt.show()

6.14 Cholesky Decomposition

6.14.1 Example of Cholesky Decomposition

We consider dimension \(k=1\) and \(n=2\) sample points. The sample points are located at \(x_1=1\) and \(x_2=5\). The response values are \(y_1=2\) and \(y_2=10\). The correlation parameter is \(\theta=1\) and \(p\) is set to \(1\). Using Equation 6.10, we can compute the correlation matrix \(\Psi\):

\[ \Psi = \begin{pmatrix} 1 & e^{-1}\\ e^{-1} & 1 \end{pmatrix}. \]

To determine MLE as in Equation 6.20, we need to compute \(\Psi^{-1}\):

\[ \Psi^{-1} = \frac{e}{e^2 -1} \begin{pmatrix} e & -1\\ -1 & e \end{pmatrix}. \]

Cholesky-decomposition of \(\Psi\) is recommended to compute \(\Psi^{-1}\). Cholesky decomposition is a decomposition of a positive definite symmetric matrix into the product of a lower triangular matrix \(L\), a diagonal matrix \(D\) and the transpose of \(L\), which is denoted as \(L^T\). Consider the following example:

\[ LDL^T= \begin{pmatrix} 1 & 0 \\ l_{21} & 1 \end{pmatrix} \begin{pmatrix} d_{11} & 0 \\ 0 & d_{22} \end{pmatrix} \begin{pmatrix} 1 & l_{21} \\ 0 & 1 \end{pmatrix}= \]

\[ \begin{pmatrix} d_{11} & 0 \\ d_{11} l_{21} & d_{22} \end{pmatrix} \begin{pmatrix} 1 & l_{21} \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} d_{11} & d_{11} l_{21} \\ d_{11} l_{21} & d_{11} l_{21}^2 + d_{22} \end{pmatrix}. \tag{6.21}\]

Using Equation 6.21, we can compute the Cholesky decomposition of \(\Psi\):

  1. \(d_{11} = 1\),
  2. \(l_{21}d_{11} = e^{-1} \Rightarrow l_{21} = e^{-1}\), and
  3. \(d_{11} l_{21}^2 + d_{22} = 1 \Rightarrow d_{22} = 1 - e^{-2}\).

The Cholesky decomposition of \(\Psi\) is \[ \Psi = \begin{pmatrix} 1 & 0\\ e^{-1} & 1\\ \end{pmatrix} \begin{pmatrix} 1 & 0\\ 0 & 1 - e^{-2}\\ \end{pmatrix} \begin{pmatrix} 1 & e^{-1}\\ 0 & 1\\ \end{pmatrix} = LDL^T\]

Some programs use \(U\) instead of \(L\). The Cholesky decomposition of \(\Psi\) is \[ \Psi = LDL^T = U^TDU. \]

Using \[ \sqrt{D} =\begin{pmatrix} 1 & 0\\ 0 & \sqrt{1 - e^{-2}}\\ \end{pmatrix}, \] we can write the Cholesky decomposition of \(\Psi\) without a diagonal matrix \(D\) as \[ \Psi = \begin{pmatrix} 1 & 0\\ e^{-1} & \sqrt{1 - e^{-2}}\\ \end{pmatrix} \begin{pmatrix} 1 & e^{-1}\\ 0 & \sqrt{1 - e^{-2}}\\ \end{pmatrix} = U^TU. \]

6.14.2 Inverse Matrix Using Cholesky Decomposition

To compute the inverse of a matrix using the Cholesky decomposition, you can follow these steps:

  1. Decompose the matrix \(A\) into \(L\) and \(L^T\), where \(L\) is a lower triangular matrix and \(L^T\) is the transpose of \(L\).
  2. Compute \(L^{-1}\), the inverse of \(L\).
  3. The inverse of \(A\) is then \((L^{-1})^T L^-1\).

Please note that this method only applies to symmetric, positive-definite matrices.

The inverse of the matrix \(\Psi\) from above is:

\[ \Psi^{-1} = \frac{e}{e^2 -1} \begin{pmatrix} e & -1\\ -1 & e \end{pmatrix}. \]

Here’s an example of how to compute the inverse of a matrix using Cholesky decomposition in Python:

import numpy as np
from scipy.linalg import cholesky, inv
E = np.exp(1)

# Psi is a symmetric, positive-definite matrix 
Psi = np.array([[1, 1/E], [1/E, 1]])
L = cholesky(Psi, lower=True)
L_inv = inv(L)
# The inverse of A is (L^-1)^T * L^-1
Psi_inv = np.dot(L_inv.T, L_inv)

print("Psi:\n", Psi)
print("Psi Inverse:\n", Psi_inv)
Psi:
 [[1.         0.36787944]
 [0.36787944 1.        ]]
Psi Inverse:
 [[ 1.15651764 -0.42545906]
 [-0.42545906  1.15651764]]

6.15 Gaussian Processes—Some Background Information

The concept of GP (Gaussian Process) regression can be understood as a simple extension of linear modeling. It is worth noting that this approach goes by various names and acronyms, including “kriging,” a term derived from geostatistics, as introduced by Matheron in 1963. Additionally, it is referred to as Gaussian spatial modeling or a Gaussian stochastic process, and machine learning (ML) researchers often use the term Gaussian process regression (GPR). In all of these instances, the central focus is on regression. This involves training on both inputs and outputs, with the ultimate objective of making predictions and quantifying uncertainty (referred to as uncertainty quantification or UQ).

However, it’s important to emphasize that GPs are not a universal solution for every problem. Specialized tools may outperform GPs in specific, non-generic contexts, and GPs have their own set of limitations that need to be considered.

6.15.1 Gaussian Process Prior

In the context of GP, any finite collection of realizations, which is represented by \(n\) observations, is modeled as having a multivariate normal (MVN) distribution. The characteristics of these realizations can be fully described by two key parameters:

  1. Their mean, denoted as an \(n\)-vector \(\mu\).
  2. The covariance matrix, denoted as an \(n \times n\) matrix \(\Sigma\). This covariance matrix encapsulates the relationships and variability between the individual realizations within the collection.

6.15.2 Covariance Function

The covariance function is defined by inverse exponentiated squared Euclidean distance: \[ \Sigma(\vec{x}, \vec{x}') = \exp\{ - || \vec{x} - \vec{x}'||^2 \}, \] where \(\vec{x}\) and \(\vec{x}'\) are two points in the \(k\)-dimensional input space and \(\| \cdot \|\) denotes the Euclidean distance, i.e., \[ || \vec{x} - \vec{x}'||^2 = \sum_{i=1}^k (x_i - x_i')^2. \]

An 1-d example is shown in Figure 6.10.

Figure 6.10: One-dim inverse exponentiated squared Euclidean distance

The covariance function is also referred to as the kernel function. The Gaussian kernel uses an additional parameter, \(\sigma^2\), to control the rate of decay. This parameter is referred to as the length scale or the characteristic length scale. The covariance function is then defined as

\[ \Sigma(\vec{x}, \vec{x}') = \exp\{ - || \vec{x} - \vec{x}'||^2 / (2 \sigma^2) \}. \tag{6.22}\]

The covariance decays exponentially fast as \(\vec{x}\) and \(\vec{x}'\) become farther apart. Observe that

\[ \Sigma(\vec{x},\vec{x}) = 1 \] and

\[ \Sigma(\vec{x}, \vec{x}') < 1 \] for \(\vec{x} \neq \vec{x}'\). The function \(\Sigma(\vec{x},\vec{x}')\) must be positive definite.

Remark 6.2 (Kriging and Gaussian Basis Functions). The Kriging basis function (Equation 6.10) is related to the 1-dim Gaussian basis function (Equation 6.22), which is defined as \[ \Sigma(\vec{x}^{(i)}, \vec{x}^{(j)}) = \exp\{ - || \vec{x}^{(i)} - \vec{x}^{(j)}||^2 / (2\sigma^2) \}. \tag{6.23}\]

There are some differences between Gaussian basis functions and Kriging basis functions:

  • Where the Gaussian basis function has \(1/(2\sigma^2)\), the Kriging basis has a vector \(\theta = [\theta_1, \theta_2, \ldots, \theta_k]^T\).
  • The \(\theta\) vector allows the width of the basis function to vary from dimension to dimension.
  • In the Gaussian basis function, the exponent is fixed at 2, Kriging allows this exponent \(p_l\) to vary (typically from 1 to 2).

6.15.2.1 Positive Definiteness

Positive definiteness in the context of the covariance matrix \(\Sigma_n\) is a fundamental requirement. It is determined by evaluating \(\Sigma(x_i, x_j)\) at pairs of \(n\) \(\vec{x}\)-values, denoted as \(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\). The condition for positive definiteness is that for all \(\vec{x}\) vectors that are not equal to zero, the expression \(\vec{x}^\top \Sigma_n \vec{x}\) must be greater than zero. This property is essential when intending to use \(\Sigma_n\) as a covariance matrix in multivariate normal (MVN) analysis. It is analogous to the requirement in univariate Gaussian distributions where the variance parameter, \(\sigma^2\), must be positive.

Gaussian Processes (GPs) can be effectively utilized to generate random data that follows a smooth functional relationship. The process involves the following steps:

  1. Select a set of \(\vec{x}\)-values, denoted as \(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\).
  2. Define the covariance matrix \(\Sigma_n\) by evaluating \(\Sigma_n^{ij} = \Sigma(\vec{x}_i, \vec{x}_j)\) for \(i, j = 1, 2, \ldots, n\).
  3. Generate an \(n\)-variate realization \(Y\) that follows a multivariate normal distribution with a mean of zero and a covariance matrix \(\Sigma_n\), expressed as \(Y \sim \mathcal{N}_n(0, \Sigma_n)\).
  4. Visualize the result by plotting it in the \(x\)-\(y\) plane.

6.15.3 Construction of the Covariance Matrix

Here is an one-dimensional example. The process begins by creating an input grid using \(\vec{x}\)-values. This grid consists of 100 elements, providing the basis for further analysis and visualization.

import numpy as np
n = 100
X = np.linspace(0, 10, n, endpoint=False).reshape(-1,1)

In the context of this discussion, the construction of the covariance matrix, denoted as \(\Sigma_n\), relies on the concept of inverse exponentiated squared Euclidean distances. However, it’s important to note that a modification is introduced later in the process. Specifically, the diagonal of the covariance matrix is augmented with a small value, represented as “eps” or \(\epsilon\).

The reason for this augmentation is that while inverse exponentiated distances theoretically ensure the covariance matrix’s positive definiteness, in practical applications, the matrix can sometimes become numerically ill-conditioned. By adding a small value to the diagonal, such as \(\epsilon\), this ill-conditioning issue is mitigated. In this context, \(\epsilon\) is often referred to as “jitter.”

import numpy as np
from numpy import array, zeros, power, ones, exp, multiply, eye, linspace, mat, spacing, sqrt, arange, append, ravel
from numpy.linalg import cholesky, solve
from numpy.random import multivariate_normal
def build_Sigma(X, sigma2):
    n = X.shape[0]
    k = X.shape[1]
    D = zeros((k, n, n))
    for l in range(k):
        for i in range(n):
            for j in range(i, n):
                D[l, i, j] = 1/(2*sigma2[l])*(X[i,l] - X[j,l])**2
    D = sum(D)
    D = D + D.T
    return exp(-D)  
sigma2 = np.array([1.0])
Sigma = build_Sigma(X, sigma2)
np.round(Sigma[:3,:], 3)
array([[1.   , 0.995, 0.98 , 0.956, 0.923, 0.882, 0.835, 0.783, 0.726,
        0.667, 0.607, 0.546, 0.487, 0.43 , 0.375, 0.325, 0.278, 0.236,
        0.198, 0.164, 0.135, 0.11 , 0.089, 0.071, 0.056, 0.044, 0.034,
        0.026, 0.02 , 0.015, 0.011, 0.008, 0.006, 0.004, 0.003, 0.002,
        0.002, 0.001, 0.001, 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   ],
       [0.995, 1.   , 0.995, 0.98 , 0.956, 0.923, 0.882, 0.835, 0.783,
        0.726, 0.667, 0.607, 0.546, 0.487, 0.43 , 0.375, 0.325, 0.278,
        0.236, 0.198, 0.164, 0.135, 0.11 , 0.089, 0.071, 0.056, 0.044,
        0.034, 0.026, 0.02 , 0.015, 0.011, 0.008, 0.006, 0.004, 0.003,
        0.002, 0.002, 0.001, 0.001, 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   ],
       [0.98 , 0.995, 1.   , 0.995, 0.98 , 0.956, 0.923, 0.882, 0.835,
        0.783, 0.726, 0.667, 0.607, 0.546, 0.487, 0.43 , 0.375, 0.325,
        0.278, 0.236, 0.198, 0.164, 0.135, 0.11 , 0.089, 0.071, 0.056,
        0.044, 0.034, 0.026, 0.02 , 0.015, 0.011, 0.008, 0.006, 0.004,
        0.003, 0.002, 0.002, 0.001, 0.001, 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
        0.   ]])
import matplotlib.pyplot as plt
plt.imshow(Sigma, cmap='hot', interpolation='nearest')
plt.colorbar()
plt.show()

6.15.4 Generation of Random Samples and Plotting the Realizations of the Random Function

In the context of the multivariate normal distribution, the next step is to utilize the previously constructed covariance matrix denoted as Sigma. It is used as an essential component in generating random samples from the multivariate normal distribution.

The function multivariate_normal is employed for this purpose. It serves as a random number generator specifically designed for the multivariate normal distribution. In this case, the mean of the distribution is set equal to mean, and the covariance matrix is provided as Psi. The argument size specifies the number of realizations, which, in this specific scenario, is set to one.

By default, the mean vector is initialized to zero. To match the number of samples, which is equivalent to the number of rows in the X and Sigma matrices, the argument zeros(n) is used, where n represents the number of samples (here taken from the size of the matrix, e.g.,: Sigma.shape[0]).

rng = np.random.default_rng(seed=12345)
Y = rng.multivariate_normal(zeros(Sigma.shape[0]), Sigma, size = 1, check_valid="raise").reshape(-1,1)
Y.shape
(100, 1)

Now we can plot the results, i.e., a finite realization of the random function \(Y()\) under a GP prior with a particular covariance structure. We will plot those X and Y pairs as connected points on an \(x\)-\(y\) plane.

import matplotlib.pyplot as plt
plt.plot(X, Y)
plt.title("Realization of Random Functions under a GP prior.\n sigma2: {}".format(sigma2[0]))
plt.show()
Figure 6.11: Realization of one random function under a GP prior. sigma2: 1.0
rng = np.random.default_rng(seed=12345)
Y = rng.multivariate_normal(zeros(Sigma.shape[0]), Sigma, size = 3, check_valid="raise")
plt.plot(X, Y.T)
plt.title("Realization of Three Random Functions under a GP prior.\n sigma2: {}".format(sigma2[0]))
plt.show()
Figure 6.12: Realization of three random functions under a GP prior. sigma2: 1.0

6.15.5 Properties of the 1d Example

6.15.5.1 Several Bumps:

In this analysis, we observe several bumps in the \(x\)-range of \([0,10]\). These bumps in the function occur because shorter distances exhibit high correlation, while longer distances tend to be essentially uncorrelated. This leads to variations in the function’s behavior:

  • When \(x\) and \(x'\) are one \(\sigma\) unit apart, the correlation is \(\exp\left(-\sigma^2 / (2\sigma^2)\right) = \exp(-1/2) \approx 0.61\), i.e., a relative high correlation.
  • \(2\sigma\) apart means correlation \(\exp(− 2^2 /2) \approx 0.14\), i.e., only small correlation.
  • \(4\sigma\) apart means correlation \(\exp(− 4^2 /2) \approx 0.0003\), i.e., nearly no correlation—variables are considered independent for almost all practical application.

6.15.5.2 Smoothness:

The function plotted in Figure 6.11 represents only a finite realization, which means that we have data for a limited number of pairs, specifically 100 points. These points appear smooth in a tactile sense because they are closely spaced, and the plot function connects the dots with lines to create the appearance of smoothness. The complete surface, which can be conceptually extended to an infinite realization over a compact domain, is exceptionally smooth in a calculus sense due to the covariance function’s property of being infinitely differentiable.

6.15.5.3 Scale of Two:

Regarding the scale of the \(Y\) values, they have a range of approximately \([-2,2]\), with a 95% probability of falling within this range. In standard statistical terms, 95% of the data points typically fall within two standard deviations of the mean, which is a common measure of the spread or range of data.

import numpy as np
from numpy import array, zeros, power, ones, exp, multiply, eye, linspace, mat, spacing, sqrt, arange, append, ravel
from numpy.random import multivariate_normal

def build_Sigma(X, sigma2):
    n = X.shape[0]
    k = X.shape[1]
    D = zeros((k, n, n))
    for l in range(k):
        for i in range(n):
            for j in range(i, n):
                D[l, i, j] = 1/(2*sigma2[l])*(X[i,l] - X[j,l])**2
    D = sum(D)
    D = D + D.T
    return exp(-D)

def plot_mvn( a=0, b=10, sigma2=1.0, size=1, n=100, show=True):    
    X = np.linspace(a, b, n, endpoint=False).reshape(-1,1)
    sigma2 = np.array([sigma2])
    Sigma = build_Sigma(X, sigma2)
    rng = np.random.default_rng(seed=12345)
    Y = rng.multivariate_normal(zeros(Sigma.shape[0]), Sigma, size = size, check_valid="raise")
    plt.plot(X, Y.T)
    plt.title("Realization of Random Functions under a GP prior.\n sigma2: {}".format(sigma2[0]))
    if show:
        plt.show()
plot_mvn(a=0, b=10, sigma2=10.0, size=3, n=250)
Figure 6.13: Realization of Random Functions under a GP prior. sigma2: 10
plot_mvn(a=0, b=10, sigma2=0.1, size=3, n=250)
Figure 6.14: Realization of Random Functions under a GP prior. sigma2: 0.1

6.16 Jupyter Notebook

Note