# Introduction

We consider a general minimization problem:

\[
\min_{\mathbf{x}} \; y(\mathbf{x}).
\]

Bayesian optimization proceeds by constructing a surrogate cost function \(\tilde{y}(\mathbf{x})\). The surrogate is represented as a basis-function expansion around each evaluated point \((\mathbf{x}_i,\ i \in [1, N])\), which ensures that the surrogate passes through (interpolates) the evaluated points. For noisy evaluations, the surrogate instead passes within one standard deviation of the mean at each evaluated point. Analytically, after \(N\) function evaluations the surrogate \(\tilde{y}(\mathbf{x})\) is represented as

\[
\tilde{y}(\mathbf{x}) = \sum_{i=1}^{N} w_i\, k(\mathbf{x}, \mathbf{x}_i),
\]

where \(k(\mathbf{x}, \mathbf{x}_i)\) is a kernel function: it takes two arguments, \(\mathbf{x}\) and \(\mathbf{x}_i\), and returns a scalar expressing how correlated the values of \(y\) at \(\mathbf{x}\) and \(\mathbf{x}_i\) are. The weights \(w_i\) are found by solving a system of \(N\) linear equations in \(w_i\). In matrix notation, with the covariance matrix \(\mathbf{K}\) defined by \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\), this reads

\[
\mathbf{K}\, \bar{w} = y.
\]

Hence the weights are obtained through the inversion \(\bar{w} = \mathbf{K}^{-1}\, y\). Note that the covariance matrix \(\mathbf{K}\) is the Gram matrix of a positive definite kernel function, making it symmetric and positive semi-definite. Furthermore, since each iteration appends only a small number of rows and columns to the covariance matrix, the factorization can be updated efficiently through an incremental Cholesky decomposition. The mean and variance of the surrogate are then calculated as

\[
\mu(\mathbf{x}) = \mathbf{k}^{\top} \mathbf{K}^{-1}\, y,
\qquad
\sigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^{\top} \mathbf{K}^{-1}\, \mathbf{k},
\]

where \(\mathbf{k} = \left[ k(\mathbf{x}, \mathbf{x}_1), \ldots, k(\mathbf{x}, \mathbf{x}_N) \right]^{\top}\) is the vector of covariances between the query point \(\mathbf{x}\) and the evaluated points.
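As a concrete sketch of these expressions (assuming a squared exponential kernel, NumPy, and toy 1-D data; the helper names, points, and length scale are illustrative, not part of any library), the weights and the surrogate mean and variance can be computed as:

```python
import numpy as np

def k(a, b, l=1.0):
    # Squared exponential kernel (illustrative choice of length scale l)
    return np.exp(-np.linalg.norm(a - b)**2 / (2 * l**2))

# Toy evaluated points and objective values
X = np.array([[0.0], [0.5], [1.0]])
y = np.sin(X[:, 0])

# Covariance (Gram) matrix K_ij = k(x_i, x_j), with tiny jitter for stability
K = np.array([[k(a, b) for b in X] for a in X]) + 1e-10 * np.eye(len(X))

# Weights from K w = y via Cholesky (K is symmetric positive definite)
L = np.linalg.cholesky(K)
w = np.linalg.solve(L.T, np.linalg.solve(L, y))
assert np.allclose(K @ w, y)  # the surrogate interpolates the data

def surrogate(x):
    # Posterior mean and variance of the surrogate at a query point x
    k_vec = np.array([k(x, xi) for xi in X])
    mu = k_vec @ np.linalg.solve(K, y)
    var = k(x, x) - k_vec @ np.linalg.solve(K, k_vec)
    return mu, max(var, 0.0)  # clamp round-off negatives

# At an evaluated point the mean matches the data and the variance vanishes
mu, var = surrogate(X[1])
```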
At each iteration, the surrogate is updated with new data from the cost function. The location of the next
evaluation is determined by optimizing an *acquisition function*, which estimates the information to be gained
by evaluating at a candidate location, using the mean and variance calculated in the above steps.
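The loop described above can be sketched end to end. This toy example assumes a 1-D objective, a squared exponential kernel, a lower-confidence-bound acquisition, and a dense grid as a cheap stand-in for a proper inner optimizer; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    # Hypothetical expensive cost function (1-D toy example)
    return (x - 0.3)**2

def kernel(a, b, l=0.2):
    return np.exp(-(a - b)**2 / (2 * l**2))

# Initial evaluations
X = list(rng.uniform(0, 1, size=3))
Y = [objective(x) for x in X]

for _ in range(10):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(len(X)))

    def acquisition(x, kappa=2.0):
        # Lower confidence bound mu - kappa * sigma (for minimization)
        k_vec = np.array([kernel(x, xi) for xi in X])
        mu = k_vec @ K_inv @ np.array(Y)
        var = max(kernel(x, x) - k_vec @ K_inv @ k_vec, 0.0)
        return mu - kappa * np.sqrt(var)

    # Minimize the acquisition on a dense grid, then evaluate there
    grid = np.linspace(0, 1, 201)
    x_next = grid[np.argmin([acquisition(x) for x in grid])]
    X.append(x_next)
    Y.append(objective(x_next))

best = X[int(np.argmin(Y))]
```

In practice the acquisition is optimized with a dedicated inner optimizer, and the Cholesky factor is updated incrementally rather than re-inverting \(\mathbf{K}\) at every iteration.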

Some sample radial kernel functions, written in terms of the distance \(r = ||\mathbf{x}_1 - \mathbf{x}_2||\), include:

- Squared exponential kernel function (infinitely differentiable):

  \[
  k(r) = \exp\left( -\frac{r^2}{2 l^2} \right)
  \]

- Matérn class of kernel functions:

  \[
  k(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{l} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{l} \right),
  \]

  where \(K_\nu\) is the modified Bessel function of the second kind and \(\nu, l\) are positive constants

- Rational quadratic kernel function:

  \[
  k(r) = \left( 1 + \frac{r^2}{2 \alpha l^2} \right)^{-\alpha},
  \]

  with positive constants \(\alpha, l\)
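These kernels are straightforward to implement. The sketch below uses the closed form of the Matérn kernel at \(\nu = 3/2\) so that no Bessel-function evaluation is needed; the default parameter values are illustrative:

```python
import numpy as np

def squared_exponential(r, l=1.0):
    # Infinitely differentiable radial kernel
    return np.exp(-r**2 / (2 * l**2))

def matern_32(r, l=1.0):
    # Matern kernel at nu = 3/2, where the general expression
    # 2^(1-nu)/Gamma(nu) * s^nu * K_nu(s), s = sqrt(2 nu) r / l,
    # reduces to the closed form (1 + s) exp(-s)
    s = np.sqrt(3.0) * r / l
    return (1.0 + s) * np.exp(-s)

def rational_quadratic(r, alpha=1.0, l=1.0):
    return (1.0 + r**2 / (2 * alpha * l**2))**(-alpha)

# All three equal 1 at r = 0 and decay monotonically with distance
for kern in (squared_exponential, matern_32, rational_quadratic):
    assert abs(kern(0.0) - 1.0) < 1e-12
    assert kern(0.5) > kern(1.0) > 0.0
```

For other values of \(\nu\), \(K_\nu\) would be evaluated directly, e.g. via a special-function library.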

Some example acquisition functions, written in terms of the surrogate mean \(\mu\) and standard deviation \(\sigma\) at a candidate point, are:

- Confidence bound (the lower bound, for minimization):

  \[
  \mathrm{CB}(\mathbf{x}) = \mu - \kappa\, \sigma,
  \]

  with a positive trade-off constant \(\kappa\)

- Probability of improvement:

  \[
  \mathrm{PI}(\mathbf{x}) = \mathbf{cdf}(\gamma)
  \]

- Expectation of improvement:

  \[
  \mathrm{EI}(\mathbf{x}) = \sigma \left( \gamma\, \mathbf{cdf}(\gamma) + \mathbf{pdf}(\gamma) \right)
  \]

where \(\gamma = \frac{\mu}{\sigma}\), \(\mathbf{cdf}\) is the cumulative distribution function of the standard normal distribution, and \(\mathbf{pdf}\) is its probability density function.
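With \(\gamma = \mu/\sigma\) as defined above, these can be sketched as plain functions of the surrogate mean and standard deviation (the function names and the \(\kappa\) default are illustrative):

```python
import math

def norm_pdf(z):
    # Standard normal probability density function
    return math.exp(-z**2 / 2) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def confidence_bound(mu, sigma, kappa=2.0):
    # Lower confidence bound for minimization
    return mu - kappa * sigma

def probability_of_improvement(mu, sigma):
    g = mu / sigma
    return norm_cdf(g)

def expected_improvement(mu, sigma):
    g = mu / sigma
    return sigma * (g * norm_cdf(g) + norm_pdf(g))
```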