almosallam_heteroscedastic_2017

Heteroscedastic Gaussian processes for uncertain and incomplete data

Ibrahim Almosallam

PhD Thesis, University of Oxford, https://ora.ox.ac.uk/objects/uuid:6a3b600d-5759-456a-b785-5f89cf4ede6d

Some useful identities:

$p(x) = \int p(x,y)\rm{d}y = \int p(x|y)p(y)\rm{d}y \tag{A.6}$

$p(x) \tag{1}$

$\mathcal{N}(\bm{x}|\bm{a},\bm{A}) \mathcal{N}(\bm{x}|\bm{b},\bm{B}) = \mathcal{N}(\bm{a}|\bm{b},\bm{A+B})\mathcal{N}(\bm{x}|\bm{c},\bm{C})\tag{A.14}$

$\bm{c}=\bm{C}(\bm{A}^{-1}\bm{a}+\bm{B}^{-1}\bm{b}),\quad \bm{C}=(\bm{A}^{-1}+\bm{B}^{-1})^{-1}$

$p(\bm{x}............. \tag{A.11}$

# 2 Gaussian Processes for Regression

## 2.6 Summary

We have shown in this chapter that basis function models are sparse Gaussian processes of type SoR, but with a different prior on the latent variables.

BFM 的 latent variables 的 prior 与 SoR 不同，其它相同。

A main distinction between the way ANNs are traditionally optimised and how BFMs are trained, is that the former performs a regularised maximum likelihood estimation, or maximum a posterior (MAP), while the latter maximises the marginal likelihood. Maximising the marginal likelihood is less prone to overfitting, due to the additional cost term that minimises the variance on the weight vector w, which we have shown is related to the sparsity constraint in sparse auto-encoders. In the next chapter, we show how we can use this view of sparse GPs to extend their modelling capabilities.

？？Maximum likelihood estimation 和 MAP 是同一个概念（核实一下）

## 2.1 Full Gaussian Processes

The dataset is $\mathcal{D}=\{\bm{X},\bm{y}\}$, consisting of input $\bm{X}=\{\bm{x}_i\}_{i=1}^n \in \mathbb{R}^{d\times n}$ and target outputs $\bm{y}=\{y_i\}_{i=1}^n \in \mathbb{R}^n$.

$d$ is the dimensionality of the input $\bm{x}_i$

$n$ is the number of data points

ASSUMPTION: observed target $y_i$ is generated by a function of $\bm{x}_i$ plus additive noise $\varepsilon_i$:

$y_i = f(\bm{x_i}) + \varepsilon_i \tag{2.1}$

$\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\quad \forall i\in[0,n]$

LIKELIHOOD: the probability of observing the targets,

$p(\bm{y} | \bm{f}_\bm{x},\sigma^2) = \mathcal{N}(\bm{y} | \bm{f}_\bm{x}, \sigma^2 I_n) \tag{2.2}$

$\bm{f}_\bm{x} = [f(\bm{x}_1),\cdots,f(\bm{x}_n)]^T$

POSTERIOR (?): Distribution of $\bm{f}_\bm{x}$ given the observations

$p(\bm{f}_\bm{x} | \bm{y},\bm{X},\sigma^2) = \frac{p(\bm{y}|\bm{f}_\bm{x},\sigma^2) \cdot p(\bm{f}_\bm{x}|\bm{X})} {p(\bm{y}|\bm{X},\sigma^2)} \tag{2.3}$

PRIOR: over the space of functions,

$p(\bm{f}_\bm{X}|\bm{X}) = \mathcal{N}(0,\bm{K})$

$\bm{K} = \kappa(\bm{X},\bm{X})$, symmetric and positive semi-definite

$\kappa(\bm{x}_i,\bm{x}_j)$ is the covariance/kernel function, for example, Mercer kernels.

MARGINAL LIKELIHOOD, computed from Eq.(A.6) and (A.14):

$p(\bm{y}|\bm{X},\bm{\theta}) = \int p(\bm{y}|\bm{f}_\bm{x},\bm{\theta}) \cdot p(\bm{f}_\bm{x}|\bm{X},\bm{\theta}) \cdot \rm{d} \bm{f}_\bm{x} = \mathcal{N}(\bm{y}|0,\bm{K}+\sigma^2 I_n) \tag{2.7}$

$\bm{\theta} = \{\bm{\varphi},\sigma^2\}$, hyper-parameters of the GP model

$\bm{\varphi}$, parameters of the kernel

$\sigma^2$, the noise variance

ASSUMPTION: the joint distribution is a multivariate Gaussian,

$p(\bm{y},\bm{f}_*|\bm{X},\bm{X}_*,\bm{\theta}) = \mathcal{N} \left( \left.\left[\begin{matrix}\bm{y}\\\bm{f}_*\end{matrix}\right]\right| \bm{0}, \left[\begin{matrix}\bm{K}_{\bm{xx}}+\sigma^2 I_n & \bm{K}_{\bm{x}*}\\\bm{K}_{*\bm{x}} & \bm{K}_{**}\end{matrix}\right] \right) \tag{2.9}$

CONDITIONAL PROBABILITY of function value $\bm{f}_*$, computed from Eq.(A.11),

$p(\bm{f}_*|\bm{X}_*,\mathcal{D},\bm{\theta}) = \mathcal{N} (\bm{f}_* | \mathbb{E}[\bm{f}_*],\mathbb{V}[\bm{f}_*]) \tag{2.10}$

$\mathbb{E}[\bm{f}_*] = \bm{K}_{*\bm{x}}(\bm{K}_{\bm{xx}}+\sigma^2 I_n)^{-1}\bm{y}$

$\mathbb{V}[\bm{f}_*] = \bm{K}_{**}-\bm{K}_{*\bm{x}}(\bm{K}_\bm{xx}+\sigma^2 I_n)^{-1}\bm{K}_{\bm{x}*}$

PREDICTIVE DISTRIBUTION of future observations $\bm{y}_*$ given the dataset $\mathcal{D}$, computed from Eq.(A.4) and (A.16),

$p(\bm{y}_*|\bm{X}_*,\mathcal{D},\bm{\theta}) = \int p(\bm{y}_*|\bm{f}_*,\bm{\theta}) \cdot p(\bm{f}_*|\bm{X}_*,\mathcal{D},\bm{\theta}) \cdot \rm{d}\bm{f}_* = \mathcal{N}(\bm{y}_* | \mathbb{E}[\bm{y}_*],\mathbb{V}[\bm{y}_*]) \tag{2.13}$

$\mathbb{E}[\bm{y}_*] = \mathbb{E}[\bm{f}_*]$

$\mathbb{V}[\bm{y}_*] = \mathbb{V}[\bm{f}_*] + \sigma^2 I_{n*}$

The hyperparameters of the GP model are optimized by maximizing the log-marginal likelihood, Eq.(2.7):

$\ln p(\bm{y}|\bm{X},\bm{\theta}) = -\frac{1}{2} \bm{y}^T(\bm{K}+\sigma^2 I_n)^{-1}\bm{y} - \frac{1}{2} \ln |\bm{K}+\sigma^2 I_n| - \frac{n}{2}\ln(2\pi) \tag{2.17}$

Almost the same description and notations to the GPML book (Rasmussen and Nickisch, 2010).

## 2.2 Numerical Approximation

### Low Rank Approximation

$\Sigma_p = \sigma^{-2}K_{px}K_{xp} + K_{pp} \tag{2.23}$

## 2.3 Sparse Gaussian Processes

### 2.3.1 Subset of Regressor (SoR)

INDUCING / LATENT variables $\bm{f}_p$: $p(\bm{f}_p)=\mathcal{N}(\bm{f}_p|0,K_{pp})$.

JOINT distribution:

$p(\bm{f}_x,\bm{f}_p) = \int p(\bm{f}_x,\bm{f}_*|\bm{f}_p) p(\bm{f}_p) \cdot {\rm d}\bm{f}_p \tag{2.24}$

ASSUMPTION: the conditional distributions of $\bm{f}_x$ and $\bm{f}_{*}$ are independent given $\bm{f}_p$.

$p(\bm{f}_x,\bm{f}_p) \approx q(\bm{f}_x,\bm{f}_p) = \int q(\bm{f}_x|\bm{f}_p) q(\bm{f}_*|\bm{f}_p) p(\bm{f}_p) \cdot {\rm d}\bm{f_p} \tag{2.25}$

The exact expression for the training condition and testing condition can be derived from a noise-free GP.

Under the ASSUMPTION that both the training and the test conditionals are deterministic:

$q(\bm{f}_x|\bm{f}_p)=\mathcal{N}(K_{xp}K_{pp}^{-1}\bm{f}_p,0) \tag{2.28}$

$q(\bm{f}_*|\bm{f}_p)=\mathcal{N}(K_{*p}K_{pp}^{-1}\bm{f}_p,0) \tag{2.29}$

Effective prior:

$q(\bm{f}_x,\bm{f}_*) = \mathcal{N}([f_x;f_*]|0,[Q_{xx},Q_{x*};Q_{*x},Q_{**}]) \tag{2.30}$

where $Q_{ab}=K_{ap}K_{pp}^{-1}K_{pb}$ is the equivalent covariance function. （？？猜测这一步应该是经过了配方法求解Eq.2.25）

$\mathbb{E}(\bm{f}_*) = \sigma^{-2}K_{*p}\Sigma_p^{-1}K_{px}\bm{y} \tag{2.34}$

$\mathbb{V}(\bm{f}_*) = K_{*p}\Sigma_p^{-1}K_{p*} \tag{2.35}$

This is referred to as the subset of regressors (SoR) sparse Gaussian process (Candela and Rasmussen, 2005), which is identical to the low rank approximation method except that the set of inducing points are now optimised as part of the hyperparameter set. Thus, the low rank approximation method is a sparse GP where both the training and test conditionals are deterministic and the set of inducing points are held fixed. (p.20)

### 2.3.2 FTIC (Fully Independent Training Conditional) / SPGP (Sparse Pseudo-input Gaussian Processes)

ASSUMPTION on training and testing condition (recall Eq.2.28 and 2.29):

$q(\bm{f}_x|\bm{f}_p) = \mathcal{N}(K_{xp}K_{pp}^{-1}\bm{f}_p,\text{diag}[K_{xx}-K_{xp}K_{pp}^{-1}K_{px}]) \tag{2.36}$

$q(\bm{f}_*|\bm{f}_p) = p(\bm{f}_*|\bm{f}_p) \tag{2.37}$

$p(\bm{y}|\bm{f}_x) \approx q(\bm{y}|\bm{f}_x) = \mathcal{N}(\bm{f}_x,\Lambda) \tag{2.38}$

where $\Lambda = \text{diag}[K_{xx}-Q_{xx}]+\sigma^2I_n$.

## 2.4 Basic Function Models

20201002 重新梳理思路：

–> 定义 BFM 模型中 w 的 prior $p( w | \alpha )$ (此时 GP 的 prior 已经在 full GP 定义过了)
–> 计算 evidence (marginal likelihood) $p( y | X, \theta )$：把 $p( f | X, \theta, w )$ 中的 $w$ 积分掉；把 $p( y | f_x, X, \theta)$ 中的 $f_x$ 积分掉
–> 用 Bayes’ theorem 计算 $w$ 的 posterior $p(w|y,X,\theta,\alpha)= \frac{p(y|w,X,\theta,\alpha) p(w|\alpha)}{p(y|X,\theta,\alpha)}$ （右边这几项都有了，所以可以算出来）
–> 用 Bayes’ theorem 也可以关于 $w$ 计算 evidence \$p( y | X, \theta )，即拆分 $p( y, w | X, \theta, \alpha )$ 得到 evidence
–> 得到较简单形式的 marginal likelihood function

BFM: linear combination of $m$ non-linear basis functions plus additive noise

$y_i = \bm{\phi}(\bm{x}_i)^T\bm{w} + \varepsilon_i$

$\bm{\phi}(\bm{x}_i) = [\phi_1(\bm{x}_i),\cdots,\phi_m(\bm{x}_i)]^T \in \mathbb{R}^m, \quad m\ll n$

$\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$

LIKELIHOOD: （应该是通过配方法得到的）

$p(\bm{y}|\bm{X},\bm{\theta},\bm{w}) = \int p(\bm{y}|\bm{f}_\bm{x},\bm{\theta}) \cdot p(\bm{f}_\bm{x}|\bm{X},\bm{\theta},\bm{w}) \cdot \rm{d}\bm{f}_\bm{x}$

$= \int \mathcal{N}(\bm{y}|\bm{f}_\bm{x},\sigma^2 I_n) \cdot \mathcal{N}(\bm{f}_\bm{x}|\bm{\Phi}_\bm{x}^T\bm{w},0) \cdot \rm{d}\bm{f}_\bm{x}$

$= \mathcal{N}(\bm{y}|\bm{\Phi}_\bm{x}^T\bm{w},\sigma^2 I_n) \tag{2.47}$

$\bm{\Phi}_\bm{x} = \left[ \begin{matrix} \phi_1(\bm{x}_1) & \cdots & \phi_1(\bm{x}_n) \\ \vdots & \ddots & \vdots \\ \phi_m(\bm{x}_1) & \cdots & \phi_m(\bm{x}_n) \end{matrix} \right]$

PRIOR: (assumed to be smooth)

$p(\bm{w}|\alpha) = \mathcal{N}(\bm{w}|0,\bm{A}^{-1})$

$\bm{A}=\alpha I_m$

$p(\bm{f}_x|X,\bm{\theta}) = \int p(\bm{f}_x|X,\bm{\theta},\bm{w}) p(\bm{w}|\alpha) \cdot {\rm d}\bm{w} \tag{2.51.改}$

$= \mathcal{N}(\bm{f}_x|0,\Phi_x^TA^{-1}\Phi_x) \tag{2.54.改}$

MARGINAL LIKELIHOOD (from Wiki: In the context of Bayesian statistics, it may also be referred to as the evidence or model evidence.):

$p(\bm{y}|X,\bm{\theta}) = \int p(\bm{y}|\bm{f}_x,\bm{\theta})p(\bm{f}_x|X,\bm{\theta}) \cdot {\rm d}\bm{f}_x \tag{2.55}$

$= \mathcal{N}(\bm{y}|0,\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n) \tag{2.56}$

[This marginal likelihood function should be maximized in order to get all hyperparameters.]

[另一种推导 marginal likelihood 表达式的方法。没有看明白。感觉有循环推导的嫌疑。]

[从另一个思路来直接简化 Eq.2.56，通过代入一个特殊的 $\bm{w}$。]

POSTERIOR distribution of $\bm{w}$: (using $\beta=\sigma^{-2}$)

$p(\bm{w}|\bm{y},X,\bm{\theta}) = \frac{p(\bm{y}|X,\bm{\theta},\bm{w})p(\bm{w}|\alpha)}{p(\bm{y}|X,\bm{\theta})} = \mathcal{N}(\bm{w}|\bm{\bar{w}},\Sigma_w^{-1}) \tag{2.57/58}$

$\bm{\bar{w}} = \beta\Sigma_w^{-1}\Phi_x\bm{y} \tag{2.59}$

$\Sigma_w = \beta\Phi_x\Phi_x^T+A \tag{2.60}$

MARGINAL LIKELIHOOD:

$p(\bm{y}|X,\bm{\theta}) = \frac{p(\bm{y}|X,\bm{\theta},\bm{w})p(\bm{w}|\alpha)}{p(\bm{w}|\bm{y},X,\bm{\theta})} \tag{2.61}$

$= \frac{\mathcal{N}(\bm{y}|\Phi_x^T\bm{w},\beta^{-1}I_n)\mathcal{N}(\bm{w}|0,A^{-1})}{\mathcal{N}(\bm{w}|\bm{\bar{w}},\Sigma_w^{-1})} \tag{2.62}$

Substituting $\bm{w} = \bm{\bar{w}}$ [实际上是取了最大后验分布的 $\bm{w}$，参见 bishop-pattern-2006 中 Eq.3.49 附近的的讨论。这部分是否与 $\bm{w}_\text{MAP}$ 有关系？需要继续研究。] [对任意的 $\bm{w}$ 成立，取特值达到化简的目的], get

$p(\bm{y}|X,\bm{\theta}) = \mathcal{N}(\bm{y}|\Phi_x^T\bar{\bm{w}},\beta^{-1}I_n) \mathcal{N}(\bm{\bar{w}}|0,A^{-1})(2\pi)^{m/2}|\Sigma_w|^{-1/2} \tag{2.63}$

$\text{Exponential terms of Eq.2.62} = \exp\left[ -\frac{1}{2} \left[ (\bm{y}-\Phi_x^T\bm{w})^T\beta I_n(\bm{y}-\Phi_x^T\bm{w}) + \bm{w}^TA\bm{w} - (\bm{w}-\bar{\bm w})^T\Sigma_w(\bm{w}-\bar{\bm w}) \right] \right]$

$-2 \cdot \text{(inside of exp)} = \beta y^Ty - 2\beta\bm{w}^T\Phi_xy + \beta\bm{w}^T\Phi_x\Phi_x^T\bm{w}+\bm{w}^TA\bm{w} - w^T\Sigma_ww+ 2w^T\Sigma_w\bar{w} - \bar{w}^T\Sigma_w\bar{w} \tag{expand}$

$= \beta y^Ty - 2\beta\bm{w}^T\Phi_xy + \bm{w}^T(\beta\Phi_x\Phi_x^T+A)\bm{w} - w^T\Sigma_ww+ 2w^T\Sigma_w\bar{w} - \bar{w}^T\Sigma_w\bar{w} \tag{collect.cancel}$

$= \beta y^Ty - 2\beta\bm{w}^T\Phi_xy + 2w^T\Sigma_w(\beta\Sigma_w^{-1}\Phi_x\bm{y}) - \bar{w}^T\Sigma_w\bar{w} \tag{collect.cancel}$

$= \beta y^Ty - \bar{w}^T(\beta\Phi_x\Phi_x^T+A)\bar{w}$

$= (\bm{y}-\Phi_x^T\bm{\bar{w}})^T\beta I_n(\bm{y}-\Phi_x^T\bm{\bar{w}}) + 2\bar{w}^T\Phi_x\beta\bm{y} - 2\bar{w}^T\Phi_x\Phi_x^T\bar{w} - \bar{w}^TA\bar{w} \tag{complete.square}$

$= (\bm{y}-\Phi_x^T\bm{\bar{w}})^T\beta I_n(\bm{y}-\Phi_x^T\bm{\bar{w}}) + 2\bar{w}^T\Phi_x\beta\bm{y} - 2\bar{w}^T\Sigma_w\beta\Sigma_w^{-1}\Phi_x\bm{y} + \bar{w}^TA\bar{w} \tag{substitute.cancel}$

$= (\bm{y}-\Phi_x^T\bm{\bar{w}})^T\beta I_n(\bm{y}-\Phi_x^T\bm{\bar{w}}) + \bar{w}^TA\bar{w} \tag{another.square}$

[问题：直接代入 $\bar{w}$ 得到了同样的结果，与直接计算的结果相同。直接代入的数学依据是什么？待查资料。 与超哥讨论得知：实际是因为 Eq.2.62 对任意 $\bm{w}$ 成立，取特值代入，达到化简的目的。]

LOG MARGINAL LIKELIHOOD:

$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{\alpha}{2}\|\bm{\bar{w}}\| + \frac{m}{2}\ln\alpha - \frac{1}{2}\ln|\Sigma_w| \tag{2.64}$

[我的更正：in Eq.2.64: $\Phi_x\bar{\bm{w}}\rightarrow\Phi_x^T\bar{\bm{w}}$]

### 验证 Eq.2.64 和 ln(Eq.2.56) 等价

r.h.s of Eq.2.64:

$-2\cdot term1 -2\cdot term4 = \beta (\Phi_x^T\bar{w}-\bm{y})^T(\Phi_x^T\bar{w}-\bm{y}) + \alpha \bar{w}^T\bar{w}$

$= \beta\bm{y}^T\bm{y} + \beta( \bar{w}^T\Phi_x\Phi_x^T\bar{w} - 2 \bar{w}^T\Phi_x\bm{y} ) + \alpha\bar{w}^T\bar{w} \tag{expand}$

$= \beta\bm{y}^T\bm{y} + \beta y^T[\beta^2\Phi_x^T\Sigma_w^{-T}\Phi_x\Phi_x^T\Sigma_w^{-1}\Phi_x]y - 2\beta y^T[\beta\Phi_x^T\Sigma_w^{-T}\Phi_x]y+ \alpha y^T[\beta^2\Phi_x^T\Sigma_w^{-T}\Sigma_w^{-1}\Phi_x]y \tag{expand}$

$= \beta\bm{y}^T\bm{y} + \beta^2\bm{y}^T [\beta\Phi_x^T\Sigma_w^{-T}\Phi_x\Phi_x^T\Sigma^{-1}\Phi_x-2\Phi_x^T\Sigma_w^{-T}\Phi_x+\alpha\Phi_x^T\Sigma_w^{-T}\Sigma^{-1}\Phi_x] \bm{y} \tag{collect}$

$= \beta\bm{y}^T\bm{y} + \beta^2\bm{y}^T \left[ \Phi_x^T\Sigma_w^{-T}( \beta\Phi_x\Phi_x^T+\alpha I_n )\Sigma^{-1}\Phi_x -2\Phi_x^T\Sigma_w^{-T}\Phi_x \right] \bm{y} \tag{collect.inside}$

$= \beta\bm{y}^T\bm{y} + \beta^2\bm{y}^T \left[ \Phi_x^T\Sigma_w^{-T}\Sigma\Sigma^{-1}\Phi_x -2\Phi_x^T\Sigma_w^{-T}\Phi_x \right] \bm{y} \tag{cancel.inverse}$

$= \beta\bm{y}^T\bm{y} - \beta^2\bm{y}^T ( \Phi_x^T\Sigma_w^{-1}\Phi_x ) \bm{y} \tag{proof}$

ln of Eq.2.56

$p(\bm{y}|X,\bm{\theta})=\frac{1}{2\pi^{n/2}} \frac{1}{|\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n|^{1/2}} \exp \left[ -\frac{1}{2} \bm{y}^T (\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n)^{-1}\bm{y} \right] \tag{2.56.expansion}$

$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{n}{2}\ln2\pi - \frac{1}{2}\ln|\Phi_x^{T}A^{-1}\Phi+\sigma^2I_n| -\frac{1}{2} \bm{y}^T (\Phi_x^TA^{-1}\Phi_x+\sigma^2I_n)^{-1}\bm{y} \tag{2.56.log}$

r.h.s of Eq.2.56.log

$-2\cdot term2 = \ln|\sigma^2I_n| + \ln|A^{-1}| +\ln |A+\Phi_x\sigma^{-2}I_n\Phi_x^T|$

$= -n\ln\beta - m\ln\alpha + \ln|\Sigma_w|$

$-2\cdot term3 = \bm{y}^T (\sigma^{-2}I_n - \sigma^{-2}I_n\Phi_x^T(A+\Phi_x\sigma^{-2}I_n\Phi_x^T)^{-1}\Phi_x\sigma^{-2}I_n) \bm{y}$

$= \bm{y}^T (\beta I_n - \beta^2 \Phi_x^T(A+\beta\Phi_x\Phi_x^T)^{-1}\Phi_x) \bm{y}$

$= \beta\bm{y}^T\bm{y} - \beta^2\bm{y}^T\Phi_x^T\Sigma_w^{-1}\Phi_x\bm{y} \tag{same.to.proof}$

Other terms can be easily matched. So, they are equivalent.

PREDICTIVE DISTRIBUTION:

$p(\bm{y}_*|X_*,X,\bm{y},\bm{\theta}) = \mathcal{N}(\bm{y}_*|\mathbb{E}(\bm{y}_*),\mathbb{V}(\bm{y}_*)) \tag{2.70}$

$\mathbb{E}(\bm{y}_*) = \Phi_* \bar{\bm{w}} \tag{2.71}$

$\mathbb{V}(\bm{y}_*) = \Phi_*^T\Sigma_w^{-1}\Phi_* + \sigma^2 I_n \tag{2.72}$

Equivalent to SoR with a different prior on the latent variables: (这部分不是太理解)

$q(\bm{f}_x|\bm{f}_p,X,\bm{\theta}) = \mathcal{N}(\bm{f}_x|\Phi_xA^{-1}\bm{f}_p,0) \tag{2.73}$

$q(\bm{f}_*|\bm{f}_p,X_*,\bm{\theta}) = \mathcal{N}(\bm{f}_*|\Phi_*A^{-1}\bm{f}_p,0) \tag{2.74}$

$q(\bm{f}_p|\bm{\theta}) = \mathcal{N}(\bm{f}_p|0,A) \tag{2.75}$

Similarly, SoR is a BFM with a prior on weights: (这部分不是太理解)

$p(\bm{w}|\bm{\theta}) = \mathcal{N}(\bm{w}|0,K_{pp}^{-1})$

# 3 Extending Sparse Gaussian Process

The remaining sections, unifying ARD with the other features as a single sparse GP framework, are new contributions by this thesis. (p.31)

## 3.1 Automatic Relevance Determination

PRECISION matrix: (a precision parameter per weight)

$A = \text{diag}[\bm{\alpha}],\quad \bm{\alpha}=[\alpha_1,\dots,\alpha_m]^T$

New LOG MARGINAL LIKELIHOOD function: (the old one is Eq.2.64; replace all $\alpha I_n$ with $A$)

$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w| \tag{3.1}$

$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{\alpha}{2}\|\bm{\bar{w}}\| + \frac{m}{2}\ln\alpha - \frac{1}{2}\ln|\Sigma_w| \tag{2.64.repeat.for.compare}$

(关于 RVM 的讨论。更多介绍参考 Tipping 的文章或者 Bishop-pattern-2006) A similar approach was proposed by Tipping (2001), coined as the relevance vector machine (RVM), where the set of basis function locations P was set equal to the locations of the training samples X and held fixed. Only the precision parameter and the values are optimised to determine the relevant set of vectors from the training set. This approach is still computationally expensive and the work discussed in Tipping (2001) proposed an iterative workaround to add and remove vectors incrementally.

## 3.2 Heteroscedastic Noise

$\mathbb{V}(\bm{y}_*) = \Phi_*^T\Sigma_w^{-1}\Phi_* + \sigma^2 I_n \tag{2.72.repeat}$

The predictive variance in Equation (2.72) has two components:

• The first term, $\mathbb{V}(\bm{f}_*)=\Phi_*^T\Sigma_w^{-1}\Phi_*$, is the model variance. The model variance depends on the data density of the training sample at $\bm{x}_*$. Theoretically, this component of the model variance will go to zero as the size of the data set increases. This term hence models our underlying uncertainty about the mean function. The model becomes very con dent about the posterior mean when presented with a large number of samples at $\bm{x}_*$, at which point the predictive variance reduces to the intrinsic noise variance.
• The second term, $\sigma^2I_n$, is the noise uncertainty. At this point, it is assumed to be white Gaussian noise with a fixed precision $\beta = \sigma^{-2}$.

model the function as a linear combination of basis functions: (choose the exponential form to ensure positivity)

$\beta(\bm{x}) = \exp \left[ \bm{\phi}^T(\bm{x})\bm{v} + b \right] \tag{p.33}$

New LIKELIHOOD function: (the old one is Eq.2.47)

$p(\bm{y}|X,\bm{\theta},\bm{w}) = \mathcal{N}(\bm{y}|\Phi_x\bm{w},B^{-1}) \tag{3.3}$

$p(\bm{y}|X,\bm{\theta},\bm{w}) = \mathcal{N}(\bm{y}|\bm{\Phi}_\bm{x}^T\bm{w},\sigma^2 I_n),\quad\sigma^2I_n = \beta^{-1}I_n \tag{2.47.repeat.for.compare}$

where $B=\text{diag}[\beta_1, \dots, \beta_n]$ is the precision matrix.

New POSTERIOR distribution of $\bm{w}$: (the old ones are Eq.2.58–60; replace $\beta I_n$ with $B$.)

$p(\bm{w}|\bm{y},X,\bm{\theta}) = \mathcal{N}(\bm{w}|\bm{\bar{w}},\Sigma_w^{-1}) \tag{3.4}$

$\bm{\bar{w}} = \Sigma_w^{-1}\Phi_xB\bm{y} \tag{3.5}$

$\Sigma_w = \Phi_x B\Phi_x^T + A \tag{3.6}$

New LOG MARGINAL POSTERIOR distribution of $\bm{w}$: (replace $\beta I_n$ with $B$)

$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{1}{2}\bm{\delta}^TB\bm{\delta} + \frac{1}{2}\ln|B| -\frac{n}{2}\ln2\pi - \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w| \tag{3.7}$

$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w| \tag{3.1.repeat.for.compare}$

$\ln p(\bm{y}|X,\bm{\theta}) = -\frac{\beta}{2}\|\Phi_x^T\bm{\bar{w}}-\bm{y}\|^2 + \frac{n}{2}\ln\beta -\frac{n}{2}\ln2\pi - \frac{\alpha}{2}\|\bm{\bar{w}}\| + \frac{m}{2}\ln\alpha - \frac{1}{2}\ln|\Sigma_w| \tag{2.64.repeat.for.compare}$

where $\delta = \bm{y} - \Phi_x^T\bm{\bar{w}}$.

A prior on the weights $\bm{v}$ in Eq.p.33 to favour the simplest precision function $\beta_i$, namely that $\bm{v}$ is normally distributed with a mean of 0 and a diagonal precision matrix $T = \text{diag}[\tau_1, \dots, \tau_m]$:

$p(\bm{v}|\bm{\tau}) = \mathcal{N}(\bm{v}|0,T) \tag{p.35}$

Since

$\ln p(\bm{y}|X,\bm{\theta}) = \ln \left( p(\bm{y}|X,\bm{\theta},\bm{\nu})p(\bm{\nu}|\bm{\tau}) \right)$

the final new log marginal likelihood function is:

$\mathcal{L}(\mathcal{D},\bm{\theta}) = \ln p(\bm{y}|X,\bm{\theta}) = -\frac{1}{2}\bm{\delta}^TB\bm{\delta} + \frac{1}{2}\ln|B| -\frac{n}{2}\ln2\pi$

$- \frac{1}{2}\bm{\bar{w}}^TA\bm{\bar{w}} + \frac{1}{2}\ln|A| - \frac{1}{2}\ln|\Sigma_w|$

$-\frac{1}{2}\bm{v}^TT\bm{v} + \frac{1}{2}\ln|T| - \frac{m}{2}\ln2\pi \tag{3.9}$

where $\bm{\tau}$ acts as an automatic relevance determination cost for the noise process, allowing the objective to dynamically select different sets of relevant basis functions for both the posterior mean and variance estimation.

All hyperparameters are

$\bm{\theta} = \left[ \bm{\varphi}, \bm{\alpha}, \bm{\nu}, b \right]$

New predictive distribution of one sample $\bm{x}_*$,

$p(y_*|\bm{x}_*,\mathcal{D},\bm{\theta}) = \mathcal{N}(y_*|\mathbb{E}(y_*),\mathbb{V}(y_*)) \tag{3.10}$

$\mathbb{E}(y_*) = \bm{\phi}(x_*)^T \bm{\bar{w}}, \quad\bm{\bar{w}}\text{ is obtained through training} \tag{3.11}$

$\mathbb{V}(y_*) = \bm{\phi}(x_*)^T \Sigma_w^{-1} \bm{\phi}(x_*) + \beta^{-1}(\bm{x}_*), \quad\beta\text{ is obtained by Eq.p.33} \tag{3.12}$

## 3.5 Prior Mean Function

$\mathcal{L}(\mathcal{D},\hat{\bm{\theta}}) = -\frac{1}{2} \hat{\bm{\delta}}^T\Omega B\hat{\bm{\delta}} + \frac{1}{2} {\rm tr}(\Omega\odot\ln B) - \frac{1}{2}{\rm tr}\ln(\Omega\odot\ln2\pi)$

$= \frac{1}{2} (\hat{\bar{\bm{w}}}-\hat{\bm{a}})^T\hat{A}(\hat{\bar{\bm{w}}}-\hat{\bm{a}}) + \frac{1}{2}\ln|\hat{A}| - \frac{1}{2}\ln|\hat{\Sigma_w}|$

$= \frac{1}{2}\bm{v}^TT\bm{v} + \frac{1}{2}\ln|T| - \frac{m}{2}\ln2\pi \tag{3.36}$

# 4 Optimisation

• BFGS
• L-BFGS

## 4.2 The General Case

Using the properties of matrix derivatives (Petersen and Pedersen, 2012) (学这本书中的矩阵微分性质)

## 4.3 The Sigmoidal Function

TIME COMPLEXITY:

Model | Time complexity

• ANN(l-layers) | O(nmd+(l-1)(nm^2))
• FITC | $O(nmd+nm^2)$
• GPVL | $O(nmd+nm^2)$
• GPVD | $O(nmd+nm^2)$
• GPVC | $O(nmd^2+nm^2)$
• Full GP | $O(n^3)$

where $d$ is the dimension of input $\bm{x}$, $n$ is the number of training data, $m$ is the number of basis functions.

# 5 Noisy or Missing Variables

(p.63) We will first address how to predict with known input noise in Section 5.1.1, since this can be addressed independently of the training data.

(p.64) In Section 5.2.1, we consider the case in which training data inputs are uncertain by approximating the expected value of the marginal likelihood.

• 是不是要先知道输入的分布才行？
• 为什么和 training 是分开的？

(p.70) It requires us to have a probabilistic model for each possible set of missing variables, which grows exponentially large as the dimensionality of the input increases.

# 6 Photometric Redshift Estimation

(p.126) The former, the model uncertainty, helps in identifying regions where more data is needed, whereas the latter, the intrinsic noise uncertainty, helps in identifying regions where better or more precise features are needed.

(p.133) Finally, we introduced the capability of handling missing photometry, not present in any other method apart from random forests. This is particularly useful when using a model trained on one survey to predict the photometric redshift on another survey that does not share the same photometry but overlaps.