GPstuff Learning Notes (GPstuff document v4.6)
Notes taken while reading the source code of the functions in the GPstuff package, recording things from the in-code comments; some of this is not covered in the official document.
Excerpts and notes on the theory part
From prior to posterior predictive
Bayesian inference
Observation model: $\mathbf{y} \mid \mathbf{f}, \phi \sim \prod_{i=1}^{n} p(y_i \mid f_i, \phi)$
GP prior: $f(\mathbf{x}) \mid \theta \sim \mathcal{GP}\bigl(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}' \mid \theta)\bigr)$
hyperprior: $\theta, \phi \sim p(\theta)\, p(\phi)$
The latent function value $f_i = f(\mathbf{x}_i)$ at a fixed $\mathbf{x}_i$ is called a latent variable.
Any set of function values $\mathbf{f}$ has a multivariate Gaussian distribution, $\mathbf{f} \mid X, \theta \sim \mathcal{N}(0, K_{f,f})$ (assuming a zero mean function).
To predict the values $\tilde{\mathbf{f}}$ at new input locations $\tilde{X}$, the joint distribution is
$$\begin{bmatrix} \mathbf{f} \\ \tilde{\mathbf{f}} \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K_{f,f} & K_{f,\tilde{f}} \\ K_{\tilde{f},f} & K_{\tilde{f},\tilde{f}} \end{bmatrix}\right)$$
The conditional distribution of $\tilde{\mathbf{f}}$ given $\mathbf{f}$ is
$$\tilde{\mathbf{f}} \mid \mathbf{f} \sim \mathcal{N}\bigl(K_{\tilde{f},f} K_{f,f}^{-1} \mathbf{f},\; K_{\tilde{f},\tilde{f}} - K_{\tilde{f},f} K_{f,f}^{-1} K_{f,\tilde{f}}\bigr)$$
So the conditional distribution of the latent function is also a GP with
- conditional mean function: $m(\tilde{\mathbf{x}} \mid \mathbf{f}) = k(\tilde{\mathbf{x}}, X)\, K_{f,f}^{-1} \mathbf{f}$
- conditional covariance function: $k(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}' \mid \mathbf{f}) = k(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}') - k(\tilde{\mathbf{x}}, X)\, K_{f,f}^{-1}\, k(X, \tilde{\mathbf{x}}')$ (not sure whether the notation on the left-hand side of the equals sign is correct)
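The conditioning formulas above can be checked numerically. Below is a minimal sketch (plain numpy, not GPstuff code; the squared exponential covariance and all data are made up) verifying that, at a training input, the conditional mean reproduces the latent value and the conditional variance is zero:

```python
import numpy as np

def k_sexp(x1, x2, magn_sigma2=1.0, length_scale=0.3):
    """Squared exponential covariance between two 1-D input vectors."""
    d = x1[:, None] - x2[None, :]
    return magn_sigma2 * np.exp(-0.5 * (d / length_scale) ** 2)

X = np.array([0.0, 0.5, 1.0])            # training inputs
Xs = np.array([0.5, 0.8])                # test inputs; 0.5 is also a training input
f = np.array([1.0, -0.3, 0.7])           # latent values at X

K = k_sexp(X, X) + 1e-10 * np.eye(3)     # small jitter for numerical stability
Ksf = k_sexp(Xs, X)
Kss = k_sexp(Xs, Xs)

# Conditional mean and covariance of f~ given f (noise-free conditioning)
cond_mean = Ksf @ np.linalg.solve(K, f)
cond_cov = Kss - Ksf @ np.linalg.solve(K, Ksf.T)
```

At the repeated training input the conditional distribution collapses onto the known latent value, exactly as the formulas predict.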
The above is a purely theoretical derivation, made before obtaining any observations, so $\mathbf{y}$ is not involved yet; only the latent variables and the latent function are.
From here on, we consider inference after observing data.
The first inference step is to form the conditional posterior of the latent variables given the parameters $(\theta, \phi)$,
$$p(\mathbf{f} \mid \mathcal{D}, \theta, \phi) = \frac{p(\mathbf{y} \mid \mathbf{f}, \phi)\, p(\mathbf{f} \mid X, \theta)}{\int p(\mathbf{y} \mid \mathbf{f}, \phi)\, p(\mathbf{f} \mid X, \theta)\, \mathrm{d}\mathbf{f}}.$$
(For now assume this has been obtained; it depends on the choice or design of the observation model. How to compute it is discussed later; in practice, every model except the classical Gaussian GP requires approximate methods.)
After this, we can marginalize over the parameters to obtain the marginal posterior distribution for the latent variables,
$$p(\mathbf{f} \mid \mathcal{D}) = \int p(\mathbf{f} \mid \mathcal{D}, \theta, \phi)\, p(\theta, \phi \mid \mathcal{D})\, \mathrm{d}\theta\, \mathrm{d}\phi.$$
The conditional posterior predictive distribution $p(\tilde{f} \mid \mathcal{D}, \theta, \phi)$ can be evaluated exactly or approximated. (Again, how to compute it is discussed later; for now assume it has been obtained.)
The posterior predictive distribution is obtained by marginalizing out the parameters from $p(\tilde{f} \mid \mathcal{D}, \theta, \phi)$.
The posterior joint predictive distribution of the whole vector $\tilde{\mathbf{f}}$ requires the same integration over the parameters. (Usually not used.)
The marginal predictive distribution for an individual $\tilde{f}_i$ is
$$p(\tilde{f}_i \mid \mathcal{D}) = \int p(\tilde{f}_i \mid \mathcal{D}, \theta, \phi)\, p(\theta, \phi \mid \mathcal{D})\, \mathrm{d}\theta\, \mathrm{d}\phi.$$
If the parameters are considered fixed, using the GP's marginalization and conditioning properties (everything stays Gaussian), we can evaluate the posterior predictive mean from the conditional mean $\mathrm{E}[\tilde{f} \mid \mathbf{f}] = k(\tilde{\mathbf{x}}, X)\, K_{f,f}^{-1} \mathbf{f}$ (derived above) by marginalizing out the latent variables $\mathbf{f}$,
$$\mathrm{E}[\tilde{f} \mid \mathcal{D}] = \mathrm{E}_{\mathbf{f} \mid \mathcal{D}}\bigl[\mathrm{E}[\tilde{f} \mid \mathbf{f}]\bigr] = k(\tilde{\mathbf{x}}, X)\, K_{f,f}^{-1}\, \mathrm{E}[\mathbf{f} \mid \mathcal{D}].$$
The posterior predictive covariance between any set of latent variables is (this step uses the law of total covariance; see Wikipedia)
$$\mathrm{Cov}[\tilde{\mathbf{f}} \mid \mathcal{D}] = \mathrm{E}_{\mathbf{f} \mid \mathcal{D}}\bigl[\mathrm{Cov}[\tilde{\mathbf{f}} \mid \mathbf{f}]\bigr] + \mathrm{Cov}_{\mathbf{f} \mid \mathcal{D}}\bigl[\mathrm{E}[\tilde{\mathbf{f}} \mid \mathbf{f}]\bigr].$$
Then, the posterior predictive covariance function is
$$k_p(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}') = k(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}') - k(\tilde{\mathbf{x}}, X)\, K_{f,f}^{-1}\, k(X, \tilde{\mathbf{x}}') + k(\tilde{\mathbf{x}}, X)\, K_{f,f}^{-1}\, \mathrm{Cov}[\mathbf{f} \mid \mathcal{D}]\, K_{f,f}^{-1}\, k(X, \tilde{\mathbf{x}}').$$
So, even if the exact posterior of $\mathbf{f}$ is not Gaussian, knowing (or approximating) its mean and covariance is enough to evaluate the posterior predictive mean and covariance.
From latents to observations
Gaussian observation model: $\mathbf{y} \mid \mathbf{f}, \sigma^2 \sim \mathcal{N}(\mathbf{f}, \sigma^2 I)$
The marginal likelihood is
$$p(\mathbf{y} \mid X, \theta, \sigma^2) = \mathcal{N}(\mathbf{y} \mid 0, K_{f,f} + \sigma^2 I).$$
The conditional posterior of the latent variables now has an analytical solution,
$$p(\mathbf{f} \mid \mathcal{D}, \theta, \sigma^2) = \mathcal{N}\bigl(\mathbf{f} \mid K_{f,f}(K_{f,f} + \sigma^2 I)^{-1}\mathbf{y},\; K_{f,f} - K_{f,f}(K_{f,f} + \sigma^2 I)^{-1}K_{f,f}\bigr)$$
(this should be derived by completing the square; Bishop's book or the GPML book has the details).
Since the conditional posterior of $\mathbf{f}$ is Gaussian, the posterior process is still a GP, whose mean and covariance functions are obtained from Eqs. (11) and (13).
Equation (15) above directly gives $\mathrm{E}[\mathbf{f} \mid \mathcal{D}]$ and $\mathrm{Cov}[\mathbf{f} \mid \mathcal{D}]$; substituting them into (11) and (13) yields the posterior predictive mean and covariance functions.
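For the Gaussian observation model, the two routes to the predictions can be checked against each other numerically: the direct GP-regression formulas versus first forming $\mathrm{E}[\mathbf{f} \mid \mathcal{D}]$ and $\mathrm{Cov}[\mathbf{f} \mid \mathcal{D}]$ and substituting them into the predictive equations. A sketch with made-up data (plain numpy, not GPstuff code):

```python
import numpy as np

def k_sexp(x1, x2, magn_sigma2=1.0, length_scale=0.2):
    """Squared exponential covariance between two 1-D input vectors."""
    d = x1[:, None] - x2[None, :]
    return magn_sigma2 * np.exp(-0.5 * (d / length_scale) ** 2)

X = np.linspace(0.0, 1.0, 8)               # training inputs
y = np.sin(2 * np.pi * X)                  # targets, just for illustration
Xs = np.array([0.15, 0.6])                 # test inputs
sigma2 = 0.01                              # Gaussian noise variance

K = k_sexp(X, X) + 1e-8 * np.eye(8)        # prior covariance (with jitter)
Ks = k_sexp(Xs, X)
Kss = k_sexp(Xs, Xs)
Ky = K + sigma2 * np.eye(8)

# Route 1: direct GP regression formulas
mean_direct = Ks @ np.linalg.solve(Ky, y)
cov_direct = Kss - Ks @ np.linalg.solve(Ky, Ks.T)

# Route 2: posterior of f first, then the predictive equations
Ef = K @ np.linalg.solve(Ky, y)            # E[f | D]
Covf = K - K @ np.linalg.solve(Ky, K)      # Cov[f | D]
A = np.linalg.solve(K, Ks.T).T             # k(x~, X) K^{-1}
mean_via_post = A @ Ef
cov_via_post = Kss - A @ Ks.T + A @ Covf @ A.T
```

Both routes agree, which is exactly why knowing the (possibly approximate) posterior mean and covariance of $\mathbf{f}$ is sufficient for predictions.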
Digging into Demos
demo_inputdependentnoise.m
All 'type','FULL'.
- lik_inputdependentnoise + gpcf_sexp + gpcf_exp (prior_t for lengthScale_prior and magnSigma2_prior) + 'latent_method','Laplace' + gp_optim
- lik_t + gpcf_sexp + 'latent_method','Laplace' + gp_optim
1D and 2D data.
(line 241) "if flat priors are used, there might be need to increase gp.latent_opt.maxiter for laplace algorithm to converge properly. gp.latent_opt.maxiter=1e6;" (?? this could be the reason my runs fail to converge)
demo_regression_robust.m
All 'type','FULL'.
- lik_gaussian (prior_logunif) + gpcf_sexp (prior_t and prior_sqrtunif) + gp_optim
- lik_t (prior_loglogunif, prior_logunif) + gpcf_sexp + 'latent_method','EP' + gp_optim
- lik_t (prior_logunif) + gpcf_sexp + 'latent_method','MCMC' + gp_mc
Digging into the code implementation
Note! If the prior is ‘prior_fixed’ then the parameter in question is considered fixed and it is not handled in optimization, grid integration, MCMC etc.
The whole package uses structures to implement an OOP-like design, because MATLAB's OOP support was still poor at the time of development. This makes code locations hard to track down.
Setting up the GP structure: gp_set
type
- Type of Gaussian process
- 'FULL' full GP (default)
- 'FIC' fully independent conditional sparse approximation (requires inducing points X_u)
- 'PIC' partially independent conditional sparse approximation
- 'CS+FIC' compact support + FIC model sparse approximation
- 'DTC' deterministic training conditional sparse approximation
- 'SOR' subset of regressors sparse approximation
- 'VAR' variational sparse approximation
infer_params
- String defining which parameters are inferred. The default is covariance+likelihood.
- 'covariance' = infer parameters of the covariance functions
- 'likelihood' = infer parameters of the likelihood
- 'inducing' = infer inducing inputs (in sparse approximations): W = gp.X_u(:)
- 'covariance+likelihood' = infer covariance function and likelihood parameters (what is the concrete difference between the combinations? not entirely clear to me)
- 'covariance+inducing' = infer covariance function parameters and inducing inputs
- 'covariance+likelihood+inducing'
The additional fields when the likelihood is not Gaussian (lik is not lik_gaussian or lik_gaussiansmt) are latent_method and latent_opt.
latent_method
- Method for marginalizing over latent values. (What does this mean? When computing the predictive distribution with the likelihood, the latent values have to be marginalized, i.e. integrated, out; see GPstuff Doc Eq. (10).) Possible methods are 'Laplace' (default), 'EP' and 'MCMC'.
latent_opt
- Additional option structure for the chosen latent method. See default values for options below.
- 'MCMC'
  - method - Function handle to a function which samples the latent values: @esls (default), @scaled_mh or @scaled_hmc
  - f - 1xn vector of latent values. The default is [].
- 'Laplace'
  - optim_method - Method to find the posterior mode: 'newton' (default except for lik_t), 'stabilized-newton', 'fminuc_large', or 'lik_specific' (applicable and default for lik_t)
  - tol
- 'EP'
- 'robust-EP'
The additional fields needed in sparse approximations are:
X_u
- Inducing inputs, no default, has to be set when FIC, PIC, PIC_BLOCK, VAR, DTC, or SOR is used.
Xu_prior
- Prior for inducing inputs. The default is prior_unif.
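For intuition about the role of the inducing inputs X_u: the sparse approximations replace the exact prior covariance $K_{f,f}$ with a low-rank matrix built from the covariances between the training and inducing inputs. A sketch of the standard SoR/DTC and FIC prior approximations (following the Quiñonero-Candela and Rasmussen taxonomy; this is not GPstuff's internal code, and the inputs are made up):

```python
import numpy as np

def k_sexp(x1, x2, magn_sigma2=1.0, length_scale=0.3):
    """Squared exponential covariance between two 1-D input vectors."""
    d = x1[:, None] - x2[None, :]
    return magn_sigma2 * np.exp(-0.5 * (d / length_scale) ** 2)

X = np.linspace(0.0, 1.0, 20)     # training inputs
Xu = np.linspace(0.0, 1.0, 5)     # inducing inputs (gp.X_u)

Kff = k_sexp(X, X)
Kfu = k_sexp(X, Xu)
Kuu = k_sexp(Xu, Xu) + 1e-8 * np.eye(5)

# SoR/DTC use only the low-rank part; FIC additionally corrects the diagonal
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
fic_prior = Qff + np.diag(np.diag(Kff - Qff))
```

FIC keeps the exact prior variances on the diagonal while routing all off-diagonal covariances through the m inducing points, which is what reduces the cost from O(n^3) to O(n m^2).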
gp_optim
Optimize parameters of a Gaussian process
gp_mc
- hmc_opt - Options structure for the HMC sampler (see hmc2_opt). When this is given, the covariance function and likelihood parameters are sampled with hmc2 (respecting the infer_params option).
- sls_opt - Options structure for the slice sampler (see sls_opt). When this is given, the covariance function and likelihood parameters are sampled with sls (respecting the infer_params option).
- latent_opt - Options structure for the latent variable sampler. When this is given, the latent variables are sampled with the function stored in the gp.fh.mc field of the GP structure. See gp_set. (The 'latent_opt' set in gp_set via 'latent_method','MCMC','latent_opt',struct('method',@scaled_mh) is NOT the same as this latent_opt!! For instance, in that example this latent_opt actually sets the options of scaled_mh. Easy to confuse!)
- lik_hmc_opt - Options structure for the HMC sampler (see hmc2_opt). When this is given, the parameters of the likelihood are sampled with hmc2. This can be used to have different HMC options for covariance and likelihood parameters.
- lik_sls_opt - Options structure for the slice sampler (see sls_opt). When this is given, the parameters of the likelihood are sampled with sls (the original help text says hmc2, apparently a copy-paste slip). This can be used to have different slice-sampling options for covariance and likelihood parameters.
- lik_gibbs_opt - Options structure for the Gibbs sampler. Some likelihood function parameters need to be sampled with Gibbs sampling (such as lik_smt). The Gibbs sampler is implemented in the respective lik_* file.
*_pak and *_unpak
- *_pak: combine the * parameters into one vector.
- *_unpak: extract the * parameters from the vector.
For lik_*_pak and lik_*_unpak, this is a mandatory subfunction used, for example, in energy and gradient computations (calculated by gp_eg through calling gp_e and gp_g).
Likelihood functions I want to use
lik_gaussian
Create a Gaussian likelihood structure
- sigma2 - variance [0.1]
- sigma2_prior - prior for sigma2 [prior_logunif] (so the default is uniform on the log scale, not uniform on the original scale?)
- n - number of observations per input (see "using average observations" below) (do not use this parameter; it is used for averaging sigma2)
lik_t
Create a Student-t likelihood structure
Parameters for Student-t likelihood [default]
- sigma2 - scale squared [1]
- nu - degrees of freedom [4] (the degrees of freedom is usually kept fixed)
- sigma2_prior - prior for sigma2 [prior_logunif] (why logunif?)
- nu_prior - prior for nu [prior_fixed]
Can be inferred by:
- Laplace approximation (needs lik_t_ll, lik_t_llg, lik_t_llg2, lik_t_llg3)
- MCMC (needs lik_t_ll, lik_t_llg)
- EP (needs lik_t_llg2, lik_t_tiltedMoments, lik_t_siteDeriv)
- robust-EP (needs lik_t_tiltedMoments2, lik_t_siteDeriv2)
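For reference, a plain-Python sketch of the Student-t log density that lik_t_ll evaluates (parameterized as above: sigma2 is the scale squared, nu the degrees of freedom; this is the standard formula, not code copied from GPstuff). The comparison with a Gaussian shows the heavier tails that make the likelihood robust:

```python
from math import lgamma, log, pi

def student_t_logpdf(y, f, sigma2=1.0, nu=4.0):
    """log p(y | f) for the Student-t observation model."""
    r2 = (y - f) ** 2
    return (lgamma((nu + 1) / 2) - lgamma(nu / 2)
            - 0.5 * log(nu * pi * sigma2)
            - (nu + 1) / 2 * log(1 + r2 / (nu * sigma2)))

def gaussian_logpdf(y, f, sigma2=1.0):
    """log p(y | f) for the Gaussian observation model, for comparison."""
    return -0.5 * log(2 * pi * sigma2) - (y - f) ** 2 / (2 * sigma2)
```

An observation five scales away from the latent mean is much less improbable under the Student-t model, so a few outliers distort the fit far less than under lik_gaussian.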
lik_gaussiansmt
Create a Gaussian scale mixture likelihood structure with priors producing approximation of the Student’s t
The parameters of this likelihood can be inferred only by Gibbs sampling by calling GP_MC.
Covariance functions I want to use
gpcf_sexp
Create a squared exponential (exponentiated quadratic) covariance function
- magnSigma2 - magnitude (squared) [0.1]
- lengthScale - length scale for each input [1]. This can be either a scalar, corresponding to an isotropic function, or a vector defining a separate length scale for each input direction. (A different length scale per input gives automatic relevance determination.)
- magnSigma2_prior - prior for magnSigma2 [prior_logunif] (why logunif? to guarantee positivity?)
- lengthScale_prior - prior for lengthScale [prior_t] (why prior_t? no need to guarantee positivity?)
- metric - metric structure used by the covariance function [] (not yet understood)
- selectedVariables - vector defining which inputs are used [all]; selectedVariables is shorthand for using metric_euclidean with the corresponding components
- kalman_deg - degree of approximation in type 'KALMAN' [6] (not yet understood)
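A small sketch of this covariance in plain Python/numpy (same parameterization as above; the function name and defaults are my own), illustrating how a per-dimension length scale implements ARD:

```python
import numpy as np

def sexp_cov(x1, x2, magn_sigma2=0.1, length_scale=(1.0, 1.0)):
    """k(x, x') = magnSigma2 * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)."""
    ls = np.asarray(length_scale)
    r2 = np.sum(((np.asarray(x1) - np.asarray(x2)) / ls) ** 2)
    return magn_sigma2 * np.exp(-0.5 * r2)
```

With a very large length scale on one dimension the covariance barely reacts to differences in that input, effectively pruning it; this is why optimizing a vector lengthScale performs automatic relevance determination.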
In actual use, taking demo_regression1.m and swapping lengthScale_prior from prior_unif to prior_logunif, and magnSigma2_prior from prior_logunif to prior_unif, had little effect on the MAP result. (My guess is that the optimization never reached the corresponding negative values, so no error was triggered; inputParser.parse explicitly requires magnSigma2 and lengthScale to be positive.)
Subfunctions
gpcf_sexp_lp: evaluates the log prior of the covariance function parameters.
Priors I want to use
prior_unif
prior_sqrtunif
Uniform prior structure for the square root of the parameter
Meaning: if the parameter is $\theta$, then $\sqrt{\theta}$ is uniformly distributed. Suitable when, e.g., a variance $\sigma^2$ is treated as the parameter as a whole, but the standard deviation $\sigma$ is required to be uniformly distributed. (But shouldn't this require the parameter to be positive?)
prior_logunif
Uniform prior structure for the logarithm of the parameter
Meaning: if the parameter is $\theta$, then $\log\theta$ is uniformly distributed.
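Concretely, "uniform on $\log\theta$" implies $p(\theta) \propto 1/\theta$ by the change of variables. A quick pure-Python check, assuming (my assumption, so the distribution is proper) that the parameter is restricted to an interval $[a, b]$:

```python
from math import log

a, b = 0.1, 10.0   # assumed support, so the log-uniform prior is proper

def logunif_cdf(theta):
    """CDF of theta when log(theta) is uniform on [log a, log b]."""
    return (log(theta) - log(a)) / (log(b) - log(a))

def density_numeric(theta, h=1e-6):
    """Density obtained by numerically differentiating the CDF."""
    return (logunif_cdf(theta + h) - logunif_cdf(theta - h)) / (2 * h)

def density_analytic(theta):
    """Density from the change of variables: proportional to 1/theta."""
    return 1.0 / (theta * log(b / a))
```

This $1/\theta$ density is scale-invariant, which is presumably why logunif is the default for strictly positive magnitude parameters such as sigma2 and magnSigma2.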
prior_t
Student-t prior structure
Parameters for Student-t prior [default]
- mu - location [0]
- s2 - scale [1]
- nu - degrees of freedom [4]
- mu_prior - prior for mu [prior_fixed] (fixed, surprisingly; why? is that reasonable?)
- s2_prior - prior for s2 [prior_fixed] (fixed, surprisingly; why? is that reasonable?)
- nu_prior - prior for nu [prior_fixed] (fixed, surprisingly; why? is that reasonable?)
If the parameter is $\theta$, then $\theta$ follows a Student-t distribution with location mu, scale s2 and degrees of freedom nu. By default the hyperparameters are all fixed. (Is $\theta$ arbitrary? Must it be positive? Must it be an integer?)
In prior_t_pak, s2 and nu are log-transformed.
Check how the demos set the priors for prior_t.
/Volumes/ExternalDisk/git-collections/gpstuff/gp/demo_hierprior.m
pl = prior_t('mu_prior', prior_t); (not yet reviewed)
Other hidden functions
gp_eg calls gp_e and gp_g
- GP_EG: Evaluate the energy function (un-normalized negative marginal log posterior) and its gradient
- GP_E: Evaluate the energy function (un-normalized negative log marginal posterior)
- GP_G: Evaluate the gradient of the energy (GP_E) for a Gaussian process
The energy is the minus-log-posterior cost function
$$E(\theta) = -\log p(\mathbf{y} \mid X, \theta) - \log p(\theta),$$
where $\theta$ represents the parameters (lengthScale, magnSigma2, ...), $X$ is the inputs and $\mathbf{y}$ is the observations (regression) or latent values (non-Gaussian likelihood).
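For a Gaussian likelihood the data term of this energy is the closed-form negative log marginal likelihood. A numpy sketch of the structure gp_e evaluates (not GPstuff's actual implementation; the log_prior argument stands in for the prior term):

```python
import numpy as np

def gp_energy(y, K, sigma2, log_prior=0.0):
    """E = -log p(y | X, theta) - log p(theta) for a Gaussian likelihood.

    -log p(y | X, theta) = 0.5 y^T Ky^{-1} y + 0.5 log|Ky| + (n/2) log(2*pi),
    with Ky = K + sigma2 * I, computed via a Cholesky factorization.
    """
    n = len(y)
    Ky = K + sigma2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Ky^{-1} y
    neg_log_ml = (0.5 * y @ alpha
                  + np.sum(np.log(np.diag(L)))            # 0.5 log|Ky|
                  + 0.5 * n * np.log(2 * np.pi))
    return neg_log_ml - log_prior
```

gp_optim minimizes exactly this kind of quantity (via gp_eg), so the MAP estimate balances data fit, model complexity (the log-determinant term), and the priors.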
Some conclusions and problems from the experiments so far
- (not sure why) Do not use lik_gaussiansmt.
- lik_inputdependentnoise() only supports 'type','FULL'; see the switch at gpla_e.m line 162; there is no corresponding support in the FIC branch.
- 'latent_method','MCMC' cannot be used together with gp_optim (100% sure):
  - With FIC, line 556 of gp_g (reached via gp_eg) only computes the gradient w.r.t. Gaussian likelihood function parameters; the non-Gaussian + MCMC case is not considered.
  - PIC: line 755, same omission. CS+FIC: line 996, same. DTC, VAR, SOR: line 1179, same. KALMAN: uncertain.
- Derivative observations have not been implemented for sparse GPs!!! (see gp_trcov.m line 54)
- The algorithm used to sample the latent variables is set in gp_set/latent_method. (see gp_mc line 341)
- When using lik_t + gp_mc, latent_opt and lik_hmc_opt must be set explicitly; otherwise gp_mc will skip sampling one of the parameter groups. (not sure) (also not sure whether this holds for other likelihoods)
Lessons learned
When the error "Matrix dimensions must agree." is reported, it is usually because gp_g does not handle the corresponding gradient; make sure to choose a matching latent_method.
Experiment log
1. FULL + lik_gaussian + gpcf-sexp + MAP: success
2. FULL + lik_gaussian + gpcf-sexp + MCMC: success
3. FIC + lik_gaussian + gpcf-sexp + MAP: success
4. FIC + lik_gaussian + gpcf-sexp + MCMC: success
5. FULL + lik_t + jitter-1e-3 + ARD + Laplace + train-only-around-15-steps: results look perfect; the reported 1,2,3-SD coverage is always 100% (why is the overfitting so severe???)
6. FULL + lik_t + jitter-1e-3 + Laplace + train-only-9-steps: the jitter has to be reduced; it works, but not well.
7. FIC-25 + lik_t + jitter-1e-3 + ARD + Laplace + train-only-around-15-steps:
8. FULL + lik_t + jitter-1e-6 + ARD + Laplace + train-200 + sample-100-no-thinning: works well, PML=1.29%, QML123=100%, about 1056 seconds
9. FIC-25 + covariance+likelihood+inducing + lik_t + jitter-1e-6 + ARD + Laplace + train-200 + sample-100-thin-60-1: works well, PML=3.97%, QML123=100%, about 4 hours of computation
Unsuccessful experiments:
10. FIC-25 + covariance+likelihood+inducing + lik_t + jitter-1e-6 + ARD + Laplace + train-1000 + sample-100-thin-60-1: after more than 14 hours of computation, only 16 samples were finished...
11. FULL + lik_t + jitter-1e-6 + ARD + 'latent_method','MCMC' + train-200 + sampleXXX: compared with experiment 8, no regression effect at all; I suspect some parameter settings are wrong.
To test, and then summarize the reasons:
- FIC-25 + lik_t + MCMC + gp_mc + GPz-init: the optimized lengthScale values all come out the same
- FULL + lik_t + Laplace + gpcf-sexp + MAP
- FULL + lik_t + Laplace + gpcf-sexp + MCMC
- FULL + lik_t + EP + gpcf-sexp + MAP
- FULL + lik_t + EP + gpcf-sexp + MCMC
- FULL + lik_t + MCMC + gpcf-sexp + MAP
- FULL + lik_t + MCMC + gpcf-sexp + MAP
Questions
How to implement early stopping?
How to implement the relevance vector machine (RVM)?
How to implement GPz?