
Add Thomas' feedback

Markus Kaiser 11 months ago
commit 385a691ca5
4 changed files with 59 additions and 59 deletions
  1. additional.bib (+2, -2)
  2. bayesian_warped_dependent_gp.pdf (BIN)
  3. bayesian_warped_dependent_gp.tex (+25, -25)
  4. zotero_export.bib (+32, -32)

additional.bib (+2, -2)

@@ -1,6 +1,6 @@
 @book{Rasmussen:2005te,
 author = {Rasmussen, Carl Edward and Williams, Christopher K I},
-title = {{Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)}},
+title = {{{G}aussian Processes for Machine Learning (Adaptive Computation and Machine Learning)}},
 publisher = {The MIT Press},
 year = {2006}
 }
@@ -14,7 +14,7 @@ url = {http://jmlr.org/proceedings/papers/v33/duvenaud14.pdf}
 
 @incollection{NIPS2003_2481,
 author = {Snelson, Edward and Rasmussen, Carl Edward and Ghahramani, Zoubin },
-title = {{Warped Gaussian Processes}},
+title = {{Warped {G}aussian Processes}},
 year = {2004},
 pages = {337--344},
 publisher = {MIT Press}
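
The edits in this file (and in zotero_export.bib below) apply the standard BibTeX convention of brace-protecting capital letters in titles, so that bibliography styles which lowercase titles still print "Gaussian" and "Bayesian" correctly. A minimal illustrative entry showing the convention (the entry itself is hypothetical and not taken from the repository):

@article{example_protected_capitals,
	title = {Warped {G}aussian Processes for {B}ayesian Alignment},
	author = {Doe, Jane},
	year = {2018}
}

The braces around {G} and {B} shield single letters from case changes applied by the style, while double-bracing an entire title, as in the Rasmussen entry above, freezes its capitalisation as written.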

bayesian_warped_dependent_gp.pdf (BIN)


bayesian_warped_dependent_gp.tex (+25, -25)

@@ -96,7 +96,7 @@ This is very beneficial due to the marginalization property of a Gaussian which
 
 In this paper we combine the benefits of these approaches and propose a hierarchical, warped and multi-output Gaussian process.
 We derive an efficient learning scheme via an approximation to the marginal likelihood which allows us to fully exploit the regularisation provided by our structure.
-The model we propose is highly interpretable, able to learn from small amounts of data and generalises to a large range of different problem settings.
+The model we propose is highly interpretable, able to learn from small amounts of data and generalizes to a large range of different problem settings.
 
 
 \section{Model Definition}
@@ -140,7 +140,7 @@ Using the $K$-dimensional squared exponential kernel
 \end{split}
 \end{align}
 as the smoothing kernel function for all $T_{d, r}$, the integral can be shown to have a closed form solution.
-With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyperparameters associated with $T_{d,r}$, it is given by
+With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyper parameters associated with $T_{d,r}$, it is given by
 \begin{align}
 \label{eq:dependent_kernel}
 \begin{split}
@@ -210,7 +210,7 @@ While the original model in \cref{eq:full_model} can be recovered exactly by mar
 A central assumption of this approximation \parencite{titsias_variational_2009} is that given enough inducing variables at the correct location, they are a sufficient statistic for $\rv{\hat{a}}$, implying conditional independence of $\rv{\hat{a}}$ and $\mat{X}$ given $\rv{u}$.
 We introduce such inducing variables for every GP in the model, yielding the set $\Set{\rv{u_{a, d}}, \rv{u_{f, d}}, \rv{u_{g, d}}}_{d=1}^D$ of inducing variables.
 Note that for the convolution process $f$, we introduce one set of inducing variables $\rv{u_{f, d}}$ per output $f_d$.
-These inducing play a crucial role in sharing information between the different outputs.
+These variables play a crucial role in sharing information between the different outputs.
 
 
 \subsection{Variational Lower Bound}
@@ -264,7 +264,7 @@ Resubstituting this result into \cref{eq:svgp_log_likelihood} yields the final b
     &- \frac{1}{2\sigma_a^2} \Fun*{\tr}{\mat{K_{au}}\mat{K_{uu}}\inv\mat{S}\mat{K_{uu}}\inv\mat{K_{ua}}}.
 \end{split}
 \end{align}
-This bound, which depends on the hyperparameters of the kernel and likelihood $\left\{ \mat{\theta}, \sigma_a \right\}$ and the variational parameters $\left\{\mat{Z}, \mat{m}, \mat{S} \right\}$, can be calculated in $\Oh(NM^2)$ time.
+This bound, which depends on the hyper parameters of the kernel and likelihood $\left\{ \mat{\theta}, \sigma_a \right\}$ and the variational parameters $\left\{\mat{Z}, \mat{m}, \mat{S} \right\}$, can be calculated in $\Oh(NM^2)$ time.
 It factorizes along the data points which enables stochastic optimization.
 
 In order to obtain a bound on the full model, we apply the same techniques to the other processes.
@@ -312,7 +312,7 @@ This yields
     &+ \Moment*{\E_{\Variat{\rv{a}}}}{\log \aProb{\rv{f} \given \rv{u_f}, \rv{a}}}.
 \end{split}
 \end{align}
-The expectations with respect to $\Variat{\rv{a}}$ involve expectations of kernel matrices, also called psi-statistics, in the same way as in \parencites{damianou_deep_2012} and are given by
+The expectations with respect to $\Variat{\rv{a}}$ involve expectations of kernel matrices, also called $\Psi$-statistics, in the same way as in \parencites{damianou_deep_2012} and are given by
 \begin{align}
 \begin{split}
     \label{eq:psi_statistics}
@@ -321,8 +321,8 @@ The expectations with respect to $\Variat{\rv{a}}$ involve expectations of kerne
     \mat{\Phi_f} &= \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}. \\
 \end{split}
 \end{align}
-These psi-statistics can be computed analytically for multiple kernels, including the squared exponential kernel in \cref{eq:rbf_kernel}.
-In \cref{subsec:kernel_expectations} we show closed-form solutions for these psi-statistics for the implicit kernel defined in the CP layer.
+These $\Psi$-statistics can be computed analytically for multiple kernels, including the squared exponential kernel in \cref{eq:rbf_kernel}.
+In \cref{subsec:kernel_expectations} we show closed-form solutions for these $\Psi$-statistics for the implicit kernel defined in the CP layer.
 To obtain the final formulation of the desired bound for $\log \Prob{\rv{f} \given \rv{u_f}}$ we substitute \cref{eq:psi_statistics} into \cref{eq:f_marginal_likelihood} and get the analytically tractable bound
 \begin{align}
 \begin{split}
@@ -360,7 +360,7 @@ In order to derive a bound for $\log \Prob{\rv{y} \given \rv{u_g}}$ we apply the
 \subsection{Convolution Kernel Expectations}
 \label{subsec:kernel_expectations}
 In \cref{sec:model} we assumed the latent processes $w_r$ to be white noise processes and the smoothing kernel functions $T_{d, r}$ to be squared exponential kernels, leading to an explicit closed form formulation for the covariance between outputs shown in \cref{eq:dependent_kernel}.
-In this section, we derive the psi-statistics for this generalized squared exponential kernel needed to evaluate \cref{eq:full_bound}.
+In this section, we derive the $\Psi$-statistics for this generalized squared exponential kernel needed to evaluate \cref{eq:full_bound}.
 
 The uncertainty about the first layer is captured by the variational distribution of the latent alignments $\rv{a}$ given by $\Variat{\rv{a}} \sim \Gaussian{\mat{\mu_a}, \mat{\Sigma_a}}\text{, with } \rv{a} = \left( \rv{a_1}, \dots, \rv{a_d} \right)$.
 Every aligned point in $\rv{a}$ corresponds to one output of $\rv{f}$ and ultimately to one of the $\rv{y_i}$.
@@ -380,7 +380,7 @@ The expectation $\mat{\Psi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{fu}}}$ co
     \MoveEqLeft \exp\left(-\frac{1}{2} \frac{(\mat{\Sigma_a})_{nn}\inv\hat{\ell}_{ni}}{(\mat{\Sigma_a})_{nn}\inv + \hat{\ell}_{ni}} \left((\mat{\mu_a})_n - \mat{u_i}\right)^2\right),
 \end{split}
 \end{align}
-where $\hat{\ell}_{ni}$ is the combined lengthscale corresponding to the same kernel as $\hat{\sigma}_{ni}$.
+where $\hat{\ell}_{ni}$ is the combined length scale corresponding to the same kernel as $\hat{\sigma}_{ni}$.
 Lastly, $\mat{\Phi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}$ connects alignments and pairs of pseudo inputs with the closed form
 \begin{align}
 \begin{split}
@@ -391,14 +391,14 @@ Lastly, $\mat{\Phi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}
 \end{split}
 \end{align}
 
-Note that the psi-statistics factorize along the data and we only need to consider the diagonal entries of $\mat{\Sigma_a}$.
-If all the data belong to the same output, the psi-statistics of the standard squared exponential kernel can be recovered as a special case.
+Note that the $\Psi$-statistics factorize along the data and we only need to consider the diagonal entries of $\mat{\Sigma_a}$.
+If all the data belong to the same output, the $\Psi$-statistics of the standard squared exponential kernel can be recovered as a special case.
 It is used to propagate the uncertainties through the output-specific warpings $\rv{g}$.
 
 
 \subsection{Approximative Predictions}
 \label{subsec:predictions}
-Using the variational lower bound in \cref{eq:full_bound}, our model can be fitted to data, resulting in appropriate choices of the kernel hyperparameters and variational parameters.
+Using the variational lower bound in \cref{eq:full_bound}, our model can be fitted to data, resulting in appropriate choices of the kernel hyper parameters and variational parameters.
 Now assume we want to predict approximate function values $\mat{g_{d, \star}}$ for previously unseen points $\mat{X_{d, \star}}$ associated with output $d$, which are given by $ \mat{g_{d, \star}} = g_d(f_d(a_d(\mat{X_{d, \star}})))$.
 
 Because of the conditional independence assumptions in the model, other outputs $d^\prime \neq d$ only have to be considered in the shared layer $\rv{f}$.
@@ -507,7 +507,7 @@ We subsequently removed parts of the training sets to explore the generalization
 We use this setup to train our model using squared exponential kernels both in the conditionally independent GPs $\rv{a_i}$ and $\rv{g_i}$ and as smoothing kernels in $\rv{f}$.
 We can always choose one alignment and one warping to be the identity function in order to constrain the shared latent spaces $\rv{a}$ and $\rv{f}$ and provide a reference the other alignments and warpings will be relative to.
 Since we assume our artificial data simulates a physical system, we apply the prior knowledge that the alignment and warping processes have slower dynamics compared to the shared latent function which should capture most of the observed dynamics.
-To this end we applied priors to the $\rv{a_i}$ and $\rv{g_i}$ which prefer longer lengthscales and smaller variances compared to $\rv{f}$.
+To this end we applied priors to the $\rv{a_i}$ and $\rv{g_i}$ which prefer longer length scales and smaller variances compared to $\rv{f}$.
 Otherwise, the model could find local minima like choosing the upper two layers to be identitiy functions and model the time series independently in the $\rv{g_i}$.
 Additionally, assuming identity functions as mean functions for the different GPs prevents pathological cases in which the complete model collapses to a constant function \parencite{salimbeni_doubly_2017}.
 
@@ -518,7 +518,7 @@ An independently trained GP, as seen in \cref{fig:toy_joint}, reverts to the pri
 
 Inspecting the recovered decomposition, it can be seen that almost no uncertainty is placed in the quadratic alignment of the second time series at places where no data is available.
 Indeed, the model tends to add uncertainty as late in the hierarchy as possible.
-Intuitively, the model is highly incentivised to do this because in general uncertainty introduced into a prediction tends to inflate throughout the hierarchy.
+Intuitively, the model is highly incentivized to do this because in general uncertainty introduced into a prediction tends to inflate throughout the hierarchy.
 If the model is uncertain about the alignments, it will most likely be forced to be more uncertain about the shared function, which in the end penalizes the bound in \cref{eq:full_bound} via the fit term more than an overly confident model.
 
 
@@ -544,7 +544,7 @@ If the model is uncertain about the alignments, it will most likely be forced to
     }
     \label{fig:wind_samples}
 \end{figure}
-This experiment is based on data recorded from a pair of neighbored wind turbines in a wind park.
+This experiment is based on real data recorded from a pair of neighbored wind turbines in a wind farm.
 The two time series $\T_1$ and $\T_2$ shown in gray in \cref{fig:wind_joint_model} record the respective power generation of the two turbines over the course of one and a half hours, which was smoothed slightly using a rolling average over 60 seconds.
 There are 5400 data points for the first blue turbine and 4622 data points for the second green turbine.
 We removed two intervals (drawn as dashed lines) from the second turbine's dataset to inspect the behaviour of the model with missing data.
@@ -559,7 +559,7 @@ We apply our model to this problem in order to simultaneously learn an uncertain
 Because we do not model the local parts of the signals, they need to be explained by the Gaussian likelihood noise term.
 We use a squared exponential kernel as a prior for the alignment functions $\rv{a_i}$ and as smoothing kernels in $\rv{f}$.
 Because this dataset consists of only one dimension and we chose a situation where the two turbines do not influence each other, we can assume the output warpings $\rv{g_i}$ to be linear functions.
-Again we encode a preference for alignments with slow dynamics with a prior on the lengthscales of $\rv{a_i}$.
+Again we encode a preference for alignments with slow dynamics with a prior on the length scales of $\rv{a_i}$.
 
 Because the signal is modulated with local autoregressive components, plausible alignments are not unique.
 To constrain our model, we want it to prefer alignments close to the identity function which we chose as a prior mean function.
@@ -569,7 +569,7 @@ The model can therefore choose the posterior distribution of $\rv{u_a}$ in a way
 To prevent this, we also place a prior on the mean of $\rv{u_a}$ to remove this degree of freedom.
 
 \Cref{fig:wind_joint_model} shows the joint model learned from the data in which $a_1$ is chosen to be the identity function.
-The possible alignments identified match the physical conditions of the wind park.
+The possible alignments identified match the physical conditions of the wind farm.
 For the given turbines, time offsets of up to seven minutes are plausible and for most wind conditions, the offset is expected to be close to zero.
 In order to calculate the posteriors, the uncertainty about the correct alignment in only propagated through the shared function for $\T_2$, leading to smoother predictions.
 For areas where the alignment is quite certain however, the two time series are explained with comparable detail.
@@ -707,7 +707,7 @@ Resubstituting this result into \cref{app:eq:svgp_log_likelihood} yields the fin
     &- \frac{1}{2\sigma_a^2} \Fun*{\tr}{\mat{K_{au}}\mat{K_{uu}}\inv\mat{S}\mat{K_{uu}}\inv\mat{K_{ua}}}.
 \end{split}
 \end{align}
-This bound, which depends on the hyperparameters of the kernel and likelihood $\left\{ \mat{\theta}, \sigma_a \right\}$ and the variational parameters $\left\{\mat{Z}, \mat{m}, \mat{S} \right\}$, can be calculated in $\Oh(NM^2)$ time.
+This bound, which depends on the hyper parameters of the kernel and likelihood $\left\{ \mat{\theta}, \sigma_a \right\}$ and the variational parameters $\left\{\mat{Z}, \mat{m}, \mat{S} \right\}$, can be calculated in $\Oh(NM^2)$ time.
 It factorizes along the data points which enables stochastic optimization.
 
 In order to obtain a bound on the full model, we apply the same techniques to the other processes.
@@ -768,7 +768,7 @@ This yields
 \end{split}
 \end{align}
 where we apply Fubini's theorem to exchange the order of integration in the expected values.
-The expectations with respect to $\Variat{\rv{a}}$ involve expectations of kernel matrices, also called psi-statistics, in the same way as in \parencites{damianou_deep_2012} and are given by
+The expectations with respect to $\Variat{\rv{a}}$ involve expectations of kernel matrices, also called $\Psi$-statistics, in the same way as in \parencites{damianou_deep_2012} and are given by
 \begin{align}
 \begin{split}
     \label{app:eq:psi_statistics}
@@ -777,8 +777,8 @@ The expectations with respect to $\Variat{\rv{a}}$ involve expectations of kerne
     \mat{\Phi_f} &= \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}. \\
 \end{split}
 \end{align}
-These psi-statistics can be computed analytically for multiple kernels, including the squared exponential kernel in \cref{app:eq:rbf_kernel}.
-In \cref{app:subsec:kernel_expectations} we show closed-form solutions for these psi-statistics for the implicit kernel defined in the CP layer.
+These $\Psi$-statistics can be computed analytically for multiple kernels, including the squared exponential kernel in \cref{app:eq:rbf_kernel}.
+In \cref{app:subsec:kernel_expectations} we show closed-form solutions for these $\Psi$-statistics for the implicit kernel defined in the CP layer.
 To obtain the final formulation of the desired bound for $\log \Prob{\rv{f} \given \rv{u_f}}$ we substitute \cref{app:eq:psi_statistics} into \cref{app:eq:f_marginal_likelihood} and get the analytically tractable bound
 \begin{align}
 \begin{split}
@@ -818,7 +818,7 @@ In order to derive a bound for $\log \Prob{\rv{y} \given \rv{u_g}}$ we apply the
 \subsection{Convolution Kernel Expectations}
 \label{app:subsec:kernel_expectations}
 In \cref{sec:model} we assumed the latent processes $w_r$ to be white noise processes and the smoothing kernel functions $T_{d, r}$ to be squared exponential kernels, leading to an explicit closed form formulation for the covariance between outputs shown in \cref{app:eq:dependent_kernel}.
-In this section, we derive the psi-statistics for this generalized squared exponential kernel needed to evaluate \cref{app:eq:full_bound}.
+In this section, we derive the $\Psi$-statistics for this generalized squared exponential kernel needed to evaluate \cref{app:eq:full_bound}.
 
 The uncertainty about the first layer is captured by the variational distribution of the latent alignments $\rv{a}$ given by $\Variat{\rv{a}} \sim \Gaussian{\mat{\mu_a}, \mat{\Sigma_a}}\text{, with } \rv{a} = \left( \rv{a_1}, \dots, \rv{a_d} \right)$.
 Every aligned point in $\rv{a}$ corresponds to one output of $\rv{f}$ and ultimately to one of the $\rv{y_i}$.
@@ -846,7 +846,7 @@ The expectation $\mat{\Psi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{fu}}}$ co
     &\quad {} \cdot \exp\left(-\frac{1}{2} \frac{(\mat{\Sigma_a})_{nn}\inv\hat{\ell}_{ni}}{(\mat{\Sigma_a})_{nn}\inv + \hat{\ell}_{ni}} \left((\mat{\mu_a})_n - \mat{u_i}\right)^2\right)
 \end{split}
 \end{align}
-where $\hat{\ell}_{ni}$ is the combined lengthscale corresponding to the same kernel as $\hat{\sigma}_{ni}$.
+where $\hat{\ell}_{ni}$ is the combined length scale corresponding to the same kernel as $\hat{\sigma}_{ni}$.
 Lastly, $\mat{\Phi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}$ connects alignments and pairs of pseudo inputs with the closed form
 \begin{align}
 \begin{split}
@@ -860,8 +860,8 @@ Lastly, $\mat{\Phi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}
 \end{split}
 \end{align}
 
-Note that the psi-statistics factorize along the data and we only need to consider the diagonal entries of $\mat{\Sigma_a}$.
-If all the data belong to the same output, the psi-statistics of the standard squared exponential kernel can be recovered as a special case.
+Note that the $\Psi$-statistics factorize along the data and we only need to consider the diagonal entries of $\mat{\Sigma_a}$.
+If all the data belong to the same output, the $\Psi$-statistics of the standard squared exponential kernel can be recovered as a special case.
 It is used to propagate the uncertainties through the output-specific warpings $\rv{g}$.
 
 \end{document}
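
For reference, the kernel expectations renamed from "psi-statistics" to $\Psi$-statistics in the hunks above are, as defined in the surrounding text of the paper, expectations of kernel matrices under the variational distribution of the alignments. Written without the paper's custom macros, a sketch of the two statistics appearing in the hunks is
\begin{align}
    \Psi_f &= \mathbb{E}_{q(a)}\!\left[ K_{fu} \right],
    &
    \Phi_f &= \mathbb{E}_{q(a)}\!\left[ K_{uf} K_{fu} \right],
\end{align}
both of which admit closed-form solutions for the squared exponential kernel and, as the convolution kernel expectation subsections state, for the generalized squared exponential kernel of the CP layer.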

zotero_export.bib (+32, -32)

@@ -1,8 +1,8 @@
 
 @article{damianou_deep_2012,
-	title = {Deep Gaussian Processes},
+	title = {Deep {G}aussian Processes},
 	url = {http://arxiv.org/abs/1211.0358},
-	abstract = {In this paper we introduce deep Gaussian process ({GP}) models. Deep {GPs} are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate {GP}. The inputs to that Gaussian process are then governed by another {GP}. A single layer model is equivalent to a standard {GP} or the {GP} latent variable model ({GP}-{LVM}). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.},
+	abstract = {In this paper we introduce deep {G}aussian process ({GP}) models. Deep {GPs} are a deep belief network based on {G}aussian process mappings. The data is modeled as the output of a multivariate {GP}. The inputs to that {G}aussian process are then governed by another {GP}. A single layer model is equivalent to a standard {GP} or the {GP} latent variable model ({GP}-{LVM}). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully {B}ayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.},
 	journaltitle = {{arXiv}:1211.0358 [cs, math, stat]},
 	author = {Damianou, Andreas C. and Lawrence, Neil D.},
 	urldate = {2016-09-05},
@@ -10,45 +10,45 @@
 	eprinttype = {arxiv},
 	eprint = {1211.0358},
 	keywords = {60G15, 58E30, Computer Science - Learning, G.1.2, G.3, I.2.6, Mathematics - Probability, Statistics - Machine Learning},
-	file = {arXiv\:1211.0358 PDF:/home/markus/sync/zotero/storage/BUXWE2UV/Damianou and Lawrence - 2012 - Deep Gaussian Processes.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/S2KB72DK/1211.html:text/html}
+	file = {arXiv\:1211.0358 PDF:/home/markus/sync/zotero/storage/BUXWE2UV/Damianou and Lawrence - 2012 - Deep {G}aussian Processes.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/S2KB72DK/1211.html:text/html}
 }
 
 @article{hensman_gaussian_2013,
-	title = {Gaussian Processes for Big Data},
+	title = {{G}aussian Processes for Big Data},
 	url = {http://arxiv.org/abs/1309.6835},
-	abstract = {We introduce stochastic variational inference for Gaussian process models. This enables the application of Gaussian process ({GP}) models to data sets containing millions of data points. We show how {GPs} can be vari- ationally decomposed to depend on a set of globally relevant inducing variables which factorize the model in the necessary manner to perform variational inference. Our ap- proach is readily extended to models with non-Gaussian likelihoods and latent variable models based around Gaussian processes. We demonstrate the approach on a simple toy problem and two real world data sets.},
+	abstract = {We introduce stochastic variational inference for {G}aussian process models. This enables the application of {G}aussian process ({GP}) models to data sets containing millions of data points. We show how {GPs} can be vari- ationally decomposed to depend on a set of globally relevant inducing variables which factorize the model in the necessary manner to perform variational inference. Our ap- proach is readily extended to models with non-{G}aussian likelihoods and latent variable models based around {G}aussian processes. We demonstrate the approach on a simple toy problem and two real world data sets.},
 	journaltitle = {{arXiv}:1309.6835 [cs, stat]},
 	author = {Hensman, James and Fusi, Nicolo and Lawrence, Neil D.},
 	urldate = {2016-07-06},
 	date = {2013-09-26},
 	keywords = {Computer Science - Learning, Statistics - Machine Learning},
-	file = {arXiv\:1309.6835 PDF:/home/markus/sync/zotero/storage/EU8WZFR4/Hensman et al. - 2013 - Gaussian Processes for Big Data.pdf:application/pdf;arXiv\:1309.6835 PDF:/home/markus/sync/zotero/storage/XV3VH9PJ/Hensman et al. - 2013 - Gaussian Processes for Big Data.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/2JAR4BNM/1309.html:text/html;arXiv.org Snapshot:/home/markus/sync/zotero/storage/ISZ4Z86Q/1309.html:text/html}
+	file = {arXiv\:1309.6835 PDF:/home/markus/sync/zotero/storage/EU8WZFR4/Hensman et al. - 2013 - {G}aussian Processes for Big Data.pdf:application/pdf;arXiv\:1309.6835 PDF:/home/markus/sync/zotero/storage/XV3VH9PJ/Hensman et al. - 2013 - {G}aussian Processes for Big Data.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/2JAR4BNM/1309.html:text/html;arXiv.org Snapshot:/home/markus/sync/zotero/storage/ISZ4Z86Q/1309.html:text/html}
 }
 
 @inproceedings{titsias_bayesian_2010,
-	title = {Bayesian Gaussian process latent variable model},
+	title = {{B}ayesian {G}aussian process latent variable model},
 	url = {http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_TitsiasL10.pdf},
 	pages = {844--851},
 	booktitle = {International Conference on Artificial Intelligence and Statistics},
 	author = {Titsias, Michalis K. and Lawrence, Neil D.},
 	urldate = {2016-02-01},
 	date = {2010},
-	file = {[PDF] von wustl.edu:/home/markus/sync/zotero/storage/5HPG3ZG9/Titsias and Lawrence - 2010 - Bayesian Gaussian process latent variable model.pdf:application/pdf}
+	file = {[PDF] von wustl.edu:/home/markus/sync/zotero/storage/5HPG3ZG9/Titsias and Lawrence - 2010 - {B}ayesian {G}aussian process latent variable model.pdf:application/pdf}
 }
 
 @inproceedings{lazaro-gredilla_bayesian_2012,
-	title = {Bayesian warped Gaussian processes},
+	title = {{B}ayesian warped {G}aussian processes},
 	url = {http://papers.nips.cc/paper/4494-bayesian-warped-gaussian-processes},
 	pages = {1619--1627},
 	booktitle = {Advances in Neural Information Processing Systems},
 	author = {Lázaro-Gredilla, Miguel},
 	urldate = {2016-12-06},
 	date = {2012},
-	file = {[PDF] wustl.edu:/home/markus/sync/zotero/storage/NTS9SDBA/Lázaro-Gredilla - 2012 - Bayesian warped Gaussian processes.pdf:application/pdf;Snapshot:/home/markus/sync/zotero/storage/HFAKAI4X/4494-bayesian-warped-gaussian-processes.html:text/html}
+	file = {[PDF] wustl.edu:/home/markus/sync/zotero/storage/NTS9SDBA/Lázaro-Gredilla - 2012 - {B}ayesian warped {G}aussian processes.pdf:application/pdf;Snapshot:/home/markus/sync/zotero/storage/HFAKAI4X/4494-bayesian-warped-gaussian-processes.html:text/html}
 }
 
 @inproceedings{boyle_dependent_2004,
-	title = {Dependent Gaussian Processes.},
+	title = {Dependent {G}aussian Processes.},
 	volume = {17},
 	url = {https://papers.nips.cc/paper/2561-dependent-gaussian-processes.pdf},
 	pages = {217--224},
@@ -56,14 +56,14 @@
 	author = {Boyle, Phillip and Frean, Marcus R.},
 	urldate = {2017-01-27},
 	date = {2004},
-	file = {[PDF] nips.cc:/home/markus/sync/zotero/storage/HJT7BPIT/Boyle and Frean - 2004 - Dependent Gaussian Processes..pdf:application/pdf}
+	file = {[PDF] nips.cc:/home/markus/sync/zotero/storage/HJT7BPIT/Boyle and Frean - 2004 - Dependent {G}aussian Processes..pdf:application/pdf}
 }
 
 @article{alvarez_kernels_2011,
 	title = {Kernels for Vector-Valued Functions: a Review},
 	url = {http://arxiv.org/abs/1106.6251},
 	shorttitle = {Kernels for Vector-Valued Functions},
-	abstract = {Kernel methods are among the most popular techniques in machine learning. From a frequentist/discriminative perspective they play a central role in regularization theory as they provide a natural choice for the hypotheses space and the regularization functional through the notion of reproducing kernel Hilbert spaces. From a Bayesian/generative perspective they are the key in the context of Gaussian processes, where the kernel function is also known as the covariance function. Traditionally, kernel methods have been used in supervised learning problem with scalar outputs and indeed there has been a considerable amount of work devoted to designing and learning kernels. More recently there has been an increasing interest in methods that deal with multiple outputs, motivated partly by frameworks like multitask learning. In this paper, we review different methods to design or learn valid kernel functions for multiple outputs, paying particular attention to the connection between probabilistic and functional methods.},
+	abstract = {Kernel methods are among the most popular techniques in machine learning. From a frequentist/discriminative perspective they play a central role in regularization theory as they provide a natural choice for the hypotheses space and the regularization functional through the notion of reproducing kernel Hilbert spaces. From a {B}ayesian/generative perspective they are the key in the context of {G}aussian processes, where the kernel function is also known as the covariance function. Traditionally, kernel methods have been used in supervised learning problem with scalar outputs and indeed there has been a considerable amount of work devoted to designing and learning kernels. More recently there has been an increasing interest in methods that deal with multiple outputs, motivated partly by frameworks like multitask learning. In this paper, we review different methods to design or learn valid kernel functions for multiple outputs, paying particular attention to the connection between probabilistic and functional methods.},
 	journaltitle = {{arXiv}:1106.6251 [cs, math, stat]},
 	author = {Alvarez, Mauricio A. and Rosasco, Lorenzo and Lawrence, Neil D.},
 	urldate = {2017-02-06},
@@ -75,9 +75,9 @@
 }
 
 @article{hensman_scalable_2014,
-	title = {Scalable Variational Gaussian Process Classification},
+	title = {Scalable Variational {G}aussian Process Classification},
 	url = {http://arxiv.org/abs/1411.2005},
-	abstract = {Gaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, outperforming the state of the art on benchmark datasets. Importantly, the variational formulation can be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.},
+	abstract = {{G}aussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, outperforming the state of the art on benchmark datasets. Importantly, the variational formulation can be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.},
 	journaltitle = {{arXiv}:1411.2005 [stat]},
 	author = {Hensman, James and Matthews, Alex and Ghahramani, Zoubin},
 	urldate = {2017-02-13},
@@ -85,19 +85,19 @@
 	eprinttype = {arxiv},
 	eprint = {1411.2005},
 	keywords = {Statistics - Machine Learning},
-	file = {arXiv\:1411.2005 PDF:/home/markus/sync/zotero/storage/T4WFAQPK/Hensman et al. - 2014 - Scalable Variational Gaussian Process Classificati.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/5GEKF8R7/1411.html:text/html}
+	file = {arXiv\:1411.2005 PDF:/home/markus/sync/zotero/storage/T4WFAQPK/Hensman et al. - 2014 - Scalable Variational {G}aussian Process Classificati.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/5GEKF8R7/1411.html:text/html}
 }
 
 @report{boyle_multiple_2005,
 	title = {Multiple output gaussian process regression},
-	abstract = {Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1},
+	abstract = {{G}aussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat {G}aussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend {G}aussian processes to handle multiple, coupled outputs. 1},
 	author = {Boyle, Phillip and Frean, Marcus and Boyle, Phillip and Frean, Marcus},
 	date = {2005},
 	file = {Citeseer - Full Text PDF:/home/markus/sync/zotero/storage/STU7NV59/Boyle et al. - 2005 - Multiple output gaussian process regression.pdf:application/pdf;Citeseer - Snapshot:/home/markus/sync/zotero/storage/ZWMMCM3F/summary.html:text/html}
 }
 
 @inproceedings{alvarez_efficient_2010,
-	title = {Efficient Multioutput Gaussian Processes through Variational Inducing Kernels.},
+	title = {Efficient Multioutput {G}aussian Processes through Variational Inducing Kernels.},
 	volume = {9},
 	url = {http://www.jmlr.org/proceedings/papers/v9/alvarez10a/alvarez10a.pdf},
 	pages = {25--32},
@@ -105,11 +105,11 @@
 	author = {Alvarez, Mauricio A. and Luengo, David and Titsias, Michalis K. and Lawrence, Neil D.},
 	urldate = {2017-03-02},
 	date = {2010},
-	file = {[PDF] jmlr.org:/home/markus/sync/zotero/storage/6Q4I9FRF/Alvarez et al. - 2010 - Efficient Multioutput Gaussian Processes through V.pdf:application/pdf}
+	file = {[PDF] jmlr.org:/home/markus/sync/zotero/storage/6Q4I9FRF/Alvarez et al. - 2010 - Efficient Multioutput {G}aussian Processes through V.pdf:application/pdf}
 }
 
 @inproceedings{titsias_variational_2009,
-	title = {Variational Learning of Inducing Variables in Sparse Gaussian Processes.},
+	title = {Variational Learning of Inducing Variables in Sparse {G}aussian Processes.},
 	volume = {5},
 	url = {http://www.jmlr.org/proceedings/papers/v5/titsias09a/titsias09a.pdf},
 	pages = {567--574},
@@ -121,9 +121,9 @@
 }
 
 @article{salimbeni_doubly_2017,
-	title = {Doubly Stochastic Variational Inference for Deep Gaussian Processes},
+	title = {Doubly Stochastic Variational Inference for Deep {G}aussian Processes},
 	url = {http://arxiv.org/abs/1705.08933},
-	abstract = {Gaussian processes ({GPs}) are a good choice for function approximation as they are flexible, robust to over-fitting, and provide well-calibrated predictive uncertainty. Deep Gaussian processes ({DGPs}) are multi-layer generalisations of {GPs}, but inference in these models has proved challenging. Existing approaches to inference in {DGP} models assume approximate posteriors that force independence between the layers, and do not work well in practice. We present a doubly stochastic variational inference algorithm, which does not force independence between layers. With our method of inference we demonstrate that a {DGP} model can be used effectively on data ranging in size from hundreds to a billion points. We provide strong empirical evidence that our inference scheme for {DGPs} works well in practice in both classification and regression.},
+	abstract = {{G}aussian processes ({GPs}) are a good choice for function approximation as they are flexible, robust to over-fitting, and provide well-calibrated predictive uncertainty. Deep {G}aussian processes ({DGPs}) are multi-layer generalisations of {GPs}, but inference in these models has proved challenging. Existing approaches to inference in {DGP} models assume approximate posteriors that force independence between the layers, and do not work well in practice. We present a doubly stochastic variational inference algorithm, which does not force independence between layers. With our method of inference we demonstrate that a {DGP} model can be used effectively on data ranging in size from hundreds to a billion points. We provide strong empirical evidence that our inference scheme for {DGPs} works well in practice in both classification and regression.},
 	journaltitle = {{arXiv}:1705.08933 [stat]},
 	author = {Salimbeni, Hugh and Deisenroth, Marc},
 	urldate = {2017-06-02},
@@ -135,20 +135,20 @@
 }
 
 @inproceedings{alvarez_sparse_2009,
-	title = {Sparse convolved Gaussian processes for multi-output regression},
+	title = {Sparse convolved {G}aussian processes for multi-output regression},
 	url = {http://papers.nips.cc/paper/3553-sparse-convolved-gaussian-processes-for-multi-output-regression},
 	pages = {57--64},
 	booktitle = {Advances in neural information processing systems},
 	author = {Alvarez, Mauricio and Lawrence, Neil D.},
 	urldate = {2017-07-14},
 	date = {2009},
-	file = {[PDF] nips.cc:/home/markus/sync/zotero/storage/SIZMYY5F/Alvarez and Lawrence - 2009 - Sparse convolved Gaussian processes for multi-outp.pdf:application/pdf;Snapshot:/home/markus/sync/zotero/storage/A2QU9XT7/3553-sparse-convolved-gaussian-processes-for-multi-output-regression.html:text/html}
+	file = {[PDF] nips.cc:/home/markus/sync/zotero/storage/SIZMYY5F/Alvarez and Lawrence - 2009 - Sparse convolved {G}aussian processes for multi-outp.pdf:application/pdf;Snapshot:/home/markus/sync/zotero/storage/A2QU9XT7/3553-sparse-convolved-gaussian-processes-for-multi-output-regression.html:text/html}
 }
 
 @article{hensman_nested_2014,
-	title = {Nested Variational Compression in Deep Gaussian Processes},
+	title = {Nested Variational Compression in Deep {G}aussian Processes},
 	url = {http://arxiv.org/abs/1412.1370},
-	abstract = {Deep Gaussian processes provide a flexible approach to probabilistic modelling of data using either supervised or unsupervised learning. For tractable inference approximations to the marginal likelihood of the model must be made. The original approach to approximate inference in these models used variational compression to allow for approximate variational marginalization of the hidden variables leading to a lower bound on the marginal likelihood of the model [Damianou and Lawrence, 2013]. In this paper we extend this idea with a nested variational compression. The resulting lower bound on the likelihood can be easily parallelized or adapted for stochastic variational inference.},
+	abstract = {Deep {G}aussian processes provide a flexible approach to probabilistic modelling of data using either supervised or unsupervised learning. For tractable inference approximations to the marginal likelihood of the model must be made. The original approach to approximate inference in these models used variational compression to allow for approximate variational marginalization of the hidden variables leading to a lower bound on the marginal likelihood of the model [Damianou and Lawrence, 2013]. In this paper we extend this idea with a nested variational compression. The resulting lower bound on the likelihood can be easily parallelized or adapted for stochastic variational inference.},
 	journaltitle = {{arXiv}:1412.1370 [stat]},
 	author = {Hensman, James and Lawrence, Neil D.},
 	urldate = {2017-07-19},
@@ -156,13 +156,13 @@
 	eprinttype = {arxiv},
 	eprint = {1412.1370},
 	keywords = {Statistics - Machine Learning},
-	file = {arXiv\:1412.1370 PDF:/home/markus/sync/zotero/storage/ZKNA6NYN/Hensman and Lawrence - 2014 - Nested Variational Compression in Deep Gaussian Pr.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/UMQ96R94/1412.html:text/html}
+	file = {arXiv\:1412.1370 PDF:/home/markus/sync/zotero/storage/ZKNA6NYN/Hensman and Lawrence - 2014 - Nested Variational Compression in Deep {G}aussian Pr.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/UMQ96R94/1412.html:text/html}
 }
 
 @article{snoek_input_2014,
-	title = {Input Warping for Bayesian Optimization of Non-stationary Functions},
+	title = {Input Warping for {B}ayesian Optimization of Non-stationary Functions},
 	url = {http://arxiv.org/abs/1402.0929},
-	abstract = {Bayesian optimization has proven to be a highly effective methodology for the global optimization of unknown, expensive and multimodal functions. The ability to accurately model distributions over functions is critical to the effectiveness of Bayesian optimization. Although Gaussian processes provide a flexible prior over functions which can be queried efficiently, there are various classes of functions that remain difficult to model. One of the most frequently occurring of these is the class of non-stationary functions. The optimization of the hyperparameters of machine learning algorithms is a problem domain in which parameters are often manually transformed a priori, for example by optimizing in "log-space," to mitigate the effects of spatially-varying length scale. We develop a methodology for automatically learning a wide family of bijective transformations or warpings of the input space using the Beta cumulative distribution function. We further extend the warping framework to multi-task Bayesian optimization so that multiple tasks can be warped into a jointly stationary space. On a set of challenging benchmark optimization tasks, we observe that the inclusion of warping greatly improves on the state-of-the-art, producing better results faster and more reliably.},
+	abstract = {{B}ayesian optimization has proven to be a highly effective methodology for the global optimization of unknown, expensive and multimodal functions. The ability to accurately model distributions over functions is critical to the effectiveness of {B}ayesian optimization. Although {G}aussian processes provide a flexible prior over functions which can be queried efficiently, there are various classes of functions that remain difficult to model. One of the most frequently occurring of these is the class of non-stationary functions. The optimization of the hyperparameters of machine learning algorithms is a problem domain in which parameters are often manually transformed a priori, for example by optimizing in "log-space," to mitigate the effects of spatially-varying length scale. We develop a methodology for automatically learning a wide family of bijective transformations or warpings of the input space using the Beta cumulative distribution function. We further extend the warping framework to multi-task {B}ayesian optimization so that multiple tasks can be warped into a jointly stationary space. On a set of challenging benchmark optimization tasks, we observe that the inclusion of warping greatly improves on the state-of-the-art, producing better results faster and more reliably.},
 	journaltitle = {{arXiv}:1402.0929 [cs, stat]},
 	author = {Snoek, Jasper and Swersky, Kevin and Zemel, Richard S. and Adams, Ryan P.},
 	urldate = {2017-07-31},
@@ -170,11 +170,11 @@
 	eprinttype = {arxiv},
 	eprint = {1402.0929},
 	keywords = {Computer Science - Learning, Statistics - Machine Learning},
-	file = {arXiv\:1402.0929 PDF:/home/markus/sync/zotero/storage/ELZ9CMFF/Snoek et al. - 2014 - Input Warping for Bayesian Optimization of Non-sta.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/5ECB3EET/1402.html:text/html}
+	file = {arXiv\:1402.0929 PDF:/home/markus/sync/zotero/storage/ELZ9CMFF/Snoek et al. - 2014 - Input Warping for {B}ayesian Optimization of Non-sta.pdf:application/pdf;arXiv.org Snapshot:/home/markus/sync/zotero/storage/5ECB3EET/1402.html:text/html}
 }
 
 @article{matthews_gpflow:_2017,
-	title = {{GPflow}: A Gaussian process library using {TensorFlow}},
+	title = {{GPflow}: A {G}aussian process library using {TensorFlow}},
 	volume = {18},
 	url = {http://www.jmlr.org/papers/volume18/16-537/16-537.pdf},
 	shorttitle = {{GPflow}},
@@ -184,5 +184,5 @@
 	author = {Matthews, Alexander G. de G. and van der Wilk, Mark and Nickson, Tom and Fujii, Keisuke and Boukouvalas, Alexis and León-Villagrá, Pablo and Ghahramani, Zoubin and Hensman, James},
 	urldate = {2017-09-27},
 	date = {2017},
-	file = {Full Text:/home/markus/sync/zotero/storage/X6QGAFR8/Matthews et al. - 2017 - GPflow A Gaussian process library using TensorFlo.pdf:application/pdf}
+	file = {Full Text:/home/markus/sync/zotero/storage/X6QGAFR8/Matthews et al. - 2017 - GPflow A {G}aussian process library using TensorFlo.pdf:application/pdf}
 }
