
\begin{icmlauthorlist}
\icmlauthor{Markus Kaiser}{tum,siemens}
\icmlauthor{Clemens Otte}{siemens}
\icmlauthor{Thomas Runkler}{tum,siemens}
\icmlauthor{Carl Henrik Ek}{bristol}
\end{icmlauthorlist}

\icmlaffiliation{tum}{Department of Computer Science, Technical University of Munich, Germany}


\printAffiliationsAndNotice{}

\begin{abstract}

  We present a Bayesian extension to convolution processes which defines a representation between multiple functions via an embedding in a shared latent space. The proposed model allows for both arbitrary alignments of the inputs and nonparametric output warpings to transform the observations. This gives rise to multiple deep Gaussian process models connected via latent generating processes. We derive an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.

\end{abstract}


\section{Introduction}
Gaussian processes (GPs) \cite{Rasmussen:2005te} are flexible yet informative priors which provide structure over the space of functions, allowing us to learn from small amounts of data. Due to the infinite support of the Gaussian distribution, a GP is a general function approximator. However, for many types of problems it is challenging to use all our existing knowledge when considering only a single function, as this knowledge is often comprised of information relating to a single level of a hierarchical or composite function. To retain the benefits of a GP, the priors can be composed hierarchically in what is known as a \citetitle{damianou_deep_2012} \cite{damianou_deep_2012}. Importantly, a hierarchical model does not provide additional representational power; rather the opposite: each added layer allows us to specify an ever more restrictive prior, providing additional structure on our solution space and allowing us to learn from smaller amounts of data.



In a traditional GP setting, the outputs are considered conditionally independent given the input, which significantly reduces the computational cost. In many scenarios, however, we want to directly parametrise the interdependencies between the dimensions. Modelling these explicitly allows us to infer dimensions for which only partial observations exist. One approach is to consider a linear dependency \cite{alvarez_kernels_2011}, while a more general approach is to consider full convolutions as in \cite{boyle_dependent_2004}.
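As a brief sketch of the convolutional construction (the smoothing kernels $G_d$ and the shared latent process $w$ are notation introduced here for illustration), each output is obtained by convolving a latent process:
\[
f_d(\mat{x}) = \int G_d(\mat{x} - \mat{z})\, w(\mat{z}) \,\mathrm{d}\mat{z}.
\]
Because all outputs depend on the same latent process $w$, observing one output carries information about the others.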



The fundamental assumption underlying a GP is that the instantiations of the function are jointly Gaussian. Due to the marginalisation property of the Gaussian, this is on the one hand very beneficial, as it leads to the simple and elegant inference associated with these models, but it can also be a challenge, as it is a rather restrictive assumption. One way to proceed is to first map, or \emph{warp}, the data closer to the Gaussian assumption, an approach referred to as \citetitle{NIPS2003_2481} \cite{NIPS2003_2481,lazarogredilla_bayesian_2012}.
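As a sketch (with notation introduced here), a warped GP passes a latent GP $f$ through a monotonic warping $g$ to produce the observations:
\[
y = g\bigl(f(\mat{x})\bigr), \qquad f \sim \mathcal{GP}\bigl(0, k(\mat{x}, \mat{x}')\bigr),
\]
so that the data, after applying the inverse warping, are closer to being jointly Gaussian.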



In this paper we marry the benefits of all these approaches and propose a hierarchical, warped and multi-output Gaussian process. We derive an efficient learning scheme via an approximation to the marginal likelihood which allows us to fully exploit the regularisation provided by our structure. The model we propose is highly interpretable, able to learn from small amounts of data, and generalises to a large range of different problem settings.



\section{Model Definition}
\label{sec:model}
We are interested in formulating shared priors over multiple functions $\Set{f_d}_{d=1}^D$ using Gaussian processes (GPs).

There are multiple ways of formulating such priors besides assuming independence of the different $f_d$ and placing a separate GP prior on each of them.
In a common approach known as the \emph{linear model of coregionalization (LMC)} \cite{alvarez_kernels_2011}, the different functions are assumed to be linear combinations of one or more \emph{latent processes} $\Set{w_r}_{r=1}^R$ on which independent GP priors are placed.
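Written out (with mixing weights $a_{d,r}$, notation introduced here for illustration), the LMC takes the form
\[
f_d(\mat{x}) = \sum_{r=1}^{R} a_{d,r}\, w_r(\mat{x}),
\qquad
\operatorname{cov}\bigl(f_d(\mat{x}), f_{d'}(\mat{x}')\bigr) = \sum_{r=1}^{R} a_{d,r} a_{d',r}\, k_r(\mat{x}, \mat{x}'),
\]
where $k_r$ denotes the kernel of the $r$-th latent process.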

Different functions in samples drawn from such models behave very similarly, since for every $\mat{x}$, the $f_d$ are given by linear combinations of the (latent) point observations $\Set{w_r(\mat{x})}_{r=1}^R$,