You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

534 lines
39 KiB

% We use precompiled images and do not add tikz for speed of compilation.
% Show overfull boxes
% \overfullrule=5pt
% We set this for hyperref
\title{Dynamic Multi-Modal Deep Gaussian Processes}
\author{\href{}{Markus Kaiser}}
\aistatstitle{Dynamic Multi-Modal Deep Gaussian Processes}
\aistatsauthor{ Markus Kaiser \And Clemens Otte \And Thomas Runkler \And Carl Henrik Ek }
\aistatsaddress{ Technical University of Munich \\ Siemens AG \And Siemens AG \And Technical University of Munich \\ Siemens AG \And University of Bristol }]
Praedicimus ergo sumus.
\section{The Dynamic Multi-Modal Gaussian Process Model}
Our proposed graphical model.
The observations $(\mat{x_n}, \mat{y_n})$ are generated using exactly one of $K$ latent function values $\rv{f_n^{\pix{k}}}$ and corresponding likelihood $\mat{y_n^{\pix{k}}}$.
The assignment $\mat{a_n}$ to a latent function is driven by input-dependent weights $\mat{\alpha_n^{\pix{k}}}$ which encode the relative relevance of the different modes at $\mat{x_n}$.
The different parts of the model are determined by the yellow hyperparameters and blue hyperparameters.
The dynamic multi-modal Gaussian Process (DMGP) model assumes that there exist $K$ independent functions $\Set*{f^{\pix{k}}}_{k=1}^K$ (or \emph{modes}) which generate the pairs of observations $\D = \Set*{(\mat{x_n}, \mat{y_n})}_{n=1}^N$ in a data set.
Every data point is generated by evaluating one of the $K$ latent functions and adding Gaussian noise from a corresponding likelihood.
The assignment of the $\nth{n}$ data point to one of the modes is specified by the indicator vector $\mat{a_n} \in \Set*{0, 1}^K$ which has exactly one non-zero entry.
Our goal is to formulate simultaneous Bayesian inference on the functions $f^{\pix{k}}$ and the assignments $\mat{a_n}$.
For notational conciseness, we collect all inputs in the $N \times D_{\text{in}}$ matrix $\mat{X} = \left(\mat{x_1}, \ldots, \mat{x_N}\right)$ and all outputs in the $N \times D$ matrix $\mat{Y} = \left(\mat{y_1}, \ldots, \mat{y_N}\right)$.
We further denote the evaluation of the $\nth{k}$ latent function associated with the $\nth{n}$ data point as $\rv{f_n^{\pix{k}}} = \Fun{f^{\pix{k}}}{\mat{x_n}}$ and collect them analogously as $\mat{F^{\pix{k}}} = \left( \rv{f_1^{\pix{k}}}, \ldots, \rv{f_N^{\pix{k}}} \right)$ and $\mat{F} = \left( \mat{F^{\pix{1}}}, \ldots, \mat{F^{\pix{K}}} \right)$.
For consistency, we refer to the $\nth{k}$ entry in $\mat{a_n}$ as $a_n^{\pix{k}}$ and also collect these values as $\mat{A} = \left(\mat{a_1}, \ldots, \mat{a_N}\right)$.
Given the notation above, the marginal likelihood of the DMGP can be separated in the likelihood, the latent function processes and the assignment process and is given by
\Prob*{\mat{Y} \given \mat{X}} &= \\
\Prob*{\mat{Y} \given \mat{F}, \mat{A}}
\Prob*{\mat{F} \given \mat{X}}
\Prob*{\mat{A} \given \mat{X}}
\diff \mat{A} \diff \mat{F}\text{,} \\
\Prob*{\mat{Y} \given \mat{F}, \mat{A}} &= \prod_{k=1}^K \Gaussian*{\mat{Y} \given \given \mat{F^{\pix{k}}}, \left(\sigma^{\pix{k}}\right)^2\Eye}^{\mat{A^{\pix{k}}}},
where $\sigma^{\pix{k}}$ is the noise level of the $\nth{k}$ Gaussian likelihood.
Since we assume the $K$ modes to be independent given the data and assignments, we place independent GP priors on the latent functions,
\Prob*{\mat{F} \given \mat{X}} &= \prod_{k=1}^K \Gaussian*{\mat{F^{\pix{k}}} \given \Fun*{\mu^{\pix{k}}}{\mat{X}}, \Fun*{k^{\pix{k}}}{\mat{X}, \mat{X}}}.
Our prior on the assignment process is composite.
First, we assume that the $\mat{a_n}$ are drawn independently from multinomial distributions with logit parameters $\mat{\alpha_n} = \left( \alpha_n^{\pix{1}}, \ldots, \alpha_n^{\pix{K}} \right)$.
A usual approach to specifying the $\mat{\alpha_n}$ is to assume them to be known a priori and to be equal for all data points~\parencite{lazaro-gredilla_overlapping_2011}.
Instead, we want to infer them from the data.
Additionally, we assume they change dependent on the position in the input space.
This allows us to efficiently model data which, for example, is unimodal in some parts of the input space and bimodal in others.
We place independent GP priors on the $\mat{\alpha^{\pix{k}}}$ with
\Prob*{\mat{\alpha} \given \mat{X}} = \prod_{k=1}^K \Gaussian*{\rv{\alpha^{\pix{k}}} \given \mat{0}, \Fun{k_\alpha^{\pix{k}}}{\mat{X}, \mat{X}}}.
The prior on the assignments $\mat{A}$ is given by marginalizing the $\mat{\alpha^{\pix{k}}}$ which when normalized parametrize a batch of multinomial distributions
\Prob*{\mat{A} \given \mat{X}} &=
\Multinomial*{\mat{A} \given \Fun{\softmax}{\mat{\alpha}}} \Prob*{\mat{\alpha} \given \mat{X}}
\diff \rv{\alpha}.
Since the GPs of the $\mat{\alpha^{\pix{k}}}$ use a zero mean function, our prior assumption is a uniform distribution between the different modes everywhere in the input space.
If inference on the $\mat{a_n}$ reveals that, say, all data points at similar positions in the input space can be explained by the same $\nth{k}$ mode, the belief about $\mat{\alpha}$ can be adjusted to make a non-uniform mode distribution favourable at this position, thereby increasing the likelihood via $\Prob*{\mat{A} \given \mat{X}}$.
This mechanism introduces an incentive for the model to use as few modes as possible to explain the data and allows us to predict a relative importance of the modes when calculating the posterior of new observations $\mat{x^\ast}$.
\todo{Compare to other models?}
\Cref{fig:dynamic_graphical_model} shows the resulting graphical model which divides the generative process for every data point in the application of the latent functions on the left side and the assignment process on the right side.
Interdependencies between the data points are introduced through the Gaussian process priors on $\rv{f_n^{\pix{k}}}$ and $\rv{\alpha_n^{\pix{k}}}$ and depend on the hyperparameters $\mat{\theta} = \Set*{\mat{\theta^{\pix{k}}}, \mat{\theta_\alpha^{\pix{k}}}, \sigma^{\pix{k}}}_{k=1}^K$.
The priors for the $f^{\pix{k}}$ can be chosen independently to encode different prior assumptions about the modes.
In \cref{subsec:choicenet} we use different kernels separate a nonlinear signal from a noise process.
Going further, we can also use deep Gaussian processes as priors for the $f^{\pix{k}}$ \parencite{damianou_deep_2012, salimbeni_doubly_2017}.
Since many real word systems are inherently hierarchical, prior knowledge can often more easily be formulated using composite functions \parencite{kaiser_bayesian_2017}.
\section{Variational approximation}
Exact inference is intractable in this model.
Instead, we now formulate a variational approximation following ideas from~\parencite{hensman_gaussian_2013, salimbeni_doubly_2017}.
Because of the rich structure in our model, finding a variational lower bound which is both faithful and can be evaluated analytically is hard.
Instead, we formulate an approximation which factorizes along both the $K$ modes and $N$ data points.
This bound can be sampled efficiently and allows us to optimize both the models for the different modes $\Set*{f^{\pix{k}}}_{k=1}^K$ and our belief about the data assignments $\Set*{\mat{a_n}}_{n=1}^N$ simultaneously using stochastic optimization methods.
\subsection{Variational Lower Bound}
As first introduced by \textcite{titsias_variational_2009}, we augment all Gaussian processes in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
We collect them as $\mat{Z} = \Set*{\mat{Z^{\pix{k}}}, \mat{Z_\alpha^{\pix{k}}}}_{k=1}^K$ and $\mat{U} = \Set*{\mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K$.
Taking the function $f^{\pix{k}}$ and its corresponding GP as an example, the inducing variables $\mat{u^{\pix{k}}}$ are jointly Gaussian with the latent function values $\mat{F^{\pix{k}}}$ of the observed data by the definition of GPs.
We follow \parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
This formulation introduces the set $\Set*{\mat{Z^{\pix{k}}}, \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$ of variational parameters indicated in~\cref{fig:dynamic_graphical_model}.
To simplify notation we will drop the dependency on the inducing inputs $\mat{Z}$ in the following.
A central assumption of this approximation is that given enough well-placed inducing variables $\mat{u^{\pix{k}}}$, they are a sufficient statistic for the latent function values $\mat{F^{\pix{k}}}$.
This implies conditional independence of the $\mat{f_n^{\pix{k}}}$ given $\mat{u^{\pix{k}}}$ and $\mat{X}$.
With this assumption, the variational posterior of a single GP can be written as
\Variat*{\mat{F^{\pix{k}}} \given \mat{X}}
\int \Variat*{\mat{u^{\pix{k}}}}
\Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}}
\diff \mat{u^{\pix{k}}}
\int \Variat*{\mat{u^{\pix{k}}}}
\prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{x_n}}
\diff \mat{u^{\pix{k}}},
which can be evaluated analytically since it is a convolution of Gaussians.
This formulation simplifies inference within single Gaussian processes.
Next, we discuss how to handle correlations between the different modes and the assignment processes.
Given a set of assignments $\mat{A}$, this factorization along the data points is preserved in our model due to the assumed independence of the different modes in~\cref{eq:true_marginal_likelihood}.
The independence is lost if the assignments are unknown.
In this case, both the (a priori independent) assignment processes and the modes influence each other through data with unclear assignments.
Following the ideas of Doubly Stochastic Variational Inference (DSVI) presented by \textcite{salimbeni_doubly_2017} in the context of deep Gaussian processes, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
That is, our variational posterior takes the factorized form
\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}
= &\Variat*{\mat{\alpha}, \Set*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K} \\
= &\prod_{k=1}^K\prod_{n=1}^N \Prob*{\mat{\alpha_n^{\pix{k}}} \given \mat{u_\alpha^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u_\alpha^{\pix{k}}}} \\
&\prod_{k=1}^K \prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u^{\pix{k}}}}.
Our goal is to recover a posterior for both the different modes and the data assignment.
To achieve this, instead of marginalizing $\mat{A}$, we consider the variational joint of $\mat{Y}$ and $\mat{A}$,
\MoveEqLeft\Variat*{\mat{Y}, \mat{A}} = \\
\Prob*{\mat{Y} \given \mat{F}, \mat{A}}
\Prob*{\mat{A} \given \mat{\alpha}}
\diff \mat{F} \diff \mat{\alpha},
which retains both the Gaussian likelihood of $\mat{Y}$ and the multinomial likelihood of $\mat{A}$ shown in \cref{eq:multinomial_likelihood}.
A lower bound $\Ell_{\text{DMGP}}$ for the log-joint $\log\Prob*{\mat{Y}, \mat{A} \given \mat{X}}$ of the DMGP is given by
\Ell_{\text{DMGP}} &= \Moment*{\E_{\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}}}{\log\frac{\Prob*{\mat{Y}, \mat{A}, \mat{F}, \mat{\alpha}, \mat{U} \given \mat{X}}}{\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}}} \\
&= \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{f_n}}}}{\log \Prob*{\mat{y_n} \given \mat{f_n}, \mat{a_n}}} \\
&\quad + \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{\alpha_n}}}}{\log \Prob*{\mat{a_n} \given \mat{\alpha_n}}} \\
&\quad - \sum_{k=1}^K \KL{\Variat*{\mat{u^{\pix{k}}}}}{\Prob*{\mat{u^{\pix{k}}} \given \mat{Z^{\pix{k}}}}} \\
&\quad - \sum_{k=1}^K \KL{\Variat*{\mat{u_\alpha^{\pix{k}}}}}{\Prob*{\mat{u_\alpha^{\pix{k}}} \given \mat{Z_\alpha^{\pix{k}}}}}.
Due to the structure of~\cref{eq:variational_distribution}, the bound factorizes along the data, enabling stochastic optimization.
This bound has complexity $\Fun*{\Oh}{NM^2K}$ to evaluate.
\subsection{Optimization of the Lower Bound}
An important property of the variational bound for DSVI described in \parencite{salimbeni_doubly_2017} is that taking samples for single data points is straightforward and can be implemented efficiently.
Specifically, for some $k$ and $n$, samples $\mat{\hat{f}_n^{\pix{k}}}$ from $\Variat*{\mat{f_n^{\pix{k}}}}$ are independent of all other parts of the model and can be drawn using samples from univariate unit Gaussians using reparametrizations \parencite{kingma_variational_2015,rezende_stochastic_2014}.
While it would not be necessary to sample from the different modes since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically \parencite{hensman_gaussian_2013}, we apply the idea to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$.
For $\mat{\alpha}$, the analytical propagation of uncertainties through the $\softmax$-renormalization and multinomial likelihoods is intractable but can easily be evaluated using sampling.
We optimize $\Ell_{\text{DMGP}}$ to simultaneously recover maximum likelihood estimates of the hyperparameters $\mat{\theta}$, the variational parameters $\Set*{\mat{Z}, \mat{m}, \mat{S}}$, and assignments $\mat{A}$.
For every $n$, we represent the belief about $\mat{a_n}$ as a $K$-dimensional discrete distribution $\Variat*{\mat{a_n}}$.
This distribution models the result of drawing a sample from $\Multinomial*{\mat{a_n} \given \Fun{\softmax}{\mat{\alpha_n}}}$ during the generation of the data point $(\mat{x_n}, \mat{y_n})$.
Since we want to optimize $\Ell_{\text{DMGP}}$ using (stochastic) gradient descent, we need to employ a continuous relaxation to gain informative gradients of the bound with respect to the binary (and discrete) vectors $\mat{a_n}$.
One straightforward way to relax the problem is to use the current belief $\Variat*{\mat{a_n}}$ as parameters for a convex combination of the $\mat{f_n^{\pix{k}}}$, that is, to approximate $\mat{f_n} \approx \sum_{k=1}^K \Variat*{\mat{a_n^{\pix{k}}}}\mat{\hat{f}_n^{\pix{k}}}$.
Using this relaxation causes multiple problems in practice.
Most importantly, explaining data points as mixtures of modes can substantially simplify the learning problem while violating the modelling assumption that every data point was generated using exactly one mode.
Because of this, special care must be taken during optimization to enforce the sparsity of $\Variat*{\mat{a_n}}$.
To avoid this problem, we propose using a different relaxation based on additional stochasticity.
Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a Concrete random variable as suggested by \textcite{maddison_concrete_2016} whose parameters are given by $\Variat*{\mat{a_n}}$.
Based on a temperature parameter $\lambda$, a Concrete random variable enforces sparsity but is also continuous and yields informative gradients using automatic differentiation.
Samples from a Concrete random variable are unit vectors and for $\lambda \to 0$ their distribution approaches a discrete distribution.
Our approximate evaluation of the bound in \cref{eq:variational_bound} during optimization has multiple sources of stochasticity, all of which are unbiased.
First, we approximate the expectations using Monte Carlo samples $\mat{\hat{f}_n^{\pix{k}}}$, $\mat{\hat{\alpha}_n^{\pix{k}}}$, and $\mat{\hat{a}_n}$.
And second, the factorization of the bound along the data allows us to use mini-batches for optimization \parencite{salimbeni_doubly_2017, hensman_gaussian_2013}.
\subsection{Approximate Predictions}
Predictions for a test location $\mat{x_\ast}$ are mixtures of $K$ independent Gaussians, given by
\Variat*{\mat{f_\ast} \given \mat{x_\ast}}
&= \int \sum_{k=1}^K \Variat*{a_\ast^{\pix{k}} \given \mat{x_\ast}} \Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}} \diff \mat{a_\ast^{\pix{k}}}\\
&\approx \sum_{k=1}^K \hat{a}_\ast^{\pix{k}} \mat{\hat{f}_\ast^{\pix{k}}}.
The predictive posteriors of the $K$ modes $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow Gaussian processes and can be calculated analytically \parencite{hensman_gaussian_2013}.
Samples from the predictive density over $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ can be obtained by sampling from the Gaussian process posteriors $\Variat*{\mat{\alpha_\ast^{\pix{k}}} \given \mat{x_\ast}}$ and renormalizing the resulting vector $\mat{\alpha_\ast}$ using the $\softmax$-function.
The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model's belief about how many and which of the $K$ modes are relevant at the test location $\mat{x_\ast}$.
\subsection{Extension to deep Gaussian processes}
As long as their variational bound can be efficiently sampled, any model can be used in place of shallow GPs for the $f^{\pix{k}}$.
Since our approximation is based on DSVI for deep Gaussian processes, an extension to DGPs is straightforward.
Analogously to \parencite{salimbeni_doubly_2017}, our new prior assumption about the $\nth{k}$ latent function values $\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}}$ is given by
\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}} = \prod_{l=1}^L \Prob*{\mat{F_l^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}} \mat{F_{l-1}^{\prime\pix{k}}}, \mat{Z_l^{\prime\pix{k}}}}
for an $L$-layer deep GP and with $\mat{F_0^{\prime\pix{k}}} \coloneqq \mat{X}$.
Similar to the single-layer case, we introduce sets of inducing points $\mat{Z_l^{\prime\pix{k}}}$ and a variational distirbution over their corresponding function values $\Variat*{\mat{u_l^{\prime\pix{k}}}} = \Gaussian*{\mat{u_l^{\prime\pix{k}}} \given \mat{m_l^{\prime\pix{k}}}, \mat{S_l^{\prime\pix{k}}}}$.
We collect the latent multi-layer function values as $\mat{F^\prime} = \Set{\mat{F_l^{\prime\pix{k}}}}_{k=1,l=1}^{K,L}$ and corresponding $\mat{U^\prime}$ and assume an extended variational distribution
\MoveEqLeft\Variat*{\mat{F^\prime}, \mat{\alpha}, \mat{U^\prime}} \\
= &\Variat*{\mat{\alpha}, \Set*{\mat{u_\alpha^{\pix{k}}}}_{k=1}^K, \Set*{\mat{F_l^{\prime\pix{k}}}, \mat{u_l^{\prime\pix{k}}}}_{k=1,l=1}^{K,L}} \\
= &\prod_{k=1}^K\prod_{n=1}^N \Prob*{\mat{\alpha_n^{\pix{k}}} \given \mat{u_\alpha^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u_\alpha^{\pix{k}}}} \\
&\prod_{k=1}^K \prod_{l=1}^L \prod_{n=1}^N \Prob*{\mat{f_{n,l}^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}}, \mat{x_n}}\Variat*{\mat{u_l^{\prime\pix{k}}}},
where we identify $\mat{f_n^{\prime\pix{k}}} = \mat{f_{n,L}^{\prime\pix{k}}}$.
As the $\nth{n}$ marginal of the $\nth{L}$ layer depends only on the $\nth{n}$ marginal of all layers above, $\Variat{\mat{f_n^{\prime\pix{k}}}} = \int \prod_{l=1}^{L-1} \Variat{\mat{f_{n,l}^{\prime\pix{k}}}} \diff \mat{f_{n,l}^{\prime\pix{k}}}$, sampling from these marginals remains straightforward \parencite[][Remark 2]{salimbeni_doubly_2017}.
The resulting bound is structurally similar to \cref{eq:variational_bound} and given by
&= \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{f^\prime_n}}}}{\log \Prob*{\mat{y_n} \given \mat{f^\prime_n}, \mat{a_n}}} \\
&\quad + \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{\alpha_n}}}}{\log \Prob*{\mat{a_n} \given \mat{\alpha_n}}} \\
&\quad - \sum_{k=1}^K \sum_{l=1}^L \KL{\Variat{\mat{u_l^{\pix{k}}}}}{\Prob{\mat{u_l^{\pix{k}}} \given \mat{Z_l^{\pix{k}}}}} \\
&\quad - \sum_{k=1}^K \KL{\Variat*{\mat{u_\alpha^{\pix{k}}}}}{\Prob*{\mat{u_\alpha^{\pix{k}}} \given \mat{Z_\alpha^{\pix{k}}}}}.
To calculate the first term, samples have to be propagated through the deep GP structures.
This extended bound thus has complexity $\Fun*{\Oh}{NM^2LK}$ to evaluate in the general case and complexity $\Fun*{\Oh}{NM^2\cdot\Fun{\max}{L, K}}$ if the assignments $\mat{a_n}$ take binary values.
In this section we investigate the behavior of the DMGP model in multiple regression settings.
First, we apply the DMGP to an artificial data set and showcase how the different components of the model interact to identify unimodal and multi-modal parts of the input space.
Second, we show how different priors on the different modes can be used to separate a signal from unrelated noise.
\todo{Reformulate when intro is done}And third, we investigate a data set which contains observations of two independent dynamical systems mixed together and show how the DMGP can recover information about both systems.
We use an implementation of DMGP in TensorFlow \parencite{tensorflow2015-whitepaper} based on GPflow \parencite{matthews_gpflow_2017} and the implementation of doubly stochastic variational inference \parencite{salimbeni_doubly_2017}.
\subsection{Artificial data set}
Joint posterior.
Assignment probabilities.
The DMGP posterior on an artificial data set with unimodal and bimodal parts. Best viewed in color.
\Cref{fig:semi_bimodal:a} shows the joint predictions which are mixtures of two gaussians weighed by the assignment probabilities shown in \cref{fig:semi_bimodal:c}.
These weights are represented as opacities of the corresponding modes, which shows that the green mode is only relevant roughly in the interval $[0, 5]$.
\Cref{fig:semi_bimodal:b} shows the posterior estimates of the assignment of the training data to the second mode.
Normalized samples from the assignment process $\mat{\alpha}$ of the model shown in \cref{fig:semi_bimodal}.
These samples are used to weigh the two modes depending on the position in the input space.
The model has learned that the second mode is only relevant roughly in the interval $[0, 5]$.
To demonstrate inference in our model, we begin with an experiment based on an artificial data set.
The data, together with recovered posterior attributions, can be seen in \cref{fig:semi_bimodal:b}.
We uniformly sampled 350 data points in the interval $ x \in [-2\pi, 2\pi]$ and obtain $y$ as $y = \Fun{\sin}{x} - \delta \cdot 2 \Fun{\exp}{-0.5 \cdot (x-2)^2} + \epsilon$ with a Bernoulli-distributed $\delta \sim \Fun{\Ber}{0.5}$ to introduce a multi-modality and additive independent noise $\epsilon \sim \Gaussian*{0, 0.005^2}$.
We use squared exponential kernels as priors for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and $M=25$ pseudo-inputs in every GP.
\Cref{fig:semi_bimodal,fig:semi_bimodal:c} show the posterior of a bimodal DMGP applied to the data which correctly identified the underlying functions.
\Cref{fig:semi_bimodal:b} shows the posterior belief about about the assignments $\mat{A}$ and illustrates that DMGP separated the input space in a part which is unimodal where all points have been assigned to the same mode and a bimodal part where both modes are necessary to explain the data.
This distinction is explicitly represented in the model via the assignment processes $\mat{\alpha}$ shown in \cref{fig:semi_bimodal:c}.
The model has learned that the second mode is irrelevant at, say, $x=-5$ and predictions using \cref{eq:predictive_posterior} simplify to a standard GP posterior.
The DMGP is implicitly incentivsed to explain the data using only one mode if possible through the likelihood term of the inferred $\mat{a_n}$ in \cref{eq:variational_bound}.
At $x = -10$ both the two modes and the assignment processes start reverting to their respective priors away from the data.
\subsection{Robust Regression}
Results on the ChoiceNet data set.
The gray part of the table show the results from \parencite{choi_choicenet_2018}.
We report root mean squared error (RMSE) comparable to the previous results together with mean log likelihoods (MLL).
Outliers & DMGP (MLL) & DMGP (RMSE) & CN & MDN & MLP & GPR & LGPR & RGPR \\
0\,\% & 2.86 & \textbf{0.008} & 0.034 & 0.028 & 0.039 & \textbf{0.008} & 0.022 & 0.017 \\
20\,\% & 2.71 & \textbf{0.008} & 0.022 & 0.087 & 0.413 & 0.280 & 0.206 & 0.013 \\
40\,\% & 2.12 & \textbf{0.005} & 0.018 & 0.565 & 0.452 & 0.447 & 0.439 & 1.322 \\
60\,\% & 0.874 & 0.031 & \textbf{0.023} & 0.645 & 0.636 & 0.602 & 0.579 & 0.738 \\
80\,\% & 0.126 & 0.128 & \textbf{0.084} & 0.778 & 0.829 & 0.779 & 0.777 & 1.523 \\
Joint posterior with 40\,\% outliers.
Assignment probabilities with 40\,\% outliers.
Joint posterior with 60\,\% outliers.
Assignment probabilities with 60\,\% outliers.
The DMGP posterior on the ChoiceNet data set with 40\,\% outliers (upper row) and 60\,\% outliers (lower row).
The bimodal DMGP with an RBF kernel in the first and a white noise kernel in the second mode identifies the underlying signal perfectly up to 40\,\% outliers.
For 60\,\% outliers, some of the noise is interpreted as signal, but the latent function is still recovered faithfully.
Our second experiment applies DMGP to a one-dimensional regression problem with uniformly distributed outliers in the training data.
We use a task proposed by \textcite{choi_choicenet_2018} where we sample inputs $x \in [-3, 3]$ uniformly and apply the function $\Fun{f}{x} = (1 - \delta)\Fun{\cos}{\sfrac{\pi}{2} \cdot x}\Fun{\exp}{-(\sfrac{x}{2})^2} + \delta \cdot \epsilon$, where $\delta \sim \Fun{\Ber}{\lambda}$ and $\epsilon \sim \Fun{\Uni}{-1, 3}$.
That is, a fraction $\lambda$ of the training data, the outliers, is replaced by uniform noise.
We sample a total of 1000 data points and use $M=25$ pseudo inputs for every GP in our model.
Every mode in our model can use a different kernel and therefore encode different prior assumptions about the modes.
For robust regression, we use a bimodal model, one of which uses a standard RBF kernel and the other using a white noise kernel.
To avoid pathological solutions for high outlier ratios, we add a prior to the likelihood variance of the mode modelling the function which encodes our assumption that there actually is a signal in the training data.
The model proposed in \parencite{choi_choicenet_2018} called ChoiceNet (CN) is a specifically designed neural network structure and inference algorithm to deal with corrupted data.
In their work, they compare their approach to a standard multilayer perceptron (MLP), a mixture density network (MDN), standard Gaussian process regression (GPR), leveraged Gaussian process regression (LGPR) \parencite{choi_robust_2016} and infinite mixtures of Gaussian Processes (RGPR) \parencite{rasmussen_infinite_2002}.
Results are shown in \cref{tab:choicenet} for outlier rates varied from 0\,\% to 80\,\%.
Besides the root mean squared error (RMSE), we also report the mean test log likelihood (MLL) of the mode representing the target function in our model.
Up to an outlier rate of 40\,\%, our model correctly identifies the outliers and ignores them, resulting in a predictive posterior of the signal-mode equivalent to standard GP regression without outliers.
In the special case of 0\,\% outliers, DMGP correctly identifies that the mode modelling the noise is not necessary and disables it, thereby simplifying itself to standard GP regression.
For high outlier rates, strong prior knowledge about the signal would be required to still identify it perfectly.
\Cref{fig:choicenet} shows the posterior for an outlier rate of 60\,\%.
While the function has still been identified well, some of the noise is also explained using this mode, thereby introducing slight errors in the predictions.
\subsection{Mixed Cart-pole systems}
Results on the cart-pole data set.
We report mean log likelihoods with standard error for 10 runs.
% table-align-uncertainty,
& & \multicolumn{2}{c}{Mixed} & \multicolumn{2}{c}{Default only} & \multicolumn{2}{c}{Short-pole only} \\
\cmidrule(lr){3-4} \cmidrule(lr){5-6} \cmidrule(lr){7-8}
Runs & & {Train} & {Test} & {Train} & {Test} & {Train} & {Test} \\
9 & DMGP & \bfseries 0.58 \pm 0.01 & \bfseries 0.52 \pm 0.01 & 0.855 \pm 0.002 & 0.845 \pm 0.002 & 0.690 \pm 0.009 & 0.604 \pm 0.005 \\
% 9 & U-DMGP & 0.473 \pm 0.002 & 0.426 \pm 0.001 & {\textemdash} & {\textemdash} & {\textemdash} & {\textemdash} \\
6 & DMGP 2 & \bfseries 0.56 \pm 0.02 & \bfseries 0.53 \pm 0.01 & 0.851 \pm 0.003 & \bfseries 0.859 \pm 0.002 & 0.679 \pm 0.021 & 0.602 \pm 0.017 \\
6 & DMGP 3 & 0.531 \pm 0.003 & 0.495 \pm 0.005 & 0.858 \pm 0.004 & 0.854 \pm 0.001 & 0.630 \pm 0.015 & 0.541 \pm 0.018 \\
6 & DMGP 4 & 0.517 \pm 0.006 & 0.490 \pm 0.004 & 0.858 \pm 0.002 & 0.853 \pm 0.002 & 0.608 \pm 0.008 & 0.560 \pm 0.010 \\
7 & DMGP 5 & 0.533 \pm 0.003 & 0.505 \pm 0.005 & 0.850 \pm 0.005 & 0.852 \pm 0.004 & 0.658 \pm 0.012 & 0.58 \pm 0.015 \\
10 & BNN+LV & 0.519 \pm 0.005 & \bfseries 0.524 \pm 0.005 & {\textemdash} & {\textemdash} & {\textemdash} & {\textemdash} \\
1 & GPR Mixed & 0.43 \pm 0.00 & 0.39 \pm 0.00 & {\textemdash} & {\textemdash} & {\textemdash} & {\textemdash} \\
1 & GPR Default & {\textemdash} & {\textemdash} & \bfseries 0.87 \pm 0.00 & \bfseries 0.86 \pm 0.00 & -7.99 \pm 0.00 & -8.77 \pm 0.00 \\
1 & GPR Short & {\textemdash} & {\textemdash} & -5.03 \pm 0.00 & -4.91 \pm 0.00 & \bfseries 0.88 \pm 0.00 & \bfseries 0.76 \pm 0.00 \\
Our third experiment is based on the cart-pole benchmark for reinforcement learning as described by \textcite{barto_neuronlike_1983}.
We use the implementation provided by OpenAI Gym \parencite{brockman_openai_2016}.
In this benchmark, the objective is to apply forces to a cart moving on a frictionless track to keep a pole attached to the cart by a joint in an upright position.
We consider the regression problem of predicting the change of the pole's angle given the current state of the cart and the action applied.
The current state of the cart is given by the cart's position and velocity and the pole's angular position and velocity.
\todo{Reformulate when intro is done} To simulate a dynamical system with changing states of operation, our experimental setup is to sample trajectories from two different cart-pole systems and merging the resulting data into one training set.
The task is to not only learn a model which explains this data well but to recover the multi-modalities introduced by the different system configurations.
We sample trajectories from the system by initializing the pole in an almost upright position and then applying 10 uniform random actions.
We add Gaussian noise $\epsilon \sim \Gaussian*{0, 0.01^2}$ to the observed angle changes.
To increase the non-linearity of the dynamics we apply the action for 5 consecutive time steps and allow the pole to swing freely instead of ending the trajectory after reaching a specific angle.
The data set consists of 500 points sampled from the \emph{default} cart-pole system and another 500 points sampled from a \emph{short-pole} cart-pole system in which we halve the mass of the pole to 0.05 and shorten the pole to 0.1, a tenth of its default length.
This short-pole system is more unstable and the pole reaches higher speeds.
Predictions in this system therefore have to take the multi-modality into account, as mean predictions between the more stable and the more unstable system can never be observed.
We consider three test sets, one sampled from the default system, one sampled from the short-pole system and the concatenation of the two.
For this data set, we use squared exponential kernels for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and $M = 100$ pseudo-inputs in every GP.
Additionally, we evaluate the performance of deep GPs with up to 5 layers and squared exponential kernels as models for the different modes.
As described in \parencite{salimbeni_doubly_2017}, we use identity mean functions for all but the last layers and initialize the variational distributions with low covariances.
We compare our models with three-layer relu-activated Bayesian neural networks with added latent variables as introduced by \textcite{depeweg_learning_2016}.
These latent variables can be used to effectively model multi-modalities and stochasticity in dynamical systems for model-based reinforcement learning \parencite{depeweg_decomposition_2018}.
We compare also to three kinds of sparse GPs \parencite{hensman_scalable_2015} trained on the mixed data set and trained on data from only one of the two modes.
\Cref{tab:cartpole} shows mean training and test log likelihoods and their standard error over 10 runs for these models.
The \emph{mixed}-column corresponds to training and test log likelihoods for a standard regression problem, which in this case is a bimodal one.
The GPR model trained on the mixed data set shows the worst performance since its predictions are single Gaussians spanning both modes.
Additionally, the mean prediction is approximately the mean of the two modes and is physically implausible.
Both the BNN+LV and DMGP models perform substantially better as they can model the bimodality.
BNN+LV assumes continuous latent variables and a bimodal distribution can be recovered by approximately marginalizing these latent variables via sampling.
The predictive posterior of unknown shape is approximated using a mixture of many Gaussians.
Compared to the shallow DMGP, the prior of BNN+LV is harder to interpret, as the DMGP's generative process produces a mixture of two Gaussians representing the two modes in the data.
Adding more layers to the DMGP model leads to more expressive models whose prior on the different modes becomes less informative.
\todo{Revisit with final results}For this cart-pole data, two-layer deep GPs seem to be a good compromise between model expressiveness and the strength of the prior.
On the \emph{mixed} test set, DMGP and BNN+LV both show comparable likelihoods.
However, the DMGP is a more expressive model whose different modes can be evaluated further.
The results in the \emph{default only} and \emph{short-pole only} columns compare training and test likelihoods on the parts of the training and test sets corresponding to these systems respectively.
We calculate these values by evaluating both modes separately on the data sets and reporting the better result\todo{Should we argue why this is not cheating?}.
We compare these results with sparse GP models trained only on the respective systems.
The two modes of DMGP reliably separate the two different systems.
In fact, the mode corresponding to the \emph{default} system in the two-layer DMGP shows equal test performance to the corresponding GPR model trained only on data of this mode.
The \emph{default} and \emph{short-pole} systems are sufficiently different such that the sparse GPs trained on only one of the two sets performs very poorly.
Out of these two systems, the \emph{short-pole} system is more complicated and harder to learn.
The second mode of DMGP still recovers an adequate model.
Given the fact that the two modes of DMGP model the two system dynamics in the original data, sampling trajectories from them results in physically plausible data, which is not possible with a sparse GP or BNN+LV model.
\item Unser Modell ist das beste Modell, das die Welt je gesehen hat.