123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585 
 \input{preamble/packages.tex}
 \input{preamble/abbreviations.tex}
 \input{figures/preamble/tikz_colors.tex}

 % We use precompiled images and do not add tikz for speed of compilation.
 \newcommand{\includestandalonewithpath}[2][]{%
 \begingroup%
 \StrCount{#2}{/}[\matches]%
 \StrBefore[\matches]{#2}{/}[\figurepath]%
 \includestandalone[#1]{#2}%
 \endgroup%
 }

 % Show overfull boxes
 % \overfullrule=5pt

 \addbibresource{zotero_export.bib}
 \addbibresource{additional.bib}

 % We set this for hyperref
 \title{Multimodal Deep Gaussian Processes}
 % \author{\href{mailto:markus.kaiser@siemens.com}{Markus Kaiser}}
 \author{Anonymous}

 \begin{document}
 \twocolumn[
 \aistatstitle{Multimodal Deep Gaussian Processes}
 % \aistatsauthor{ Markus Kaiser \And Clemens Otte \And Thomas Runkler \And Carl Henrik Ek }
 % \aistatsaddress{ Siemens AG \\ Technical University of Munich \And Siemens AG \And Siemens AG \\ Technical University of Munich \And University of Bristol }
 \aistatsauthor{ Author 1 \And Author 2 \And Author 3 }
 \aistatsaddress{ Institution 1 \And Institution 2 \And Institution 3 }
 ]

 \begin{abstract}
 We propose a novel Bayesian approach to modelling multimodal data generated by multiple independent processes, simultaneously solving the data association and induced supervised learning problems.
 Underpinning our approach is the use of Gaussian process priors which encode structure both on the functions and the associations themselves.
 The association of samples and functions are determined by taking both inputs and outputs into account while also obtaining a posterior belief about the relevance of the global components throughout the input space.
 We present an efficient learning scheme based on doubly stochastic variational inference and discuss how it can be applied to deep Gaussian process priors.
 We show results for an artificial data set, a noise separation problem, and a multimodal regression problem based on the cartpole benchmark.
 \end{abstract}


 \section{Introduction}
 \label{sec:introduction}
 Estimating a function from data is a central aspect of machine learning and a host of different methods exist for this task.
 Fundamentally, a function is an object relating an input value to a single output value, often represented as elements of finitedimensional vector spaces.
 However, for some tasks not all relevant input dimensions can be observed, meaning that for each input location multiple outputs are possible due to changing missing information.
 One class of problems with these characteristics are dynamic systems which often can be in multiple states of operations and where this state itself is not observed.
 Examples include faulty sensors, which, at any time, might emit a correct reading or uninformative noise, or industrial systems with accumulating latent effects, which can induce bifurcation or hysteresis~\parencite{hein_benchmark_2017}.
 In this work we will investigate a data set derived from the cartpole benchmark, which contains trajectories of two instances of the benchmark with different pole lengths.
 This setup represents the difficulties of an industrial system in which, for example due to wear or defective parts, the underlying dynamics change over time.
 In this setting we want to recover both joint predictions marginalizing the current state of operation but also informative models for these separate states.

 Estimating a model in this scenario is often referred to as a \emph{data association problem}~\parencite{BarShalom:1987, Cox93areview}, where both the different functions and the associations of the observations to a function need to be estimated.
 A simple example of this can be seen in \cref{fig:semi_bimodal}, where no single function could have generated the data.
 A slightly different view of the same problem is to consider the data to have been generated by a mixture of processes, where we are interested in factorising the data into these components~\parencite{choi_choicenet_2018}.
 The separation of an underlying signal and noise processes is an application of the latter, where we consider certain observations to be noise and others to be signal~\parencite{rousseeuw_robust_2005,hodge_survey_2004}.

 Early approaches to explaining data using multiple generative processes are based on separating the input space and training local expert models explaining easier subtasks~\parencite{jacobs_adaptive_1991,tresp_mixtures_2001, rasmussen_infinite_2002}.
 The assignment of data points to local experts is handled by a gating network, which learns a function from the inputs to assignment probabilities.
 However, it is still a central assumption of these models that the underlying generative process is unimodal.
 That is, at every position in the input space exactly one expert explains the data.
 Another approach is presented in~\parencite{bishop_mixture_1994}, where the multimodal regression tasks are interpreted as a density estimation problem.
 A high number of candidate distributions are reweighed to match the observed data without modeling the underlying generative process.

 In contrast, we are interested in a generative process, where data at the same location in the input space could have been generated by a number of global independent processes.
 Inherently, the data association problem is illposed and requires assumptions on both the underlying functions and the association of the observations.
 In~\parencite{lazarogredilla_overlapping_2012} the authors place Gaussian process priors on the different generative processes which are assumed to be relevant globally.
 The associations are modelled via a latent association matrix and inference is done using an expectation maximization algorithm.
 This approach takes both the inputs and outputs of the training data into account to solve the association problem.
 A drawback is that the model cannot give a posterior estimate about the relevance of the different modes at different locations in the input space.
 This means that the model can be used for data exploration but additional information is needed in order to perform predictive tasks.
 Another approach in~\parencite{bodin_latent_2017} expands this model by allowing interdependencies between the different generative processes and formulating the association problem as an inference problem on a latent space and a corresponding covariance function.
 However, in this approach the number of components is a free parameter and is prone to overfitting, as the model has no means of turning off components.

 In this paper we formulate a Bayesian model for the data association problem.
 Underpinning our approach is the use of Gaussian process priors which encode structure both on the functions and the associations themselves.
 The use of Gaussian process priors allows us to achieve principled regularization without reducing the solution space leading to a wellregularized learning problem.
 Importantly, we simultaneously solve the association problem for the training data taking both inputs and outputs into account while also obtaining posterior belief about the relevance of the different modes in the input space.
 Our model can describe nonstationary processes in the sense that a different number of modes can be activated in different locations in the input space.
 We describe this nonstationary structure using additional Gaussian process priors which allows us to make full use of problem specific knowledge.
 This leads to a flexible yet interpretable model with a principled treatment of uncertainty.

 The paper has the following contributions:
 In \cref{sec:model}, we propose the multimodal deep Gaussian process model (MDGP).
 In \cref{sec:variational_approximation}, we present an efficient learning scheme via a variational approximation which allows us to simultaneously train all parts of our model via stochastic optimization.
 We discuss how the same learning scheme can be applied to deep Gaussian process priors.
 We demonstrate our model on an artificial data set, a noise separation problem, and a multimodal regression problem based on the cartpole benchmark in \cref{sec:experiments}.


 \section{The Multimodal Deep Gaussian Process Model}
 \label{sec:model}
 \begin{figure}[t]
 \centering
 \includestandalone{figures/dynamic_graphical_model}
 \caption{
 \label{fig:dynamic_graphical_model}
 The graphical model of MDGP.
 The violet observations $(\mat{x_n}, \mat{y_n})$ are generated by the green latent process.
 Exactly one of the $K$ latent functions $f^{\pix{k}}$ and likelihood $\mat{y_n^{\pix{k}}}$ are evaluated to generate $\mat{y_n}$.
 We can place shallow or deep GP priors on these latent function values $\mat{f_n^{\pix{k}}}$.
 The assignment $\mat{a_n}$ to a latent function is driven by inputdependent weights $\mat{\alpha_n^{\pix{k}}}$ which encode the relevance of the different modes at $\mat{x_n}$.
 The different parts of the model are determined by the yellow hyperparameters and blue variational parameters.
 }
 \end{figure}
 The multimodal deep Gaussian process (MDGP) model assumes that there exist $K$ independent functions $\Set*{f^{\pix{k}}}_{k=1}^K$ (or \emph{modes}), which generate pairs of observations $\D = \Set*{(\mat{x_n}, \mat{y_n})}_{n=1}^N$ in a data set.
 Every data point is generated by evaluating one of the $K$ latent functions and adding Gaussian noise from a corresponding likelihood.
 The assignment of the $\nth{n}$ data point to one of the modes is specified by the indicator vector $\mat{a_n} \in \Set*{0, 1}^K$, which has exactly one nonzero entry.
 Our goal is to formulate simultaneous Bayesian inference on the functions $f^{\pix{k}}$ and the assignments $\mat{a_n}$.

 For notational conciseness we collect all $N$ inputs as $\mat{X} = \left(\mat{x_1}, \ldots, \mat{x_N}\right)$ and all outputs as $\mat{Y} = \left(\mat{y_1}, \ldots, \mat{y_N}\right)$.
 We further denote the $\nth{k}$ latent function value associated with the $\nth{n}$ data point as $\rv{f_n^{\pix{k}}} = \Fun{f^{\pix{k}}}{\mat{x_n}}$ and collect them as $\mat{F^{\pix{k}}} = \left( \rv{f_1^{\pix{k}}}, \ldots, \rv{f_N^{\pix{k}}} \right)$ and $\mat{F} = \left( \mat{F^{\pix{1}}}, \ldots, \mat{F^{\pix{K}}} \right)$.
 We refer to the $\nth{k}$ entry in $\mat{a_n}$ as $a_n^{\pix{k}}$ and denote $\mat{A} = \left(\mat{a_1}, \ldots, \mat{a_N}\right)$.

 Given this notation, the marginal likelihood of the MDGP can be separated in the likelihood, the latent function processes, and the assignment process and is given by
 \begin{align}
 \begin{split}
 % NOTE(mrksr): I am soo sorry :(
 \label{eq:true_marginal_likelihood}
 &\!\!\!\Prob*{\mat{Y} \given \mat{X}} \! =
 \!\!\int\!
 \Prob*{\mat{Y} \given \mat{F}, \mat{A}}
 \Prob*{\mat{F} \given \mat{X}}
 \Prob*{\mat{A} \given \mat{X}}
 \diff \mat{A} \diff \mat{F} \\
 &\!\!\!\Prob*{\mat{Y} \given \mat{F}, \mat{A}} \! =
 \!\prod_{n=1}^N\prod_{k=1}^K
 \Gaussian*{\mat{y_n} \given \mat{f_n^{\pix{k}}}, \left(\sigma^{\pix{k}}\right)^2}_{^{\displaystyle,}}^{\Fun{\Ind}{a_n^{\pix{k}} = 1}}
 \end{split}
 \end{align}
 where $\sigma^{\pix{k}}$ is the noise level of the $\nth{k}$ Gaussian likelihood and $\Ind$ is the indicator function.

 Since we assume the $K$ modes to be independent given the data and assignments, we place independent GP priors on the latent functions
 $\Prob*{\mat{F} \given \mat{X}} = \prod_{k=1}^K \Gaussian*{\mat{F^{\pix{k}}} \given \Fun*{\mu^{\pix{k}}}{\mat{X}}, \Fun*{k^{\pix{k}}}{\mat{X}, \mat{X}}}$.
 Our prior on the assignment process is composite.
 First, we assume that the $\mat{a_n}$ are drawn independently from multinomial distributions with logit parameters $\mat{\alpha_n} = \left( \alpha_n^{\pix{1}}, \ldots, \alpha_n^{\pix{K}} \right)$.
 A usual approach to specifying the $\mat{\alpha_n}$ is to assume them to be known a priori and to be equal for all data points~\parencite{lazarogredilla_overlapping_2012}.
 Instead, we want to infer them from the data.
 Specifically, we assume that there is a relationship between the location in the input space $\mathbf{x}$ and the associations.
 By placing independent GP priors on $\mat{\alpha^{\pix{k}}}$, we can encode our prior knowledge of the associations by the choice of covariance function
 $\Prob*{\mat{\alpha} \given \mat{X}} = \prod_{k=1}^K \Gaussian*{\rv{\alpha^{\pix{k}}} \given \mat{0}, \Fun{k_\alpha^{\pix{k}}}{\mat{X}, \mat{X}}}$.
 The prior on the assignments $\mat{A}$ is given by marginalizing the $\mat{\alpha^{\pix{k}}}$, which, when normalized, parametrize a batch of multinomial distributions
 \begin{align}
 \begin{split}
 \label{eq:multinomial_likelihood}
 \Prob*{\mat{A} \given \mat{X}} &=
 \int
 \Multinomial*{\mat{A} \given \Fun{\softmax}{\mat{\alpha}}} \Prob*{\mat{\alpha} \given \mat{X}}
 \diff \rv{\alpha}.
 \end{split}
 \end{align}
 Modelling the relationship between the input and the associations allows us to efficiently model data, which, for example, is unimodal in some parts of the input space and bimodal in others.
 A simple smoothness prior will encode a belief for how quickly we believe the components switch across the input domain.

 Since the GPs of the $\mat{\alpha^{\pix{k}}}$ use a zero mean function, our prior assumption is a uniform distribution of the different modes everywhere in the input space.
 If inference on the $\mat{a_n}$ reveals that, say, all data points at similar positions in the input space can be explained by the same $\nth{k}$ mode, the belief about $\mat{\alpha}$ can be adjusted to make a nonuniform mode distribution favorable at this position, thereby increasing the likelihood via $\Prob*{\mat{A} \given \mat{X}}$.
 This mechanism introduces an incentive for the model to use as few modes as possible to explain the data and allows us to predict a relative importance of the modes when calculating the posterior of new observations $\mat{x_\ast}$.

 \Cref{fig:dynamic_graphical_model} shows the resulting graphical model, which divides the generative process for every data point in the application of the latent functions on the left side and the assignment process on the right side.
 Interdependencies between the data points are introduced through the Gaussian process priors on $\rv{f_n^{\pix{k}}}$ and $\rv{\alpha_n^{\pix{k}}}$ and depend on the hyperparameters $\mat{\theta} = \Set*{\mat{\theta^{\pix{k}}}, \mat{\theta_\alpha^{\pix{k}}}, \sigma^{\pix{k}}}_{k=1}^K$.

 The priors for the $f^{\pix{k}}$ can be chosen independently to encode different prior assumptions about the modes.
 In \cref{subsec:choicenet} we use different kernels to separate a nonlinear signal from a noise process.
 Going further, we can also use deep Gaussian processes as priors for the $f^{\pix{k}}$~\parencite{damianou_deep_2013, salimbeni_doubly_2017}.
 Since many real word systems are inherently hierarchical, prior knowledge can often be formulated more easily using composite functions~\parencite{kaiser_bayesian_2017}.


 \section{Variational Approximation}
 \label{sec:variational_approximation}
 Exact inference is intractable in this model.
 Instead, we now formulate a variational approximation following ideas from~\parencite{hensman_gaussian_2013, salimbeni_doubly_2017}.
 Because of the rich structure in our model, finding a variational lower bound which is both faithful and can be evaluated analytically is hard.
 Instead, we formulate an approximation which factorizes along both the $K$ modes and $N$ data points.
 This bound can be sampled efficiently and allows us to optimize both the models for the different modes $\Set*{f^{\pix{k}}}_{k=1}^K$ and our belief about the data assignments $\Set*{\mat{a_n}}_{n=1}^N$ simultaneously using stochastic optimization.

 \subsection{Variational Lower Bound}
 \label{subsec:lower_bound}
 As first introduced by~\textcite{titsias_variational_2009}, we augment all Gaussian processes in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
 We collect them as $\mat{Z} = \Set*{\mat{Z^{\pix{k}}}, \mat{Z_\alpha^{\pix{k}}}}_{k=1}^K$ and $\mat{U} = \Set*{\mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K$.
 Taking the function $f^{\pix{k}}$ and its corresponding GP as an example, the inducing variables $\mat{u^{\pix{k}}}$ are jointly Gaussian with the latent function values $\mat{F^{\pix{k}}}$ of the observed data by the definition of GPs.
 We follow~\parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
 This formulation introduces the set $\Set*{\mat{Z^{\pix{k}}}, \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$ of variational parameters indicated in~\cref{fig:dynamic_graphical_model}.
 To simplify notation we drop the dependency on $\mat{Z}$ in the following.

 A central assumption of this approximation is that given enough wellplaced inducing variables $\mat{u^{\pix{k}}}$, they are a sufficient statistic for the latent function values $\mat{F^{\pix{k}}}$.
 This implies conditional independence of the $\mat{f_n^{\pix{k}}}$ given $\mat{u^{\pix{k}}}$ and $\mat{X}$.
 The variational posterior of a single GP can then be written as,
 \begin{align}
 \begin{split}
 \Variat*{\mat{F^{\pix{k}}} \given \mat{X}}
 % &=
 % \int \Variat*{\mat{u^{\pix{k}}}}
 % \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}}
 % \diff \mat{u^{\pix{k}}}
 % \\
 &=
 \int \Variat*{\mat{u^{\pix{k}}}}
 \prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{x_n}}
 \diff \mat{u^{\pix{k}}},
 \end{split}
 \end{align}
 which can be evaluated analytically, since it is a convolution of Gaussians.
 This formulation simplifies inference within single Gaussian processes.
 Next, we discuss how to handle correlations between the different modes and the assignment processes.

 Given a set of assignments $\mat{A}$, this factorization along the data points is preserved in our model due to the assumed independence of the different modes in~\cref{eq:true_marginal_likelihood}.
 The independence is lost if the assignments are unknown.
 In this case, both the (a priori independent) assignment processes and the modes influence each other through data with unclear assignments.
 Following the ideas of Doubly Stochastic Variational Inference (DSVI) presented by~\textcite{salimbeni_doubly_2017} in the context of deep Gaussian processes, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
 That is, our variational posterior takes the factorized form,
 \begin{align}
 \begin{split}
 \label{eq:variational_distribution}
 \Variat*{\mat{F}, \mat{\alpha}, \mat{U}}
 = &\Variat*{\mat{\alpha}, \Set*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K} \\
 = &\prod_{k=1}^K\prod_{n=1}^N \Prob*{\mat{\alpha_n^{\pix{k}}} \given \mat{u_\alpha^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u_\alpha^{\pix{k}}}} \\
 &\prod_{k=1}^K \prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u^{\pix{k}}}}.
 \end{split}
 \end{align}

 Our goal is to recover a posterior for both the different modes and the data assignment.
 To achieve this, instead of marginalizing $\mat{A}$, we consider the variational joint of $\mat{Y}$ and $\mat{A}$,
 \begin{align}
 \begin{split}
 \Variat*{\mat{Y}, \mat{A}} =
 &\!\int\! % We cheat a bit to make the line fit.
 \Prob*{\mat{Y} \given \mat{F}, \mat{A}}
 \Prob*{\mat{A} \given \mat{\alpha}}
 \Variat*{\mat{F}, \mat{\alpha}}
 \diff \mat{F} \diff \mat{\alpha},
 \end{split}
 \end{align}
 which retains both the Gaussian likelihood of $\mat{Y}$ and the multinomial likelihood of $\mat{A}$ shown in \cref{eq:multinomial_likelihood}.
 A lower bound $\Ell_{\text{MDGP}}$ for the logjoint $\log\Prob*{\mat{Y}, \mat{A} \given \mat{X}}$ of the MDGP is given by
 \begin{align}
 \begin{split}
 \label{eq:variational_bound}
 \Ell_{\text{MDGP}} &= \Moment*{\E_{\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}}}{\log\frac{\Prob*{\mat{Y}, \mat{A}, \mat{F}, \mat{\alpha}, \mat{U} \given \mat{X}}}{\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}}} \\
 &= \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{f_n}}}}{\log \Prob*{\mat{y_n} \given \mat{f_n}, \mat{a_n}}} \\
 &\quad + \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{\alpha_n}}}}{\log \Prob*{\mat{a_n} \given \mat{\alpha_n}}} \\
 &\quad  \sum_{k=1}^K \KL{\Variat*{\mat{u^{\pix{k}}}}}{\Prob*{\mat{u^{\pix{k}}} \given \mat{Z^{\pix{k}}}}} \\
 &\quad  \sum_{k=1}^K \KL{\Variat*{\mat{u_\alpha^{\pix{k}}}}}{\Prob*{\mat{u_\alpha^{\pix{k}}} \given \mat{Z_\alpha^{\pix{k}}}}}.
 \end{split}
 \end{align}
 Due to the structure of~\cref{eq:variational_distribution}, the bound factorizes along the data enabling stochastic optimization.
 This bound has complexity $\Fun*{\Oh}{NM^2K}$ to evaluate.

 \subsection{Optimization of the Lower Bound}
 \label{subsec:computation}
 An important property of the variational bound for DSVI described in~\parencite{salimbeni_doubly_2017} is that taking samples for single data points is straightforward and can be implemented efficiently.
 Specifically, for some $k$ and $n$, samples $\mat{\hat{f}_n^{\pix{k}}}$ from $\Variat*{\mat{f_n^{\pix{k}}}}$ are independent of all other parts of the model and can be drawn using samples from univariate unit Gaussians using reparametrizations~\parencite{kingma_variational_2015,rezende_stochastic_2014}.

 While it would not be necessary to sample from the different modes, since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically~\parencite{hensman_gaussian_2013}, we apply this idea to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$.
 For $\mat{\alpha}$, the analytical propagation of uncertainties through the $\softmax$ renormalization and multinomial likelihoods is intractable but can easily be evaluated using sampling.

 We optimize $\Ell_{\text{MDGP}}$ to simultaneously recover maximum likelihood estimates of the hyperparameters $\mat{\theta}$, the variational parameters $\Set*{\mat{Z}, \mat{m}, \mat{S}}$, and assignments $\mat{A}$.
 For every $n$, we represent the belief about $\mat{a_n}$ as a $K$dimensional discrete distribution $\Variat*{\mat{a_n}}$.
 This distribution models the result of drawing a sample from $\Multinomial*{\mat{a_n} \given \Fun{\softmax}{\mat{\alpha_n}}}$ during the generation of the data point $(\mat{x_n}, \mat{y_n})$.

 Since we want to optimize $\Ell_{\text{MDGP}}$ using (stochastic) gradient descent, we need to employ a continuous relaxation to gain informative gradients of the bound with respect to the binary (and discrete) vectors $\mat{a_n}$.
 One straightforward way to relax the problem is to use the current belief about $\Variat*{\mat{a_n}}$ as parameters for a convex combination of the $\mat{f_n^{\pix{k}}}$, that is, to approximate $\mat{f_n} \approx \sum_{k=1}^K \Variat*{\mat{a_n^{\pix{k}}}}\mat{\hat{f}_n^{\pix{k}}}$.
 Using this relaxation causes multiple problems in practice.
 Most importantly, explaining data points as mixtures of modes can substantially simplify the learning problem while violating the modelling assumption that every data point was generated using exactly one mode.
 Because of this, special care must be taken during optimization to enforce the sparsity of $\Variat*{\mat{a_n}}$.

 To avoid this problem, we propose using a different relaxation based on additional stochasticity.
 Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a Concrete random variable as suggested by~\textcite{maddison_concrete_2016}, whose parameters are given by $\Variat*{\mat{a_n}}$.
 Based on a temperature parameter $\lambda$, a Concrete random variable enforces sparsity but is also continuous and yields informative gradients using automatic differentiation.
 Samples from a Concrete random variable are unit vectors and for $\lambda \to 0$ their distribution approaches a discrete distribution.

 Our approximate evaluation of the bound in \cref{eq:variational_bound} during optimization has multiple sources of stochasticity, all of which are unbiased.
 First, we approximate the expectations using Monte Carlo samples $\mat{\hat{f}_n^{\pix{k}}}$, $\mat{\hat{\alpha}_n^{\pix{k}}}$, and $\mat{\hat{a}_n}$.
 And second, the factorization of the bound along the data allows us to use minibatches for optimization~\parencite{salimbeni_doubly_2017, hensman_gaussian_2013}.


 \subsection{Approximate Predictions}
 \label{subsec:predictions}
 Predictions for a test location $\mat{x_\ast}$ are mixtures of $K$ independent Gaussians, given by
 \begin{align}
 \begin{split}
 \label{eq:predictive_posterior}
 \Variat*{\mat{f_\ast} \given \mat{x_\ast}}
 &= \int \sum_{k=1}^K \Variat*{a_\ast^{\pix{k}} \given \mat{x_\ast}} \Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}} \diff \mat{a_\ast^{\pix{k}}}\\
 &\approx \sum_{k=1}^K \hat{a}_\ast^{\pix{k}} \mat{\hat{f}_\ast^{\pix{k}}}.
 \end{split}
 \end{align}
 The predictive posteriors of the $K$ modes $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow Gaussian processes and can be calculated analytically~\parencite{hensman_gaussian_2013}.
 Samples from the predictive density over $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ can be obtained by sampling from the Gaussian process posteriors $\Variat*{\mat{\alpha_\ast^{\pix{k}}} \given \mat{x_\ast}}$ and renormalizing the resulting vector $\mat{\alpha_\ast}$ using the $\softmax$function.
 The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model's belief about how many and which of the $K$ modes are relevant at the test location $\mat{x_\ast}$.


 %
 \begin{figure*}[t]
 \centering
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/semi_bimodal_joint}
 % \caption{
 % \label{fig:semi_bimodal:a}
 % Joint posterior.
 % }
 \end{subfigure}
 \hfill
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/semi_bimodal_attrib}
 % \caption{
 % \label{fig:semi_bimodal:b}
 % }
 \end{subfigure}
 \caption{
 \label{fig:semi_bimodal}
 The MDGP posterior on an artificial data set with unimodal and bimodal parts.
 The left plot shows the joint predictions which are mixtures of two Gaussians weighed by the assignment probabilities shown in \cref{fig:semi_bimodal:c}.
 The weights are represented via the opacity of the modes, which shows that the green mode is only relevant around the interval $[0, 5]$.
 The right plot shows the posterior estimates of the assignment of the training data to the second mode.
 }
 \end{figure*}
 %
 \begin{figure}[t]
 \centering
 \includestandalone{figures/semi_bimodal_attrib_process}
 \caption{
 \label{fig:semi_bimodal:c}
 Normalized samples from the assignment process $\mat{\alpha}$ of the model shown in \cref{fig:semi_bimodal}.
 These samples are used to weigh the two modes depending on the position in the input space.
 The model has learned that the second mode is only relevant around the interval $[0, 5]$.
 }
 \end{figure}
 %
 \subsection{Deep Gaussian Processes}
 \label{subsec:deep_gp}
 For clarity, we have described the variational bound in terms of a shallow GP.
 However, as long as their variational bound can be efficiently sampled, any model can be used in place of shallow GPs for the $f^{\pix{k}}$.
 Since our approximation is based on DSVI for deep Gaussian processes, an extension to deep GPs is straightforward.
 Analogously to~\parencite{salimbeni_doubly_2017}, our new prior assumption about the $\nth{k}$ latent function values $\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}}$ is given by
 \begin{align}
 \begin{split}
 \Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}} = \prod_{l=1}^L \Prob*{\mat{F_l^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}} \mat{F_{l1}^{\prime\pix{k}}}, \mat{Z_l^{\prime\pix{k}}}}
 \end{split}
 \end{align}
 for an $L$layer deep GP and with $\mat{F_0^{\prime\pix{k}}} \coloneqq \mat{X}$.
 Similar to the singlelayer case, we introduce sets of inducing points $\mat{Z_l^{\prime\pix{k}}}$ and a variational distribution over their corresponding function values $\Variat*{\mat{u_l^{\prime\pix{k}}}} = \Gaussian*{\mat{u_l^{\prime\pix{k}}} \given \mat{m_l^{\prime\pix{k}}}, \mat{S_l^{\prime\pix{k}}}}$.
 We collect the latent multilayer function values as $\mat{F^\prime} = \Set{\mat{F_l^{\prime\pix{k}}}}_{k=1,l=1}^{K,L}$ and corresponding $\mat{U^\prime}$ and assume an extended variational distribution
 \begin{align}
 \begin{split}
 \label{eq:deep_variational_distribution}
 \Variat*{\mat{F^\prime}, \mat{\alpha}, \mat{U^\prime}}
 % = &\Variat*{\mat{\alpha}, \Set*{\mat{u_\alpha^{\pix{k}}}}_{k=1}^K, \Set*{\mat{F_l^{\prime\pix{k}}}, \mat{u_l^{\prime\pix{k}}}}_{k=1,l=1}^{K,L}} \\
 = &\prod_{k=1}^K\prod_{n=1}^N \Prob*{\mat{\alpha_n^{\pix{k}}} \given \mat{u_\alpha^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u_\alpha^{\pix{k}}}} \\
 \MoveEqLeft\prod_{k=1}^K \prod_{l=1}^L \prod_{n=1}^N \Prob*{\mat{f_{n,l}^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}}, \mat{x_n}}\Variat*{\mat{u_l^{\prime\pix{k}}}},
 \end{split}
 \end{align}
 where we identify $\mat{f_n^{\prime\pix{k}}} = \mat{f_{n,L}^{\prime\pix{k}}}$.
 As the $\nth{n}$ marginal of the $\nth{L}$ layer depends only on depends only on the $\nth{n}$ marginal of all layers above sampling from them remains straightforward~\parencite{salimbeni_doubly_2017}.
 The marginal is given by
 \begin{align}
 \begin{split}
 \Variat{\mat{f_{n,L}^{\prime\pix{k}}}} =
 \int
 &\Variat{\mat{f_{n,L}^{\prime\pix{k}}} \given \mat{f_{n,L1}^{\prime\pix{k}}}} \\
 &\prod_{l=1}^{L1} \Variat{\mat{f_{n,l}^{\prime\pix{k}}} \given \mat{f_{n,l1}^{\prime\pix{k}}}}
 \diff \mat{f_{n,l}^{\prime\pix{k}}}.
 \end{split}
 \end{align}

 The complete bound is structurally similar to \cref{eq:variational_bound} and given by
 \begin{align}
 \begin{split}
 \label{eq:deep_variational_bound}
 \Ell^\prime_{\text{MDGP}}
 &= \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{f^\prime_n}}}}{\log \Prob*{\mat{y_n} \given \mat{f^\prime_n}, \mat{a_n}}} \\
 &\quad + \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{\alpha_n}}}}{\log \Prob*{\mat{a_n} \given \mat{\alpha_n}}} \\
 &\quad  \sum_{k=1}^K \sum_{l=1}^L \KL{\Variat{\mat{u_l^{\pix{k}}}}}{\Prob{\mat{u_l^{\pix{k}}} \given \mat{Z_l^{\pix{k}}}}} \\
 &\quad  \sum_{k=1}^K \KL{\Variat*{\mat{u_\alpha^{\pix{k}}}}}{\Prob*{\mat{u_\alpha^{\pix{k}}} \given \mat{Z_\alpha^{\pix{k}}}}}.
 \end{split}
 \end{align}
 To calculate the first term, samples have to be propagated through the deep GP structures.
 This extended bound thus has complexity $\Fun*{\Oh}{NM^2LK}$ to evaluate in the general case and complexity $\Fun*{\Oh}{NM^2\cdot\Fun{\max}{L, K}}$ if the assignments $\mat{a_n}$ take binary values.


 \section{Experiments}
 \label{sec:experiments}
 In this section we investigate the behavior of the MDGP model in multiple regression settings.
 First, we apply the MDGP to an artificial data set and showcase how the different components of the model interact to identify unimodal and multimodal parts of the input space.
 Second, we show how different priors on the different modes can be used to separate a signal from unrelated noise.
 Finally, we investigate a data set which contains observations of two independent dynamical systems mixed together and show how the MDGP can recover information about both systems and infer the state variable separating the systems.

 We use an implementation of MDGP in TensorFlow~\parencite{tensorflow2015whitepaper} based on GPflow~\parencite{matthews_gpflow_2017} and the implementation of DSVI~\parencite{salimbeni_doubly_2017}.


 \subsection{Artificial Data Set}
 \label{subsec:semi_bimodal}
 To demonstrate inference in our model, we begin with an experiment based on an artificial data set.
 The data, together with recovered posterior attributions, can be seen in \cref{fig:semi_bimodal}.
 We uniformly sampled 350 data points in the interval $ x \in [2\pi, 2\pi]$ and obtained $y$ as $y = \Fun{\sin}{x}  \delta \cdot 2 \Fun{\exp}{0.5 \cdot (x2)^2} + \epsilon$ with a Bernoullidistributed $\delta \sim \Fun{\Ber}{0.5}$ to introduce a multimodality and additive independent noise $\epsilon \sim \Gaussian*{0, 0.005^2}$.

 We use squared exponential kernels as priors for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and $25$ inducing points in every GP.
 \Cref{fig:semi_bimodal,fig:semi_bimodal:c} show the posterior of a bimodal MDGP applied to the data, which correctly identified the underlying functions.
 \Cref{fig:semi_bimodal} shows the posterior belief about the assignments $\mat{A}$ and illustrates that MDGP separated the input space in a part which is unimodal where all points have been assigned to the same mode and a bimodal part where both modes are necessary to explain the data.

 This distinction is explicitly represented in the model via the assignment processes $\mat{\alpha}$ shown in \cref{fig:semi_bimodal:c}.
 The model has learned that the second mode is irrelevant at, for example, $x=5$ and therefore predictions using \cref{eq:predictive_posterior} simplify to a standard GP posterior.
 The MDGP is implicitly incentivized to explain the data using only one mode if possible through the likelihood term of the inferred $\mat{a_n}$ in \cref{eq:variational_bound}.
 At $x = 10$ both the two modes and the assignment processes start reverting to their respective priors away from the data.


 \subsection{Robust Regression}
 \label{subsec:choicenet}
 %
 \begin{figure*}[t]
 \centering
 \captionof{table}{
 \label{tab:choicenet}
 Results on the ChoiceNet data set.
 The gray part of the table shows the results from~\parencite{choi_choicenet_2018}.
 We report RMSE comparable to the previous results together with MLL.
 Both are calculated based on a test set of 1000 equally spaced samples of the noiseless underlying function.
 }%
 \newcolumntype{Y}{>{\centering\arraybackslash}X}%
 \newcolumntype{Z}{>{\columncolor{sStone!33}\centering\arraybackslash}X}%
 \begin{tabularx}{\linewidth}{rYYZZZZZZ}
 \toprule
 Outliers & MDGP (MLL) & MDGP (RMSE) & CN & MDN & MLP & GPR & LGPR & RGPR \\
 \midrule
 0\,\% & 2.86 & \textbf{0.008} & 0.034 & 0.028 & 0.039 & \textbf{0.008} & 0.022 & 0.017 \\
 20\,\% & 2.71 & \textbf{0.008} & 0.022 & 0.087 & 0.413 & 0.280 & 0.206 & 0.013 \\
 40\,\% & 2.12 & \textbf{0.005} & 0.018 & 0.565 & 0.452 & 0.447 & 0.439 & 1.322 \\
 60\,\% & 0.874 & 0.031 & \textbf{0.023} & 0.645 & 0.636 & 0.602 & 0.579 & 0.738 \\
 80\,\% & 0.126 & 0.128 & \textbf{0.084} & 0.778 & 0.829 & 0.779 & 0.777 & 1.523 \\
 \bottomrule
 \end{tabularx}
 \\[.5\baselineskip]
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_joint_40}
 \end{subfigure}
 \hfill
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_attrib_40}
 \end{subfigure}
 \\
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_joint}
 \end{subfigure}
 \hfill
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_attrib}
 \end{subfigure}
 \captionof{figure}{
 \label{fig:choicenet}
 MDGP on the ChoiceNet data set with 40\,\% outliers (upper row) and 60\,\% outliers (lower row).
 We show the joint posterior (left) and assignment probabilities (right).
 The bimodal MDGP identifies the signal perfectly up to 40\,\% outliers.
 For 60\,\% outliers, some of the noise is interpreted as signal, but the latent function is still recovered.
 }
 \end{figure*}
 %
 Our second experiment applies MDGP to a onedimensional regression problem with uniformly distributed outliers in the training data.
 We use a task proposed by~\textcite{choi_choicenet_2018} where we sample inputs $x \in [3, 3]$ uniformly and apply the function $\Fun{f}{x} = (1  \delta)\Fun{\cos}{\sfrac{\pi}{2} \cdot x}\Fun{\exp}{(\sfrac{x}{2})^2} + \delta \cdot \epsilon$, where $\delta \sim \Fun{\Ber}{\lambda}$ and $\epsilon \sim \Fun{\Uni}{1, 3}$.
 That is, a fraction $\lambda$ of the training data, the outliers, are replaced by uniform noise.
 We sample a total of 1000 data points and use $25$ inducing points for every GP in our model.

 Every mode in our model can use a different kernel and therefore encode different prior assumptions.
 For robust regression, we use a bimodal model consisting of a mode with a squared exponential kernel and one with a white noise kernel.
 To avoid pathological solutions for high outlier ratios, we add a prior to the likelihood variance of the first mode, which encodes our assumption that there actually is a signal in the training data.

 The model proposed in~\parencite{choi_choicenet_2018}, called ChoiceNet (CN), is a specific neural network structure and inference algorithm to deal with corrupted data.
 In their work, they compare their approach to a standard multilayer perceptron (MLP), a mixture density network (MDN), standard Gaussian process regression (GPR), leveraged Gaussian process regression (LGPR)~\parencite{choi_robust_2016}, and infinite mixtures of Gaussian processes (RGPR)~\parencite{rasmussen_infinite_2002}.
 \Cref{tab:choicenet} shows results for outlier rates varied from 0\,\% to 80\,\%.
 Besides the root mean squared error (RMSE), we also report the mean test log likelihood (MLL) of the mode representing the target function in our model.

 Up to an outlier rate of 40\,\%, our model correctly identifies the outliers and ignores them, resulting in a predictive posterior of the signal mode equivalent to standard GP regression without outliers.
 In the special case of 0\,\% outliers, MDGP correctly identifies that the mode modelling the noise is not necessary and disables it, thereby simplifying itself to standard GP regression.
 For high outlier rates, strong prior knowledge about the signal would be required to still identify it perfectly.
 \Cref{fig:choicenet} shows the posterior for an outlier rate of 60\,\%.
 While the function has still been identified well, some of the noise is also explained using this mode, thereby introducing slight errors in the predictions.


 \subsection{Mixed CartPole Systems}
 \label{subsec:cartpole}
 \begin{table*}[t]
 \centering
 \caption{
 \label{tab:cartpole}
 Results on the cartpole data set.
 We report mean log likelihoods with standard error for ten runs.
 }%
 \sisetup{
 tableformat=1.3(3),
 tablenumberalignment=center,
 separateuncertainty,
 % tablealignuncertainty,
 tablefiguresuncertainty=1,
 detectweight,
 }
 \newcolumntype{H}{>{\setbox0=\hbox\bgroup}c<{\egroup}@{}}
 \footnotesize
 \setlength{\tabcolsep}{1pt}
 \begin{tabular}{HlSSSSSS}
 \toprule
 & & \multicolumn{2}{c}{Mixed} & \multicolumn{2}{c}{Default only} & \multicolumn{2}{c}{Shortpole only} \\
 \cmidrule(lr){34} \cmidrule(lr){56} \cmidrule(lr){78}
 Runs & & {Train} & {Test} & {Train} & {Test} & {Train} & {Test} \\
 \midrule
 10 & MDGP & \bfseries 0.575 \pm 0.013 & \bfseries 0.521 \pm 0.009 & 0.855 \pm 0.002 & 0.844 \pm 0.002 & 0.686 \pm 0.009 & 0.602 \pm 0.005 \\
 % 10 & UMDGP & 0.472 \pm 0.002 & 0.425 \pm 0.002 & {\textemdash} & {\textemdash} & {\textemdash} & {\textemdash} \\
 10 & MDGP 2 & 0.548 \pm 0.012 & \bfseries 0.519 \pm 0.008 & 0.851 \pm 0.003 & 0.859 \pm 0.001 & 0.673 \pm 0.013 & 0.599 \pm 0.011 \\
 10 & MDGP 3 & 0.527 \pm 0.004 & 0.491 \pm 0.003 & 0.858 \pm 0.002 & 0.852 \pm 0.002 & 0.624 \pm 0.011 & 0.545 \pm 0.012 \\
 10 & MDGP 4 & 0.517 \pm 0.006 & 0.485 \pm 0.003 & 0.858 \pm 0.001 & 0.852 \pm 0.002 & 0.602 \pm 0.011 & 0.546 \pm 0.010 \\
 10 & MDGP 5 & 0.535 \pm 0.004 & 0.506 \pm 0.005 & 0.851 \pm 0.003 & 0.851 \pm 0.003 & 0.662 \pm 0.009 & 0.581 \pm 0.012 \\
 \addlinespace
 10 & BNN+LV & 0.519 \pm 0.005 & \bfseries 0.524 \pm 0.005 & {\textemdash} & {\textemdash} & {\textemdash} & {\textemdash} \\
 10 & GPR Mixed & 0.452 \pm 0.003 & 0.421 \pm 0.003 & {\textemdash} & {\textemdash} & {\textemdash} & {\textemdash} \\
 10 & GPR Default & {\textemdash} & {\textemdash} & \bfseries 0.873 \pm 0.001 & \bfseries 0.867 \pm 0.001 & 7.01 \pm 0.11 & 7.54 \pm 0.14 \\
 10 & GPR Short & {\textemdash} & {\textemdash} & 5.24 \pm 0.04 & 5.14 \pm 0.04 & \bfseries 0.903 \pm 0.003 & \bfseries 0.792 \pm 0.003 \\
 \bottomrule
 \end{tabular}
 \end{table*}
 Our third experiment is based on the cartpole benchmark for reinforcement learning as described by~\textcite{barto_neuronlike_1983} and implemented in OpenAI Gym~\parencite{brockman_openai_2016}.
 In this benchmark, the objective is to apply forces to a cart moving on a frictionless track to keep a pole, which is attached to the cart via a joint, in an upright position.
 We consider the regression problem of predicting the change of the pole's angle given the current state of the cart and the action applied.
 The current state of the cart consists of the cart's position and velocity and the pole's angular position and velocity.
 To simulate a dynamical system with changing states of operation our experimental setup is to sample trajectories from two different cartpole systems and merging the resulting data into one training set.
 The task is not only to learn a model which explains this data well, but to recover the multimodalities introduced by the different system configurations.
 This task is important in reinforcement learning settings where we study systems that go through state changes which need to be inferred.

 We sample trajectories from the system by initializing the pole in an almost upright position and then applying 10 uniform random actions.
 We add Gaussian noise $\epsilon \sim \Gaussian*{0, 0.01^2}$ to the observed angle changes.
 To increase the nonlinearity of the dynamics, we apply the action for five consecutive time steps and allow the pole to swing freely instead of ending the trajectory after reaching a specific angle.
 The data set consists of 500 points sampled from the \emph{default} cartpole system and another 500 points sampled from a \emph{shortpole} cartpole system in which we halve the mass of the pole to 0.05 and shorten the pole to 0.1, a tenth of its default length.
 This shortpole system is more unstable and the pole reaches higher speeds.
 Predictions in this system therefore have to take the multimodality into account, as mean predictions between the more stable and the more unstable system can never be observed.
 We consider three test sets, one sampled from the default system, one sampled from the shortpole system, and a mixture of the two.
 They are generated by sampling trajectories with an aggregated size of 5000 points from each system for the first two sets and their concatenation for the mixed set.

 For this data set, we use squared exponential kernels for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and 100 inducing points in every GP.
 We evaluate the performance of deep GPs with up to 5 layers and squared exponential kernels as models for the different modes.
 As described in~\parencite{salimbeni_doubly_2017}, we use identity mean functions for all but the last layers and initialize the variational distributions with low covariances.
 We compare our models with threelayer reluactivated Bayesian neural networks with added latent variables (BNN+LV) as introduced by~\textcite{depeweg_learning_2016}.
 These latent variables can be used to effectively model multimodalities and stochasticity in dynamical systems for modelbased reinforcement learning~\parencite{depeweg_decomposition_2018}.
 We also compare to three kinds of sparse GPs~\parencite{hensman_scalable_2015}.
 They are trained on the mixed data set, the default system and the shortpole system respectively.

 \Cref{tab:cartpole} shows mean training and test log likelihoods and their standard error over ten runs for these models.
 The \emph{mixed}column corresponds to training and test log likelihoods for a standard regression problem, which in this case is a bimodal one.
 The GPR model trained on the mixed data set shows the worst performance, since its predictions are single Gaussians spanning both modes.
 Additionally, the mean prediction is approximately the mean of the two modes and is physically implausible.
 Both the BNN+LV and MDGP models perform substantially better as they can model the bimodality.
 BNN+LV assumes continuous latent variables and a bimodal distribution can be recovered by approximately marginalizing these latent variables via sampling.
 The predictive posterior of unknown shape is approximated using a mixture of many Gaussians.
 Compared to the shallow MDGP, the prior of BNN+LV is harder to interpret, as the MDGP's generative process produces a mixture of two Gaussians representing the two modes in the data.
 Adding more layers to the MDGP model leads to more expressive models whose priors on the different modes become less informative.
 For this cartpole data, twolayer deep GPs seem to be a good compromise between model expressiveness and the strength of the prior, as they are best able to separate the data into the two separate dynamics.

 On the \emph{mixed} test set, MDGP and BNN+LV both show comparable likelihoods.
 However, the MDGP is a more expressive model, whose different modes can be evaluated further.
 The results in the \emph{default only} and \emph{shortpole only} columns compare training and test likelihoods on the parts of the training and test sets corresponding to these systems respectively.
 We calculate these values by evaluating both modes separately on the data sets and reporting the higher likelihood.
 We compare these results with sparse GP models trained only on the respective systems.
 The two modes of MDGP reliably separate the two different systems.
 In fact, the mode corresponding to the \emph{default} system in the twolayer MDGP shows equal test performance to the corresponding GPR model trained only on data of this mode.
 The \emph{default} and \emph{shortpole} systems are sufficiently different such that the sparse GPs trained on only one of the two sets performs very poorly.
 Out of these two systems, the \emph{shortpole} system is more complicated and harder to learn.
 The second mode of MDGP still recovers an adequate model.
 Given the fact that the two modes of MDGP model the two system dynamics in the original data, sampling trajectories from them results in physically plausible data, which is not possible with a sparse GP or BNN+LV model.


 \section{Conclusion}
 \label{sec:conclusion}
 We have presented a fully Bayesian model for the data association problem.
 Our model factorises the observed data into a set of independent processes and provides a model over both the processes and their association to the observed data.
 The data association problem is inherently illconstrained and requires significant assumptions to recover a solution.
 In this paper, we make use of interpretable Gaussian process priors allowing global a priori information to be included into the model.
 Importantly, our model is able to exploit information both about the underlying functions and the association structure.
 We have derived a principled approximation to the marginal likelihood which allows us to perform inference for flexible hierarchical processes.
 In future work, we would like to incorporate the proposed model in a reinforcement learning scenario where we study a dynamical system with state changes.


 \printbibliography

 \end{document}
