
\bottomrule

\end{tabularx}

\end{table}

In this section we investigate the behavior of the DAGP model in multiple regression settings.

First, we show how prior knowledge about the different generative processes can be used to separate a signal from unrelated noise.

Second, we apply the DAGP to a multimodal data set and showcase how the different components of the model interact to identify how many modes are necessary to explain the data.

Finally, we investigate a data set which contains observations of two independent dynamical systems mixed together and show how the DAGP can recover information about both systems and infer the state variable separating the systems.


We use an implementation of DAGP in TensorFlow~\parencite{tensorflow2015-whitepaper} based on GPflow~\parencite{matthews_gpflow_2017} and the implementation of DSVI~\parencite{salimbeni_doubly_2017}.

\Cref{tab:model_capabilities} compares qualitative properties of DAGP and related work.

Neither standard Gaussian process regression (GPR) nor multi-layer perceptrons (MLP) impose structure which enables the model to handle multi-modal data.

Mixture density networks (MDN)~\parencite{bishop_mixture_1994} and the infinite mixtures of Gaussian processes (RGPR)~\parencite{rasmussen_infinite_2002} model yield multi-modal posteriors through mixtures with many components but do not solve an association problem.

Similarly, Bayesian neural networks with added latent variables (BNN+LV)~\parencite{depeweg_learning_2016} represent such a mixture through a continuous latent variable.

Both the overlapping mixtures of Gaussian processes (OMGP)~\parencite{lazaro-gredilla_overlapping_2012} model and DAGP explicitly model the data association problem and yield independent models for the different generating processes.

However, OMGP assumes global relevance of the different modes.

In contrast, DAGP infers a spatial posterior of this relevance.

We evaluate our model on three problems to highlight advantages of the explicit structure of DAGP:

\emph{Interpretable priors give structure to ill-posed association problems.}

In \cref{subsec:choicenet} we consider a noise separation problem, where a signal of interest is disturbed with uniform noise.

To solve this problem, assumptions about what constitutes a signal are needed.

The hierarchical structure of DAGP allows us to formulate independent and interpretable priors on the noise and signal processes.

\emph{Predictive associations represent knowledge about the relevance of generative processes.}

In \cref{subsec:semi_bimodal} we investigate the implicit incentive of DAGP to explain data using as few processes as possible.

In addition to a joint posterior explaining the data, DAGP also gives insight into the relative importance of the different processes in different parts of the input space.

DAGP is able to explicitly recover the changing number of modes in a data set.

\emph{Separate models for independent generating processes avoid model pollution.}

In \cref{subsec:cartpole} we simulate a system with multiple operational regimes via mixed observations of two different cart-pole systems.

DAGP successfully learns an informative joint posterior by solving the underlying association problem.

We show that the DAGP posterior contains two separate models for the two original operational regimes.

\subsection{Noise Separation}


}

\end{figure}

%

We begin with an experiment based on a noise separation problem.


We apply DAGP to a one-dimensional regression problem with uniformly distributed asymmetric outliers in the training data.

We use a task proposed by~\textcite{choi_choicenet_2018} where we sample $x \in[-3, 3]$ uniformly and apply the function $\Fun{f}{x}=(1-\delta)(\Fun{\cos}{\sfrac{\pi}{2}\cdot x}\Fun{\exp}{-(\sfrac{x}{2})^2}+\gamma)+\delta\cdot\epsilon$, where $\delta\sim\Fun{\Ber}{\lambda}$, $\epsilon\sim\Fun{\Uni}{-1, 3}$ and $\gamma\sim\Gaussian{0, 0.15^2}$.

That is, a fraction $\lambda$ of the training data, the outliers, are replaced by asymmetric uniform noise.
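As an illustration, this generative process can be sampled directly. The following NumPy sketch draws a training set for a given outlier rate $\lambda$; the function and variable names are ours and do not come from the reference implementation.

```python
import numpy as np

def sample_noise_separation_task(n, outlier_rate, seed=0):
    """Sample the synthetic task of Choi et al. (2018): a smooth signal
    with a fraction `outlier_rate` of points replaced by uniform noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=n)
    delta = rng.binomial(1, outlier_rate, size=n)   # delta ~ Ber(lambda)
    eps = rng.uniform(-1.0, 3.0, size=n)            # eps ~ Uni(-1, 3)
    gamma = rng.normal(0.0, 0.15, size=n)           # gamma ~ N(0, 0.15^2)
    signal = np.cos(np.pi / 2 * x) * np.exp(-(x / 2) ** 2) + gamma
    y = (1 - delta) * signal + delta * eps
    return x, y, delta

x, y, is_outlier = sample_noise_separation_task(1000, outlier_rate=0.4)
```

Note that the uniform noise component is asymmetric around the signal, which is what makes naive robust-regression heuristics based on symmetric noise assumptions fail on this task.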


To avoid pathological solutions for high outlier ratios, we add a prior to the likelihood variance of the first process, which encodes our assumption that there actually is a signal in the training data.

The model proposed in~\parencite{choi_choicenet_2018}, called ChoiceNet (CN), is a specific neural network structure and inference algorithm to deal with corrupted data.

In their work, they compare their approach to a standard multi-layer perceptron (MLP), a mixture density network (MDN), standard Gaussian process regression (GPR), leveraged Gaussian process regression (LGPR)~\parencite{choi_robust_2016}, and infinite mixtures of Gaussian processes (RGPR)~\parencite{rasmussen_infinite_2002}.


We add experiments for both DAGP and OMGP.

\Cref{tab:choicenet} shows results for outlier rates varied from 0\,\% to 80\,\%.


Besides the root mean squared error (RMSE) reported in~\parencite{choi_choicenet_2018}, we also report the mean test log likelihood (MLL).
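For concreteness, both metrics can be computed from per-point Gaussian predictive means and variances. The following is a minimal sketch under the assumption that such predictions are available as arrays; it is not taken from the paper's implementation.

```python
import numpy as np

def rmse(y_true, pred_mean):
    """Root mean squared error of the predictive mean."""
    return float(np.sqrt(np.mean((y_true - pred_mean) ** 2)))

def mean_log_likelihood(y_true, pred_mean, pred_var):
    """Mean predictive log density under per-point Gaussian predictions."""
    ll = -0.5 * (np.log(2 * np.pi * pred_var)
                 + (y_true - pred_mean) ** 2 / pred_var)
    return float(np.mean(ll))

# Toy predictions, purely for illustration.
y = np.array([0.0, 1.0, 2.0])
mu = np.array([0.1, 0.9, 2.2])
var = np.array([0.25, 0.25, 0.25])
err = rmse(y, mu)
mll = mean_log_likelihood(y, mu, var)
```

Unlike the RMSE, the MLL also rewards well-calibrated predictive variances, which is why it is the more informative metric for the multi-modal models considered here.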


Since we can encode the same prior knowledge about the signal and noise processes in both OMGP and DAGP, the results of the two models are comparable:

For low outlier rates, they correctly identify the outliers and ignore them, resulting in a predictive posterior of the signal equivalent to standard GP regression without outliers.

In the special case of 0\,\% outliers, the models correctly identify that the process modelling the noise is not necessary, thereby simplifying to standard GP regression.

For high outlier rates, stronger prior knowledge about the signal is required to still identify it perfectly.

\Cref{fig:choicenet} shows the DAGP posterior for an outlier rate of 60\,\%.

While the function has still been identified well, some of the noise is also explained using this process, thereby introducing slight errors in the predictions.


\label{tab:cartpole}

Results on the cart-pole data set.

We report mean log likelihoods with standard error for ten runs.

The upper results are obtained by training the model on the mixed data set and evaluating it jointly (left) on multi-modal predictions.

Additionally, we evaluate the two inferred sub-models for the default system (center) and short-pole system (right) if possible.

As a baseline, we report results of separate standard GPR models trained on the mixed set, the default set, and the short-pole set respectively.

}%

\sisetup{

table-format=-1.3(3),



For this data set, we use squared exponential kernels for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and 100 inducing points in every GP.
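A minimal NumPy version of the squared exponential kernel is given below for reference; the experiments themselves use GPflow's implementation, so this sketch is only illustrative.

```python
import numpy as np

def squared_exponential(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel k(x, x') = s^2 * exp(-|x - x'|^2 / (2 l^2))
    evaluated between all rows of X1 (n x d) and X2 (m x d)."""
    sq_dist = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
K = squared_exponential(X, X, lengthscale=0.5)
```

Since the same kernel family is used for both the $f^{\pix{k}}$ and the $\alpha^{\pix{k}}$, the smoothness assumptions on the mode functions and on the assignment processes are of the same kind, differing only in their learned hyperparameters.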

We evaluate the performance of deep GPs with up to 3 layers and squared exponential kernels as models for the different functions.


As described in~\parencite{salimbeni_doubly_2017,kaiser_bayesian_2018}, we use identity mean functions for all but the last layers and initialize the variational distributions with low covariances.

We compare our models with OMGP and three-layer relu-activated Bayesian neural networks with added latent variables (BNN+LV).

The latent variables can be used to effectively model multimodalities and stochasticity in dynamical systems for model-based reinforcement learning~\parencite{depeweg_decomposition_2018}.

We also compare to three kinds of sparse GPs (GPR)~\parencite{hensman_scalable_2015}.

They are trained on the mixed data set, the default system and the short-pole system respectively, and serve as a baseline comparison, as these models cannot handle multi-modal data.

\Cref{tab:cartpole} shows results for ten runs of these models.

The GPR model predicts a unimodal posterior for the mixed data set which covers both systems.

Its mean prediction is approximately the mean of the two regimes and is physically implausible.

The DAGP and BNN+LV models yield informative multi-modal predictions with comparable performance.

In our setup, OMGP could not successfully solve the data association problem and thus does not produce a good joint posterior.

A possible explanation is that OMGP's inference scheme is optimized for ordered one-dimensional training sets, which is not the case in the 4D cart-pole problem.

As BNN+LV does not explicitly solve the data association problem, the model does not yield sub-models for the two different systems.

OMGP and DAGP yield such sub-models which can independently be used for predictions in the default or short-pole systems.

Samples drawn from these models can be used to generate physically plausible trajectories in the respective system.
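Such trajectory sampling amounts to an autoregressive rollout of a one-step dynamics model. The sketch below is schematic: `toy_dynamics` is a stand-in for a learned sub-model, and neither the cart-pole dynamics nor the DAGP posterior are reproduced here.

```python
import numpy as np

def rollout(dynamics_sample, x0, steps, rng):
    """Roll out a trajectory by repeatedly sampling a one-step dynamics
    model (here a stand-in for one inferred sub-model)."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        xs.append(xs[-1] + dynamics_sample(xs[-1], rng))
    return np.stack(xs)

# Stand-in dynamics: a damped linear system with small process noise.
def toy_dynamics(x, rng):
    return -0.1 * x + rng.normal(0.0, 0.01, size=x.shape)

traj = rollout(toy_dynamics, x0=np.ones(4), steps=50,
               rng=np.random.default_rng(0))
```

Because each step is sampled from a single sub-model rather than from a bimodal joint posterior, consecutive states stay consistent with one operational regime, which is what makes the resulting trajectories physically plausible.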

OMGP fails to model the short-pole system but does yield a viable model for the default system which evolves more slowly due to higher torque and is therefore easier to learn.

In contrast, the two sub-models inferred by DAGP perform well on their respective systems, showing that DAGP reliably solves the data association problem and successfully avoids model pollution by separating the two systems well.

Given this separation, shallow and deep models for the two modes show comparable performance.

The more expressive deep GPs model the default system slightly better while sacrificing performance on the more difficult short-pole system.