
We use squared exponential kernels as priors for both the $f^{\pix{k}}$ and the $\alpha^{\pix{k}}$, and $25$ inducing points in every GP.
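As a minimal sketch of this setup (not the paper's implementation; the input range and the unit hyperparameters are illustrative assumptions), the squared exponential kernel and the prior covariance over a set of $25$ inducing outputs can be written as:

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared exponential kernel: k(x, x') = s^2 * exp(-(x - x')^2 / (2 l^2)).
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# 25 inducing point locations spread over an assumed input range.
Z = np.linspace(-10.0, 10.0, 25)

# Prior covariance of the inducing outputs u = f(Z) for one latent process.
Kzz = se_kernel(Z, Z)
print(Kzz.shape)  # prints (25, 25)
```

The same construction applies to each mode's function $f^{\pix{k}}$ and assignment process $\alpha^{\pix{k}}$, each with its own kernel hyperparameters and inducing locations.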

\Cref{fig:semi_bimodal} shows the posterior of a DAGP with $K = 4$ modes applied to the data, which correctly identified the underlying functions.

The figure shows the posterior belief about the assignments $\mat{A}$ and illustrates that DAGP recovered that it needs only three of the four available modes to explain the data.

One of the modes is assigned points only in the interval $[0, 5]$, where the data is actually trimodal.

This separation is explicitly represented in the model via the assignment processes $\mat{\alpha}$ (bottom panel in \cref{fig:semi_bimodal}).

Importantly, we do not only cluster the data; we factorize the input space.

The model has disabled mode $k = 2$ over the complete input space and has learned that mode $k = 1$ is only relevant in the interval $[0, 5]$, where the three enabled modes each explain about a third of the data.

Outside this interval, the model has learned that one of the modes has about twice the assignment probability of the other, thus correctly reconstructing the true generative process.
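These per-mode assignment probabilities arise from normalizing the latent $\alpha^{\pix{k}}$ values across modes at every input. A minimal sketch, assuming a softmax normalization (the specific link function and all numeric values are illustrative assumptions, not necessarily the paper's choice):

```python
import numpy as np

def assignment_probs(alpha):
    # alpha has shape (K, N): one latent assignment value per mode k and input n.
    # A softmax across modes yields, per input, a categorical distribution over
    # which mode is responsible for the observation.
    a = alpha - alpha.max(axis=0, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

# Illustrative values at a single input: one active mode ends up with roughly
# twice the probability of the other, while a disabled mode gets almost none.
alpha = np.array([[1.0], [0.3], [-5.0]])
p = assignment_probs(alpha)
```

A mode is effectively "disabled" wherever its latent assignment value is pushed far below the others, so its normalized probability is negligible.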

The DAGP is implicitly incentivized to explain the data using as few modes as possible through the likelihood term of the inferred $\mat{a_n}$ in \cref{eq:variational_bound}.



The GPR model predicts a unimodal posterior for the mixed data set which covers both systems.

Its mean prediction is approximately the mean of the two regimes and is physically implausible.

The DAGP and BNN+LV models yield informative multi-modal predictions with comparable performance.

In our setup, OMGP could not successfully solve the data association problem and thus does not produce a useful joint posterior.

A possible explanation is that OMGP's inference scheme is tailored to ordered one-dimensional problems and does not trivially translate to the 4D cart-pole problem.

As BNN+LV does not explicitly solve the data association problem, the model does not yield sub-models for the two different systems.

OMGP and DAGP yield such sub-models, which can be used independently for predictions in either the default or the short-pole system.