
The priors for the $f^{\pix{k}}$ can be chosen independently to encode different prior assumptions about the modes.
In \cref{subsec:choicenet} we use different kernels to separate a nonlinear signal from a noise process.
Going further, we can also use deep Gaussian processes as priors for the $f^{\pix{k}}$~\parencite{damianou_deep_2013, salimbeni_doubly_2017}.
Since many real-world systems are inherently hierarchical, prior knowledge can often be formulated more easily using composite functions~\parencite{kaiser_bayesian_2017}.

\section{Variational Approximation}


\subsection{Variational Lower Bound}
\label{subsec:lower_bound}
As first introduced by~\textcite{titsias_variational_2009}, we augment all Gaussian processes in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
We collect them as $\mat{Z} = \Set*{\mat{Z^{\pix{k}}}, \mat{Z_\alpha^{\pix{k}}}}_{k=1}^K$ and $\mat{U} = \Set*{\mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K$.
Taking the function $f^{\pix{k}}$ and its corresponding GP as an example, the inducing variables $\mat{u^{\pix{k}}}$ are jointly Gaussian with the latent function values $\mat{F^{\pix{k}}}$ of the observed data by the definition of GPs.
We follow~\parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
This formulation introduces the set $\Set*{\mat{Z^{\pix{k}}}, \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$ of variational parameters indicated in~\cref{fig:dynamic_graphical_model}.
To simplify notation we drop the dependency on $\mat{Z}$ in the following.
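Marginalizing the inducing variables under this approximation recovers the standard sparse GP form~\parencite{hensman_gaussian_2013}; writing $\mat{K_{nm}^{\pix{k}}}$ for the kernel matrix between $\mat{X}$ and $\mat{Z^{\pix{k}}}$ and $\mat{K_{mm}^{\pix{k}}}$ for the one between the inducing points, we obtain
\begin{align}
\begin{split}
\Variat*{\mat{F^{\pix{k}}}} = \int \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}} \Variat*{\mat{u^{\pix{k}}}} \,\mathrm{d}\mat{u^{\pix{k}}} \\
= \Gaussian*{\mat{F^{\pix{k}}} \given \mat{A^{\pix{k}}} \mat{m^{\pix{k}}}, \mat{K_{nn}^{\pix{k}}} + \mat{A^{\pix{k}}} \left( \mat{S^{\pix{k}}} - \mat{K_{mm}^{\pix{k}}} \right) \mat{A^{\pix{k}}}^\top}
\end{split}
\end{align}
with $\mat{A^{\pix{k}}} = \mat{K_{nm}^{\pix{k}}} \left( \mat{K_{mm}^{\pix{k}}} \right)^{-1}$.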

Given a set of assignments $\mat{A}$, this factorization along the data points is preserved in our model due to the assumed independence of the different modes in~\cref{eq:true_marginal_likelihood}.
The independence is lost if the assignments are unknown.
In this case, both the (a priori independent) assignment processes and the modes influence each other through data with unclear assignments.
Following the ideas of Doubly Stochastic Variational Inference (DSVI) presented by~\textcite{salimbeni_doubly_2017} in the context of deep Gaussian processes, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
That is, our variational posterior takes the factorized form,
\begin{align}
\begin{split}
\subsection{Optimization of the Lower Bound}
\label{subsec:computation}
An important property of the variational bound for DSVI described in~\parencite{salimbeni_doubly_2017} is that taking samples for single data points is straightforward and can be implemented efficiently.
Specifically, for some $k$ and $n$, samples $\mat{\hat{f}_n^{\pix{k}}}$ from $\Variat*{\mat{f_n^{\pix{k}}}}$ are independent of all other parts of the model and can be drawn using samples from univariate unit Gaussians via reparametrizations~\parencite{kingma_variational_2015,rezende_stochastic_2014}.

While it would not be necessary to sample from the different modes, since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically~\parencite{hensman_gaussian_2013}, we apply this idea to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$.
For $\mat{\alpha}$, the analytical propagation of uncertainties through the $\softmax$ renormalization and multinomial likelihoods is intractable but can easily be evaluated using sampling.

We optimize $\Ell_{\text{MDGP}}$ to simultaneously recover maximum likelihood estimates of the hyperparameters $\mat{\theta}$, the variational parameters $\Set*{\mat{Z}, \mat{m}, \mat{S}}$, and assignments $\mat{A}$.


Because of this, special care must be taken during optimization to enforce the sparsity of $\Variat*{\mat{a_n}}$.

To avoid this problem, we propose using a different relaxation based on additional stochasticity.
Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a Concrete random variable as suggested by~\textcite{maddison_concrete_2016}, whose parameters are given by $\Variat*{\mat{a_n}}$.
Based on a temperature parameter $\lambda$, a Concrete random variable enforces sparsity but is also continuous and yields informative gradients using automatic differentiation.
Samples from a Concrete random variable are unit vectors and for $\lambda \to 0$ their distribution approaches a discrete distribution.
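As an illustration, the following sketch draws one relaxed assignment vector for a single data point in NumPy; the helper name is ours, and the marginal means and variances of the assignment processes at the $\nth{n}$ data point are assumed to be given.
\begin{verbatim}
import numpy as np

def sample_assignment(alpha_mean, alpha_var, lam, rng=None):
    """Draw one relaxed assignment vector a_n for a data point.

    alpha_mean, alpha_var: length-K marginal moments of q(alpha_n).
    lam: Concrete temperature; for lam -> 0 the sample approaches
    a one-hot vector.
    """
    rng = rng or np.random.default_rng()
    # Reparametrized samples of the assignment processes.
    eps = rng.standard_normal(len(alpha_mean))
    alpha_hat = alpha_mean + np.sqrt(alpha_var) * eps
    # Softmax renormalization yields the Concrete parameters.
    shifted = alpha_hat - alpha_hat.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # Concrete (Gumbel-softmax) sample (Maddison et al., 2016).
    gumbel = -np.log(-np.log(rng.uniform(size=len(log_probs))))
    relaxed = (log_probs + gumbel) / lam
    a_hat = np.exp(relaxed - relaxed.max())
    return a_hat / a_hat.sum()
\end{verbatim}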

Our approximate evaluation of the bound in \cref{eq:variational_bound} during optimization has multiple sources of stochasticity, all of which are unbiased.
First, we approximate the expectations using Monte Carlo samples $\mat{\hat{f}_n^{\pix{k}}}$, $\mat{\hat{\alpha}_n^{\pix{k}}}$, and $\mat{\hat{a}_n}$.
And second, the factorization of the bound along the data allows us to use minibatches for optimization~\parencite{salimbeni_doubly_2017, hensman_gaussian_2013}.
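A single minibatch update can then be sketched as follows in TensorFlow, where \texttt{model} is a placeholder for an object exposing the sampled bound and its trainable variables rather than an interface of any particular library.
\begin{verbatim}
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

def training_step(model, x_batch, y_batch):
    # model.elbo is assumed to evaluate the sampled bound L_MDGP
    # for the given minibatch.
    with tf.GradientTape() as tape:
        loss = -model.elbo(x_batch, y_batch)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
\end{verbatim}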

\subsection{Approximate Predictions}
&\approx \sum_{k=1}^K \hat{a}_\ast^{\pix{k}} \mat{\hat{f}_\ast^{\pix{k}}}.
\end{split}
\end{align}
The predictive posteriors of the $K$ modes $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow Gaussian processes and can be calculated analytically~\parencite{hensman_gaussian_2013}.
Samples from the predictive density over $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ can be obtained by sampling from the Gaussian process posteriors $\Variat*{\mat{\alpha_\ast^{\pix{k}}} \given \mat{x_\ast}}$ and renormalizing the resulting vector $\mat{\alpha_\ast}$ using the $\softmax$ function.
The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model's belief about how many and which of the $K$ modes are relevant at the test location $\mat{x_\ast}$.
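In code, one sample from the approximate predictive density at $\mat{x_\ast}$ mirrors the sum above; the per-mode predictive moments and the $\alpha$-marginals at $\mat{x_\ast}$ are assumed given, and the helper is again only a sketch.
\begin{verbatim}
import numpy as np

def predict_sample(f_mean, f_var, alpha_mean, alpha_var, rng=None):
    """One sample from the approximate predictive density at x_*."""
    rng = rng or np.random.default_rng()
    # Independent samples from the K shallow GP posteriors of the modes.
    f_hat = f_mean + np.sqrt(f_var) * rng.standard_normal(len(f_mean))
    # Sample the assignment processes and renormalize with the softmax.
    eps = rng.standard_normal(len(alpha_mean))
    alpha_hat = alpha_mean + np.sqrt(alpha_var) * eps
    a_hat = np.exp(alpha_hat - alpha_hat.max())
    a_hat /= a_hat.sum()
    # f_* is approximated by sum_k a_*^(k) * f_*^(k).
    return np.sum(a_hat * f_hat)
\end{verbatim}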

For clarity, we have described the variational bound in terms of a shallow GP.
However, as long as its variational bound can be sampled efficiently, any model can be used in place of shallow GPs for the $f^{\pix{k}}$.
Since our approximation is based on DSVI for deep Gaussian processes, an extension to deep GPs is straightforward.
Analogously to~\parencite{salimbeni_doubly_2017}, our new prior assumption about the $\nth{k}$ latent function values $\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}}$ is given by
\begin{align}
\begin{split}
\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}} = \prod_{l=1}^L \Prob*{\mat{F_l^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}}, \mat{F_{l-1}^{\prime\pix{k}}}, \mat{Z_l^{\prime\pix{k}}}}
\end{split}
\end{align}
where we identify $\mat{f_n^{\prime\pix{k}}} = \mat{f_{n,L}^{\prime\pix{k}}}$.
As the $\nth{n}$ marginal of the $\nth{L}$ layer depends only on the $\nth{n}$ marginals of all layers above, sampling from them remains straightforward~\parencite{salimbeni_doubly_2017}.
The marginal is given by
\begin{align}
\begin{split}
Second, we show how different priors on the different modes can be used to separate a signal from unrelated noise.
Finally, we investigate a data set which contains observations of two independent dynamical systems mixed together and show how the MDGP can recover information about both systems and infer the state variable separating the systems.

We use an implementation of MDGP in TensorFlow~\parencite{tensorflow2015whitepaper} based on GPflow~\parencite{matthews_gpflow_2017} and the implementation of DSVI~\parencite{salimbeni_doubly_2017}.

\subsection{Artificial Data Set}


\captionof{table}{
\label{tab:choicenet}
Results on the ChoiceNet data set.
The gray part of the table shows the results from~\parencite{choi_choicenet_2018}.
We report RMSE comparable to the previous results together with MLL.
Both are calculated based on a test set of 1000 equally spaced samples of the noiseless underlying function.
}%


\end{figure*}
%
Our second experiment applies MDGP to a one-dimensional regression problem with uniformly distributed outliers in the training data.
We use a task proposed by~\textcite{choi_choicenet_2018} where we sample inputs $x \in [-3, 3]$ uniformly and apply the function $\Fun{f}{x} = (1 - \delta)\Fun{\cos}{\sfrac{\pi}{2} \cdot x}\Fun{\exp}{-(\sfrac{x}{2})^2} + \delta \cdot \epsilon$, where $\delta \sim \Fun{\Ber}{\lambda}$ and $\epsilon \sim \Fun{\Uni}{-1, 3}$.
That is, a fraction $\lambda$ of the training data, the outliers, are replaced by uniform noise.
We sample a total of 1000 data points and use $25$ inducing points for every GP in our model.
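The data set can be generated with a few lines of NumPy; the target function and noise range follow the construction above, and the helper name is ours.
\begin{verbatim}
import numpy as np

def choicenet_data(n=1000, outlier_rate=0.4, rng=None):
    """Sample inputs and targets for the robust regression task."""
    rng = rng or np.random.default_rng()
    x = rng.uniform(-3.0, 3.0, size=n)
    signal = np.cos(0.5 * np.pi * x) * np.exp(-(x / 2.0) ** 2)
    noise = rng.uniform(-1.0, 3.0, size=n)      # epsilon ~ U(-1, 3)
    delta = outlier_rate > rng.random(n)        # delta ~ Ber(lambda)
    y = np.where(delta, noise, signal)
    return x, y
\end{verbatim}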

For robust regression, we use a bimodal model consisting of a mode with a squared exponential kernel and one with a white noise kernel.
To avoid pathological solutions for high outlier ratios, we add a prior to the likelihood variance of the first mode, which encodes our assumption that there actually is a signal in the training data.
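In terms of GPflow, this bimodal setup might look as follows; the kernel and likelihood classes are GPflow names, while the concrete prior parameters are only illustrative.
\begin{verbatim}
import gpflow
import tensorflow_probability as tfp

f64 = gpflow.utilities.to_default_float

# One squared exponential mode for the signal and one white noise
# mode for the outliers.
kernels = [gpflow.kernels.SquaredExponential(), gpflow.kernels.White()]

# Gaussian likelihood of the signal mode with a prior on its variance,
# encoding the assumption that there is a signal in the data.
signal_likelihood = gpflow.likelihoods.Gaussian()
signal_likelihood.variance.prior = tfp.distributions.Gamma(
    f64(2.0), f64(2.0))
\end{verbatim}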

The model proposed in~\parencite{choi_choicenet_2018}, called ChoiceNet (CN), is a specific neural network structure and inference algorithm to deal with corrupted data.
In their work, they compare their approach to a standard multilayer perceptron (MLP), a mixture density network (MDN), standard Gaussian process regression (GPR), leveraged Gaussian process regression (LGPR)~\parencite{choi_robust_2016}, and infinite mixtures of Gaussian processes (RGPR)~\parencite{rasmussen_infinite_2002}.
\Cref{tab:choicenet} shows results for outlier rates varied from 0\,\% to 80\,\%.
Besides the root mean squared error (RMSE), we also report the mean test log likelihood (MLL) of the mode representing the target function in our model.
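Assuming a Gaussian predictive density with moments \texttt{pred\_mean} and \texttt{pred\_var}, both metrics can be computed as in the following sketch.
\begin{verbatim}
import numpy as np

def rmse_and_mll(y_true, pred_mean, pred_var):
    """Root mean squared error and mean test log likelihood."""
    rmse = np.sqrt(np.mean((y_true - pred_mean) ** 2))
    mll = np.mean(-0.5 * np.log(2.0 * np.pi * pred_var)
                  - 0.5 * (y_true - pred_mean) ** 2 / pred_var)
    return rmse, mll
\end{verbatim}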

\bottomrule
\end{tabular}
\end{table*}
Our third experiment is based on the cart-pole benchmark for reinforcement learning as described by~\textcite{barto_neuronlike_1983} and implemented in OpenAI Gym~\parencite{brockman_openai_2016}.
In this benchmark, the objective is to apply forces to a cart moving on a frictionless track to keep a pole, which is attached to the cart via a joint, in an upright position.
We consider the regression problem of predicting the change of the pole's angle given the current state of the cart and the action applied.
The current state of the cart consists of the cart's position and velocity and the pole's angular position and velocity.

For this data set, we use squared exponential kernels for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and 100 inducing points in every GP.
We evaluate the performance of deep GPs with up to 5 layers and squared exponential kernels as models for the different modes.
As described in~\parencite{salimbeni_doubly_2017}, we use identity mean functions for all but the last layers and initialize the variational distributions with low covariances.
We compare our models with three-layer ReLU-activated Bayesian neural networks with added latent variables (BNN+LV) as introduced by~\textcite{depeweg_learning_2016}.
These latent variables can be used to effectively model multimodalities and stochasticity in dynamical systems for model-based reinforcement learning~\parencite{depeweg_decomposition_2018}.
We also compare to three kinds of sparse GPs~\parencite{hensman_scalable_2015}.
They are trained on the mixed data set, the default system and the short-pole system, respectively.

\Cref{tab:cartpole} shows mean training and test log likelihoods and their standard error over ten runs for these models.
