
Add non-breaking spaces before citation

Markus Kaiser 3 months ago
commit 3e12b05f71
1 changed file with 23 additions and 23 deletions
dynamic_dirichlet_deep_gp.tex

@@ -161,8 +161,8 @@ Interdependencies between the data points are introduced through the Gaussian pr
 
 The priors for the $f^{\pix{k}}$ can be chosen independently to encode different prior assumptions about the modes.
 In \cref{subsec:choicenet} we use different kernels to separate a non-linear signal from a noise process.
-Going further, we can also use deep Gaussian processes as priors for the $f^{\pix{k}}$ \parencite{damianou_deep_2013, salimbeni_doubly_2017}.
-Since many real word systems are inherently hierarchical, prior knowledge can often be formulated more easily using composite functions \parencite{kaiser_bayesian_2017}.
+Going further, we can also use deep Gaussian processes as priors for the $f^{\pix{k}}$~\parencite{damianou_deep_2013, salimbeni_doubly_2017}.
+Since many real word systems are inherently hierarchical, prior knowledge can often be formulated more easily using composite functions~\parencite{kaiser_bayesian_2017}.
 
 
 \section{Variational Approximation}
@@ -175,10 +175,10 @@ This bound can be sampled efficiently and allows us to optimize both the models
 
 \subsection{Variational Lower Bound}
 \label{subsec:lower_bound}
-As first introduced by \textcite{titsias_variational_2009}, we augment all Gaussian processes in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
+As first introduced by~\textcite{titsias_variational_2009}, we augment all Gaussian processes in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
 We collect them as $\mat{Z} = \Set*{\mat{Z^{\pix{k}}}, \mat{Z_\alpha^{\pix{k}}}}_{k=1}^K$ and $\mat{U} = \Set*{\mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K$.
 Taking the function $f^{\pix{k}}$ and its corresponding GP as an example, the inducing variables $\mat{u^{\pix{k}}}$ are jointly Gaussian with the latent function values $\mat{F^{\pix{k}}}$ of the observed data by the definition of GPs.
-We follow \parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
+We follow~\parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
 This formulation introduces the set $\Set*{\mat{Z^{\pix{k}}}, \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$ of variational parameters indicated in~\cref{fig:dynamic_graphical_model}.
 To simplify notation we drop the dependency on $\mat{Z}$ in the following.
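For reference, integrating out $\mat{u^{\pix{k}}}$ under this choice gives the familiar sparse-GP marginal. Written as a sketch in generic sparse-GP notation (the kernel matrices $\mathbf{K}_{nm}$, $\mathbf{K}_{mm}$, $\mathbf{K}_{nn}$ are standard textbook shorthand, not macros from this paper):

```latex
% Marginal of F^(k) after integrating out the inducing variables u^(k),
% with q(u^(k)) = N(m^(k), S^(k)); standard SVGP result, stated for reference only.
\begin{align}
\Variat*{\mat{F^{\pix{k}}}}
  = \Gaussian*{\mat{F^{\pix{k}}} \given
      \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mat{m^{\pix{k}}},\;
      \mathbf{K}_{nn} - \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}
      \left(\mathbf{K}_{mm} - \mat{S^{\pix{k}}}\right)
      \mathbf{K}_{mm}^{-1}\mathbf{K}_{mn}}
\end{align}
```

This closed form is the quantity the later hunks refer to as being computable analytically.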
 
@@ -206,7 +206,7 @@ Next, we discuss how to handle correlations between the different modes and the
 Given a set of assignments $\mat{A}$, this factorization along the data points is preserved in our model due to the assumed independence of the different modes in~\cref{eq:true_marginal_likelihood}.
 The independence is lost if the assignments are unknown.
 In this case, both the (a priori independent) assignment processes and the modes influence each other through data with unclear assignments.
-Following the ideas of Doubly Stochastic Variational Inference (DSVI) presented by \textcite{salimbeni_doubly_2017} in the context of deep Gaussian processes, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
+Following the ideas of Doubly Stochastic Variational Inference (DSVI) presented by~\textcite{salimbeni_doubly_2017} in the context of deep Gaussian processes, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
 That is, our variational posterior takes the factorized form,
 \begin{align}
 \begin{split}
@@ -247,10 +247,10 @@ This bound has complexity $\Fun*{\Oh}{NM^2K}$ to evaluate.
 
 \subsection{Optimization of the Lower Bound}
 \label{subsec:computation}
-An important property of the variational bound for DSVI described in \parencite{salimbeni_doubly_2017} is that taking samples for single data points is straightforward and can be implemented efficiently.
-Specifically, for some $k$ and $n$, samples $\mat{\hat{f}_n^{\pix{k}}}$ from $\Variat*{\mat{f_n^{\pix{k}}}}$ are independent of all other parts of the model and can be drawn using samples from univariate unit Gaussians using reparametrizations \parencite{kingma_variational_2015,rezende_stochastic_2014}.
+An important property of the variational bound for DSVI described in~\parencite{salimbeni_doubly_2017} is that taking samples for single data points is straightforward and can be implemented efficiently.
+Specifically, for some $k$ and $n$, samples $\mat{\hat{f}_n^{\pix{k}}}$ from $\Variat*{\mat{f_n^{\pix{k}}}}$ are independent of all other parts of the model and can be drawn using samples from univariate unit Gaussians using reparametrizations~\parencite{kingma_variational_2015,rezende_stochastic_2014}.
 
-While it would not be necessary to sample from the different modes, since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically \parencite{hensman_gaussian_2013}, we apply this idea to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$.
+While it would not be necessary to sample from the different modes, since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically~\parencite{hensman_gaussian_2013}, we apply this idea to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$.
 For $\mat{\alpha}$, the analytical propagation of uncertainties through the $\softmax$ renormalization and multinomial likelihoods is intractable but can easily be evaluated using sampling.
 
 We optimize $\Ell_{\text{MDGP}}$ to simultaneously recover maximum likelihood estimates of the hyperparameters $\mat{\theta}$, the variational parameters $\Set*{\mat{Z}, \mat{m}, \mat{S}}$, and assignments $\mat{A}$.
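A minimal sketch of the reparametrized sampling mentioned in this hunk, in NumPy with placeholder names (the marginal moments of $\Variat*{\mat{f_n^{\pix{k}}}}$ are assumed to be given; this is not the paper's implementation):

```python
import numpy as np

def sample_marginal(mean, var, rng=np.random.default_rng()):
    """Reparametrized draw from a diagonal Gaussian: mean + sqrt(var) * eps.

    `mean` and `var` stand in for the per-point marginal moments of q(f_n^(k));
    eps is a vector of univariate unit Gaussians, so the sample stays
    differentiable with respect to the variational parameters.
    """
    eps = rng.standard_normal(np.shape(mean))
    return mean + np.sqrt(var) * eps
```

The randomness is isolated in `eps` while gradients flow through `mean` and `var`, which is what keeps the per-data-point Monte Carlo estimates of the bound unbiased and cheap to compute.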
@@ -264,13 +264,13 @@ Most importantly, explaining data points as mixtures of modes can substantially
 Because of this, special care must be taken during optimization to enforce the sparsity of $\Variat*{\mat{a_n}}$.
 
 To avoid this problem, we propose using a different relaxation based on additional stochasticity.
-Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a Concrete random variable as suggested by \textcite{maddison_concrete_2016}, whose parameters are given by $\Variat*{\mat{a_n}}$.
+Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a Concrete random variable as suggested by~\textcite{maddison_concrete_2016}, whose parameters are given by $\Variat*{\mat{a_n}}$.
 Based on a temperature parameter $\lambda$, a Concrete random variable enforces sparsity but is also continuous and yields informative gradients using automatic differentiation.
 Samples from a Concrete random variable are unit vectors and for $\lambda \to 0$ their distribution approaches a discrete distribution.
 
 Our approximate evaluation of the bound in \cref{eq:variational_bound} during optimization has multiple sources of stochasticity, all of which are unbiased.
 First, we approximate the expectations using Monte Carlo samples $\mat{\hat{f}_n^{\pix{k}}}$, $\mat{\hat{\alpha}_n^{\pix{k}}}$, and $\mat{\hat{a}_n}$.
-And second, the factorization of the bound along the data allows us to use mini-batches for optimization \parencite{salimbeni_doubly_2017, hensman_gaussian_2013}.
+And second, the factorization of the bound along the data allows us to use mini-batches for optimization~\parencite{salimbeni_doubly_2017, hensman_gaussian_2013}.
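A minimal sketch of the Concrete (Gumbel-Softmax) sampling step this hunk describes, assuming the unnormalized log-probabilities of $\Variat*{\mat{a_n}}$ are available as `logits` (names are illustrative, not the paper's code):

```python
import numpy as np

def sample_concrete(logits, temperature, rng=np.random.default_rng()):
    """Relaxed one-hot sample from a Concrete distribution at the given temperature.

    Gumbel(0, 1) noise is added to the logits and the result is pushed through a
    tempered softmax; for temperature -> 0 the samples approach one-hot vectors,
    while larger temperatures keep gradients informative for automatic differentiation.
    """
    u = rng.uniform(size=np.shape(logits))
    gumbel = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    z = (np.asarray(logits) + gumbel) / temperature
    z = z - np.max(z)                               # numerical stability
    expz = np.exp(z)
    return expz / np.sum(expz)                      # entries sum to one
```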
 
 
 \subsection{Approximate Predictions}
@@ -284,7 +284,7 @@ Predictions for a test location $\mat{x_\ast}$ are mixtures of $K$ independent G
     &\approx \sum_{k=1}^K \hat{a}_\ast^{\pix{k}} \mat{\hat{f}_\ast^{\pix{k}}}.
 \end{split}
 \end{align}
-The predictive posteriors of the $K$ modes $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow Gaussian processes and can be calculated analytically \parencite{hensman_gaussian_2013}.
+The predictive posteriors of the $K$ modes $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow Gaussian processes and can be calculated analytically~\parencite{hensman_gaussian_2013}.
 Samples from the predictive density over $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ can be obtained by sampling from the Gaussian process posteriors $\Variat*{\mat{\alpha_\ast^{\pix{k}}} \given \mat{x_\ast}}$ and renormalizing the resulting vector $\mat{\alpha_\ast}$ using the $\softmax$-function.
 The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model's belief about how many and which of the $K$ modes are relevant at the test location $\mat{x_\ast}$.
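As a sketch of how such a predictive mixture can be assembled by sampling (placeholder names; the mode posteriors are assumed to provide their marginal means and variances at $\mat{x_\ast}$, and the $\alpha$ draws come from the GP posteriors described above):

```python
import numpy as np

def predict_mixture(f_means, f_vars, alpha_draws, rng=np.random.default_rng()):
    """Monte Carlo draws from the predictive mixture at a single test input.

    f_means, f_vars: arrays of shape (K,) with the marginal moments of the
    K mode posteriors at x_*.
    alpha_draws: array of shape (S, K) with samples from the alpha posteriors at x_*.
    """
    def softmax(v):
        e = np.exp(v - np.max(v))
        return e / np.sum(e)

    draws = []
    for alpha in alpha_draws:
        a = softmax(alpha)                          # assignment weights a_*
        f = f_means + np.sqrt(f_vars) * rng.standard_normal(len(f_means))
        draws.append(a @ f)                         # sum_k a_*^(k) f_*^(k)
    return np.array(draws)
```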
 
@@ -333,7 +333,7 @@ The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model
 For clarity, we have described the variational bound in terms of a shallow GP.
 However, as long as their variational bound can be efficiently sampled, any model can be used in place of shallow GPs for the $f^{\pix{k}}$.
 Since our approximation is based on DSVI for deep Gaussian processes, an extension to deep GPs is straightforward.
-Analogously to \parencite{salimbeni_doubly_2017}, our new prior assumption about the $\nth{k}$ latent function values $\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}}$ is given by
+Analogously to~\parencite{salimbeni_doubly_2017}, our new prior assumption about the $\nth{k}$ latent function values $\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}}$ is given by
 \begin{align}
 \begin{split}
     \Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}} = \prod_{l=1}^L \Prob*{\mat{F_l^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}} \mat{F_{l-1}^{\prime\pix{k}}}, \mat{Z_l^{\prime\pix{k}}}}
@@ -352,7 +352,7 @@ We collect the latent multi-layer function values as $\mat{F^\prime} = \Set{\mat
 \end{split}
 \end{align}
 where we identify $\mat{f_n^{\prime\pix{k}}} = \mat{f_{n,L}^{\prime\pix{k}}}$.
-As the $\nth{n}$ marginal of the $\nth{L}$ layer depends only on depends only on the $\nth{n}$ marginal of all layers above sampling from them remains straightforward \parencite{salimbeni_doubly_2017}.
+As the $\nth{n}$ marginal of the $\nth{L}$ layer depends only on depends only on the $\nth{n}$ marginal of all layers above sampling from them remains straightforward~\parencite{salimbeni_doubly_2017}.
 The marginal is given by
 \begin{align}
 \begin{split}
@@ -386,7 +386,7 @@ First, we apply the MDGP to an artificial data set and showcase how the differen
 Second, we show how different priors on the different modes can be used to separate a signal from unrelated noise.
 Finally, we investigate a data set which contains observations of two independent dynamical systems mixed together and show how the MDGP can recover information about both systems and infer the state variable separating the systems.
 
-We use an implementation of MDGP in TensorFlow \parencite{tensorflow2015-whitepaper} based on GPflow \parencite{matthews_gpflow_2017} and the implementation of DSVI \parencite{salimbeni_doubly_2017}.
+We use an implementation of MDGP in TensorFlow~\parencite{tensorflow2015-whitepaper} based on GPflow~\parencite{matthews_gpflow_2017} and the implementation of DSVI~\parencite{salimbeni_doubly_2017}.
 
 
 \subsection{Artificial Data Set}
@@ -413,7 +413,7 @@ At $x = -10$ both the two modes and the assignment processes start reverting to
     \captionof{table}{
         \label{tab:choicenet}
         Results on the ChoiceNet data set.
-        The gray part of the table shows the results from \parencite{choi_choicenet_2018}.
+        The gray part of the table shows the results from~\parencite{choi_choicenet_2018}.
         We report RMSE comparable to the previous results together with MLL.
         Both are calculated based on a test set of 1000 equally spaced samples of the noiseless underlying function.
     }%
@@ -460,7 +460,7 @@ At $x = -10$ both the two modes and the assignment processes start reverting to
 \end{figure*}
 %
 Our second experiment applies MDGP to a one-dimensional regression problem with uniformly distributed outliers in the training data.
-We use a task proposed by \textcite{choi_choicenet_2018} where we sample inputs $x \in [-3, 3]$ uniformly and apply the function $\Fun{f}{x} = (1 - \delta)\Fun{\cos}{\sfrac{\pi}{2} \cdot x}\Fun{\exp}{-(\sfrac{x}{2})^2} + \delta \cdot \epsilon$, where $\delta \sim \Fun{\Ber}{\lambda}$ and $\epsilon \sim \Fun{\Uni}{-1, 3}$.
+We use a task proposed by~\textcite{choi_choicenet_2018} where we sample inputs $x \in [-3, 3]$ uniformly and apply the function $\Fun{f}{x} = (1 - \delta)\Fun{\cos}{\sfrac{\pi}{2} \cdot x}\Fun{\exp}{-(\sfrac{x}{2})^2} + \delta \cdot \epsilon$, where $\delta \sim \Fun{\Ber}{\lambda}$ and $\epsilon \sim \Fun{\Uni}{-1, 3}$.
 That is, a fraction $\lambda$ of the training data, the outliers, are replaced by uniform noise.
 We sample a total of 1000 data points and use $25$ inducing points for every GP in our model.
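The data-generating process described in this hunk can be reproduced in a few lines of NumPy; the following is a sketch of the task for orientation, not the original benchmark code:

```python
import numpy as np

def sample_choicenet_task(n=1000, outlier_rate=0.4, rng=np.random.default_rng(0)):
    """Corrupted regression task: a fraction `outlier_rate` of the targets
    (delta ~ Ber(lambda)) is replaced by uniform noise eps ~ Uni(-1, 3)."""
    x = rng.uniform(-3.0, 3.0, size=n)
    signal = np.cos(0.5 * np.pi * x) * np.exp(-(x / 2.0) ** 2)
    delta = rng.binomial(1, outlier_rate, size=n)   # outlier indicator
    eps = rng.uniform(-1.0, 3.0, size=n)            # uniform noise
    y = (1 - delta) * signal + delta * eps
    return x, y
```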
 
@@ -468,8 +468,8 @@ Every mode in our model can use a different kernel and therefore encode differen
 For robust regression, we use a bimodal model consisting of a mode with a squared exponential kernel and one with a white noise kernel.
 To avoid pathological solutions for high outlier ratios, we add a prior to the likelihood variance of the first mode, which encodes our assumption that there actually is a signal in the training data.
 
-The model proposed in \parencite{choi_choicenet_2018}, called ChoiceNet (CN), is a specific neural network structure and inference algorithm to deal with corrupted data.
-In their work, they compare their approach to a standard multi-layer perceptron (MLP), a mixture density network (MDN), standard Gaussian process regression (GPR), leveraged Gaussian process regression (LGPR) \parencite{choi_robust_2016}, and infinite mixtures of Gaussian processes (RGPR) \parencite{rasmussen_infinite_2002}.
+The model proposed in~\parencite{choi_choicenet_2018}, called ChoiceNet (CN), is a specific neural network structure and inference algorithm to deal with corrupted data.
+In their work, they compare their approach to a standard multi-layer perceptron (MLP), a mixture density network (MDN), standard Gaussian process regression (GPR), leveraged Gaussian process regression (LGPR)~\parencite{choi_robust_2016}, and infinite mixtures of Gaussian processes (RGPR)~\parencite{rasmussen_infinite_2002}.
 \Cref{tab:choicenet} shows results for outlier rates varied from 0\,\% to 80\,\%.
 Besides the root mean squared error (RMSE), we also report the mean test log likelihood (MLL) of the mode representing the target function in our model.
 
@@ -520,7 +520,7 @@ While the function has still been identified well, some of the noise is also exp
       \bottomrule
     \end{tabular}
 \end{table*}
-Our third experiment is based on the cart-pole benchmark for reinforcement learning as described by \textcite{barto_neuronlike_1983} and implemented in OpenAI Gym \parencite{brockman_openai_2016}.
+Our third experiment is based on the cart-pole benchmark for reinforcement learning as described by~\textcite{barto_neuronlike_1983} and implemented in OpenAI Gym~\parencite{brockman_openai_2016}.
 In this benchmark, the objective is to apply forces to a cart moving on a frictionless track to keep a pole, which is attached to the cart via a joint, in an upright position.
 We consider the regression problem of predicting the change of the pole's angle given the current state of the cart and the action applied.
 The current state of the cart consists of the cart's position and velocity and the pole's angular position and velocity.
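For orientation, collecting such a regression data set from the standard Gym environment could look roughly like the following sketch (classic `gym` step API assumed; the paper's actual data, which also covers a modified short-pole system, is generated differently):

```python
import gym
import numpy as np

def collect_cartpole_transitions(n=1000):
    """Build (state, action) -> change-of-pole-angle pairs from random rollouts.

    The observation is (cart position, cart velocity, pole angle, pole angular
    velocity); the regression target is the one-step change of the pole angle.
    """
    env = gym.make("CartPole-v1")
    xs, ys = [], []
    obs = env.reset()
    while len(xs) < n:
        action = env.action_space.sample()
        next_obs, _, done, _ = env.step(action)
        xs.append(np.append(obs, action))    # current state and applied action
        ys.append(next_obs[2] - obs[2])      # change of the pole's angle
        obs = env.reset() if done else next_obs
    return np.array(xs), np.array(ys)
```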
@@ -539,10 +539,10 @@ They are generated by sampling trajectories with an aggregated size of 5000 poin
 
 For this data set, we use squared exponential kernels for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and 100 inducing points in every GP.
 We evaluate the performance of deep GPs with up to 5 layers and squared exponential kernels as models for the different modes.
-As described in \parencite{salimbeni_doubly_2017}, we use identity mean functions for all but the last layers and initialize the variational distributions with low covariances.
-We compare our models with three-layer relu-activated Bayesian neural networks with added latent variables (BNN+LV) as introduced by \textcite{depeweg_learning_2016}.
-These latent variables can be used to effectively model multimodalities and stochasticity in dynamical systems for model-based reinforcement learning \parencite{depeweg_decomposition_2018}.
-We also compare to three kinds of sparse GPs \parencite{hensman_scalable_2015}.
+As described in~\parencite{salimbeni_doubly_2017}, we use identity mean functions for all but the last layers and initialize the variational distributions with low covariances.
+We compare our models with three-layer relu-activated Bayesian neural networks with added latent variables (BNN+LV) as introduced by~\textcite{depeweg_learning_2016}.
+These latent variables can be used to effectively model multimodalities and stochasticity in dynamical systems for model-based reinforcement learning~\parencite{depeweg_decomposition_2018}.
+We also compare to three kinds of sparse GPs~\parencite{hensman_scalable_2015}.
 They are trained on the mixed data set, the default system and the short-pole system respectively.
 
 \Cref{tab:cartpole} shows mean training and test log likelihoods and their standard error over ten runs for these models.
