commit 6eda99c9fe (master)
Author: Markus Kaiser, 2 months ago

    Add Daniel's feedback

2 changed files with 48 additions and 48 deletions:
  1. dynamic_dirichlet_deep_gp.pdf (binary)
  2. dynamic_dirichlet_deep_gp.tex (+48, -48)

dynamic_dirichlet_deep_gp.pdf: binary file changed

dynamic_dirichlet_deep_gp.tex: 48 additions, 48 deletions
@@ -23,7 +23,7 @@
     \and
     Clemens Otte\inst{1}
     \and
-    Thomas Runkler\inst{1,2}
+    Thomas A. Runkler\inst{1,2}
     \and
     Carl Henrik Ek\inst{3}
 }
@@ -39,9 +39,9 @@
 \maketitle

 \begin{abstract}
-    The data association problem is concerned with separating data coming from different generating processes, for example when data come from different data sources, contain significant noise, or exhibit multimodality.
+    The data association problem is concerned with separating data coming from different generating processes, for example when data comes from different data sources, contain significant noise, or exhibit multimodality.
     We present a fully Bayesian approach to this problem.
-    Our model is capable of simultaneously solving the data association problem and the induced supervised learning problems.
+    Our model is capable of simultaneously solving the data association problem and the induced supervised learning problem.
     Underpinning our approach is the use of Gaussian process priors to encode the structure of both the data and the data associations.
     We present an efficient learning scheme based on doubly stochastic variational inference and discuss how it can be applied to deep Gaussian process priors.
     % \keywords{Bla \and blubb}
@@ -51,7 +51,7 @@
 \section{Introduction}
 \label{sec:introduction}
 Real-world data often include multiple operational regimes of the considered system, for example a wind turbine or gas turbine~\parencite{hein_benchmark_2017}.
-As an example, consider a model describing the lift resulting from airflow around the wing profile of an airplane as a function of attack angle.
+As an example, consider a model describing the lift resulting from airflow around the wing profile of an airplane as a function of the attack angle.
 At a low angle the lift increases linearly with attack angle until the wing stalls and the characteristic of the airflow fundamentally changes.
 Building a truthful model of such data requires learning two separate models and correctly associating the observed data to each of the dynamical regimes.
 A similar example would be if our sensors that measure the lift are faulty in a manner such that we either get an accurate reading or a noisy one.
@@ -59,7 +59,7 @@ Estimating a model in this scenario is often referred to as a \emph{data associa

 \Cref{fig:choicenet_data} shows an example of faulty sensor data, where sensor readings are disturbed by uncorrelated and asymmetric noise.
 Applying standard machine learning approaches to such data can lead to model pollution, where the expressive power of the model is used to explain noise instead of the underlying signal.
-Solving the data association problem by factorizing the data into signal and noise gives rise to a principled approach to avoiding this behavior.
+Solving the data association problem by factorizing the data into signal and noise gives rise to a principled approach to avoid this behavior.

 \begin{figure}[t]
     \centering
@@ -73,11 +73,11 @@ Early approaches to explaining data using multiple generative processes are base
 The assignment of data points to local experts is handled by a gating network, which learns a function from the inputs to assignment probabilities.
 However, it is still a central assumption of these models that at every position in the input space exactly one expert should explain the data.
 Another approach is presented in~\parencite{bishop_mixture_1994}, where the multimodal regression tasks are interpreted as a density estimation problem.
-A high number of candidate distributions are reweighed to match the observed data without modeling the underlying generative process.
+A high number of candidate distributions is reweighed to match the observed data without modeling the underlying generative process.

 In contrast, we are interested in a generative process, where data at the same location in the input space could have been generated by a number of global independent processes.
 Inherently, the data association problem is ill-posed and requires assumptions on both the underlying functions and the association of the observations.
-In~\parencite{lazaro-gredilla_overlapping_2012} the authors place Gaussian process priors on the different generative processes which are assumed to be relevant globally.
+In~\parencite{lazaro-gredilla_overlapping_2012} the authors place Gaussian process (GP) priors on the different generative processes which are assumed to be relevant globally.
 The associations are modelled via a latent association matrix and inference is carried out using an expectation maximization algorithm.
 This approach takes both the inputs and the outputs of the training data into account to solve the association problem.
 A drawback is that the model cannot give a posterior estimate about the relevance of the different generating processes at different locations in the input space.
@@ -85,17 +85,17 @@ This means that the model can be used for data exploration but additional inform
 Another approach in~\parencite{bodin_latent_2017} expands this model by allowing interdependencies between the different generative processes and formulating the association problem as an inference problem on a latent space and a corresponding covariance function.
 However, in this approach the number of components is a free parameter and is prone to overfitting, as the model has no means of turning off components.

-In this paper we formulate a Bayesian model for the data association problem.
-Underpinning our approach is the use of Gaussian process priors which encode structure both on the functions and the associations themselves, allowing us to incorporate the available prior knowledge about the proper factorization into the learning problem.
-The use of Gaussian process priors allows us to achieve principled regularization without reducing the solution space leading to a well-regularized learning problem.
+In this paper, we formulate a Bayesian model for the data association problem.
+Underpinning our approach is the use of GP priors which encode structure both on the functions and the associations themselves, allowing us to incorporate the available prior knowledge about the proper factorization into the learning problem.
+The use of GP priors allows us to achieve principled regularization without reducing the solution space leading to a well-regularized learning problem.
 Importantly, we simultaneously solve the association problem for the training data taking both inputs and outputs into account while also obtaining posterior belief about the relevance of the different generating processes in the input space.
 Our model can describe non-stationary processes in the sense that a different number of processes can be activated in different locations in the input space.
-We describe this non-stationary structure using additional Gaussian process priors which allows us to make full use of problem specific knowledge.
+We describe this non-stationary structure using additional GP priors which allows us to make full use of problem specific knowledge.
 This leads to a flexible yet interpretable model with a principled treatment of uncertainty.

 The paper has the following contributions:
-In \cref{sec:model}, we propose the data association with Gaussian Processes model (DAGP).
-In \cref{sec:variational_approximation}, we present an efficient learning scheme via a variational approximation which allows us to simultaneously train all parts of our model via stochastic optimization and show how the same learning scheme can be applied to deep Gaussian process priors.
+In \cref{sec:model}, we propose the data association with Gaussian processes model (DAGP).
+In \cref{sec:variational_approximation}, we present an efficient learning scheme via a variational approximation which allows us to simultaneously train all parts of our model via stochastic optimization and show how the same learning scheme can be applied to deep GP priors.
 We demonstrate our model on a noise separation problem, an artificial multimodal data set, and a multi-regime regression problem based on the cart-pole benchmark in \cref{sec:experiments}.


@@ -107,23 +107,23 @@ We demonstrate our model on a noise separation problem, an artificial multimodal
     \caption{
         \label{fig:dynamic_graphical_model}
         The graphical model of DAGP.
-        The violet observations $(\mat{x_n}, \mat{y_n})$ are generated by the green latent process.
+        The violet observations $(\mat{x_n}, \mat{y_n})$ are generated by the latent process (green).
         Exactly one of the $K$ latent functions $f^{\pix{k}}$ and likelihood $\mat{y_n^{\pix{k}}}$ are evaluated to generate $\mat{y_n}$.
         We can place shallow or deep GP priors on these latent function values $\mat{f_n^{\pix{k}}}$.
         The assignment $\mat{a_n}$ to a latent function is driven by input-dependent weights $\mat{\alpha_n^{\pix{k}}}$ which encode the relevance of the different functions at $\mat{x_n}$.
-        The different parts of the model are determined by the yellow hyperparameters and blue variational parameters.
+        The different parts of the model are determined by the hyperparameters $\mat{\theta}, \mat{\sigma}$ (yellow) and variational parameters $\mat{u}$ (blue).
     }
 \end{figure}
-The data association with Gaussian Processes (DAGP) model assumes that there exist $K$ independent functions $\Set*{f^{\pix{k}}}_{k=1}^K$, which generate pairs of observations $\D = \Set*{(\mat{x_n}, \mat{y_n})}_{n=1}^N$.
+The data association with Gaussian processes (DAGP) model assumes that there exist $K$ independent functions $\Set*{f^{\pix{k}}}_{k=1}^K$, which generate pairs of observations $\D = \Set*{(\mat{x_n}, \mat{y_n})}_{n=1}^N$.
 Each data point is generated by evaluating one of the $K$ latent functions and adding Gaussian noise from a corresponding likelihood.
 The assignment of the $\nth{n}$ data point to one of the functions is specified by the indicator vector $\mat{a_n} \in \Set*{0, 1}^K$, which has exactly one non-zero entry.
 Our goal is to formulate simultaneous Bayesian inference on the functions $f^{\pix{k}}$ and the assignments $\mat{a_n}$.

-For notational conciseness we collect all $N$ inputs as $\mat{X} = \left(\mat{x_1}, \ldots, \mat{x_N}\right)$ and all outputs as $\mat{Y} = \left(\mat{y_1}, \ldots, \mat{y_N}\right)$.
+For notational conciseness, we collect all $N$ inputs as $\mat{X} = \left(\mat{x_1}, \ldots, \mat{x_N}\right)$ and all outputs as $\mat{Y} = \left(\mat{y_1}, \ldots, \mat{y_N}\right)$.
 We further denote the $\nth{k}$ latent function value associated with the $\nth{n}$ data point as $\rv{f_n^{\pix{k}}} = \Fun{f^{\pix{k}}}{\mat{x_n}}$ and collect them as $\mat{F^{\pix{k}}} = \left( \rv{f_1^{\pix{k}}}, \ldots, \rv{f_N^{\pix{k}}} \right)$ and $\mat{F} = \left( \mat{F^{\pix{1}}}, \ldots, \mat{F^{\pix{K}}} \right)$.
 We refer to the $\nth{k}$ entry in $\mat{a_n}$ as $a_n^{\pix{k}}$ and denote $\mat{A} = \left(\mat{a_1}, \ldots, \mat{a_N}\right)$.

-Given this notation, the marginal likelihood of the DAGP can be separated into the likelihood, the latent function processes, and the assignment process and is given by,
+Given this notation, the marginal likelihood of DAGP can be separated into the likelihood, the latent function processes, and the assignment process and is given by,
 \begin{align}
 \begin{split}
     \label{eq:true_marginal_likelihood}
@@ -160,18 +160,18 @@ The prior on the assignments $\mat{A}$ is given by marginalizing the $\mat{\alph
 \end{split}
 \end{align}
 Modelling the relationship between the input and the associations allows us to efficiently model data, which, for example, is unimodal in some parts of the input space and bimodal in others.
-A simple smoothness prior will encode a belief for how quickly we believe the components switch across the input domain.
+A simple smoothness prior will encode a belief for how quickly the components switch across the input domain.

 Since the GPs of the $\mat{\alpha^{\pix{k}}}$ use a zero mean function, our prior assumption is a uniform distribution of the different generative processes everywhere in the input space.
 If inference on the $\mat{a_n}$ reveals that, say, all data points at similar positions in the input space can be explained by the same $\nth{k}$ process, the belief about $\mat{\alpha}$ can be adjusted to make a non-uniform distribution favorable at this position, thereby increasing the likelihood via $\Prob*{\mat{A} \given \mat{X}}$.
 This mechanism introduces an incentive for the model to use as few functions as possible to explain the data and importantly allows us to predict a relative importance of these functions when calculating the posterior of the new observations $\mat{x_\ast}$.

 \Cref{fig:dynamic_graphical_model} shows the resulting graphical model, which divides the generative process for every data point in the application of the latent functions on the left side and the assignment process on the right side.
-The interdependencies between the data points are introduced through the Gaussian process priors on $\rv{f_n^{\pix{k}}}$ and $\rv{\alpha_n^{\pix{k}}}$ and depend on the hyperparameters $\mat{\theta} = \Set*{\mat{\theta^{\pix{k}}}, \mat{\theta_\alpha^{\pix{k}}}, \sigma^{\pix{k}}}_{k=1}^K$.
+The interdependencies between the data points are introduced through the GP priors on $\rv{f_n^{\pix{k}}}$ and $\rv{\alpha_n^{\pix{k}}}$ and depend on the hyperparameters $\mat{\theta} = \Set*{\mat{\theta^{\pix{k}}}, \mat{\theta_\alpha^{\pix{k}}}, \sigma^{\pix{k}}}_{k=1}^K$.

 The priors for the $f^{\pix{k}}$ can be chosen independently to encode different prior assumptions about the underlying processes.
-In \cref{subsec:choicenet} we use different kernels to separate a non-linear signal from a noise process.
-Going further, we can also use deep Gaussian processes as priors for the $f^{\pix{k}}$~\parencite{damianou_deep_2013, salimbeni_doubly_2017}.
+In \cref{subsec:choicenet}, we use different kernels to separate a non-linear signal from a noise process.
+Going further, we can also use deep GP as priors for the $f^{\pix{k}}$~\parencite{damianou_deep_2013, salimbeni_doubly_2017}.
 Since many real word systems are inherently hierarchical, prior knowledge can often be formulated more easily using composite functions~\parencite{kaiser_bayesian_2018}.


@@ -185,7 +185,7 @@ This bound can be sampled efficiently and allows us to optimize both the models

 \subsection{Variational Lower Bound}
 \label{subsec:lower_bound}
-As first introduced by~\textcite{titsias_variational_2009}, we augment all Gaussian processes in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
+As first introduced by~\textcite{titsias_variational_2009}, we augment all GP in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
 We collect them as $\mat{Z} = \Set*{\mat{Z^{\pix{k}}}, \mat{Z_\alpha^{\pix{k}}}}_{k=1}^K$ and $\mat{U} = \Set*{\mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K$.
 Taking the function $f^{\pix{k}}$ and its corresponding GP as an example, the inducing variables $\mat{u^{\pix{k}}}$ are jointly Gaussian with the latent function values $\mat{F^{\pix{k}}}$ of the observed data by the definition of GPs.
 We follow~\parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
@@ -210,13 +210,13 @@ The variational posterior of a single GP can then be written as,
 \end{split}
 \end{align}
 which can be evaluated analytically, since it is a convolution of Gaussians.
-This formulation simplifies inference within single Gaussian processes.
+This formulation simplifies inference within single GPs.
 Next, we discuss how to handle the correlations between the different functions and the assignment processes.

 Given a set of assignments $\mat{A}$, this factorization along the data points is preserved in our model due to the assumed independence of the different functions in~\cref{eq:true_marginal_likelihood}.
 The independence is lost if the assignments are unknown.
 In this case, both the (a priori independent) assignment processes and the functions influence each other through data with unclear assignments.
-Following the ideas of Doubly Stochastic Variational Inference (DSVI) presented by~\textcite{salimbeni_doubly_2017} in the context of deep Gaussian processes, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
+Following the ideas of doubly stochastic variational inference (DSVI) presented by~\textcite{salimbeni_doubly_2017} in the context of deep GPs, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
 That is, our variational posterior takes the factorized form,
 \begin{align}
 \begin{split}
@@ -260,8 +260,8 @@ This bound has complexity $\Fun*{\Oh}{NM^2K}$ to evaluate.
 An important property of the variational bound for DSVI~\parencite{salimbeni_doubly_2017} is that taking samples for single data points is straightforward and can be implemented efficiently.
 Specifically, for some $k$ and $n$, samples $\mat{\hat{f}_n^{\pix{k}}}$ from $\Variat*{\mat{f_n^{\pix{k}}}}$ are independent of all other parts of the model and can be drawn using samples from univariate unit Gaussians using reparametrizations~\parencite{kingma_variational_2015,rezende_stochastic_2014}.

-While it would not be necessary to sample from the different function, since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically~\parencite{hensman_gaussian_2013}, we apply this idea to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$.
-For $\mat{\alpha}$, the analytical propagation of uncertainties through the $\softmax$ renormalization and multinomial likelihoods is intractable but can easily be evaluated using sampling.
+Note that it would not be necessary to sample from the different processes, since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically~\parencite{hensman_gaussian_2013}.
+However, we apply the sampling scheme to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$ as for $\mat{\alpha}$, the analytical propagation of uncertainties through the $\softmax$ renormalization and multinomial likelihoods is intractable but can easily be evaluated using sampling.

 We optimize $\Ell_{\text{DAGP}}$ to simultaneously recover maximum likelihood estimates of the hyperparameters $\mat{\theta}$, the variational parameters $\Set*{\mat{Z}, \mat{m}, \mat{S}}$, and assignments $\mat{A}$.
 For every $n$, we represent the belief about $\mat{a_n}$ as a $K$-dimensional discrete distribution $\Variat*{\mat{a_n}}$.
@@ -269,14 +269,14 @@ This distribution models the result of drawing a sample from $\Multinomial*{\mat

 Since we want to optimize $\Ell_{\text{DAGP}}$ using (stochastic) gradient descent, we need to employ a continuous relaxation to gain informative gradients of the bound with respect to the binary (and discrete) vectors $\mat{a_n}$.
 One straightforward way to relax the problem is to use the current belief about $\Variat*{\mat{a_n}}$ as parameters for a convex combination of the $\mat{f_n^{\pix{k}}}$, that is, to approximate $\mat{f_n} \approx \sum_{k=1}^K \Variat*{\mat{a_n^{\pix{k}}}}\mat{\hat{f}_n^{\pix{k}}}$.
-Using this relaxation causes multiple problems in practice.
-Most importantly, explaining data points as mixtures of the different generating processes can substantially simplify the learning problem while violating the modelling assumption that every data point was generated using exactly one function.
+Using this relaxation is problematic in practice.
+Explaining data points as mixtures of the different generating processes violates the modelling assumption that every data point was generated using exactly one function but can substantially simplify the learning problem.
 Because of this, special care must be taken during optimization to enforce the sparsity of $\Variat*{\mat{a_n}}$.

 To avoid this problem, we propose using a different relaxation based on additional stochasticity.
-Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a Concrete random variable as suggested by~\textcite{maddison_concrete_2016}, parameterized by $\Variat*{\mat{a_n}}$.
-Based on a temperature parameter $\lambda$, a Concrete random variable enforces sparsity but is also continuous and yields informative gradients using automatic differentiation.
-Samples from a Concrete random variable are unit vectors and for $\lambda \to 0$ their distribution approaches a discrete distribution.
+Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a concrete random variable as suggested by~\textcite{maddison_concrete_2016}, parameterized by $\Variat*{\mat{a_n}}$.
+Based on a temperature parameter $\lambda$, a concrete random variable enforces sparsity but is also continuous and yields informative gradients using automatic differentiation.
+Samples from a concrete random variable are unit vectors and for $\lambda \to 0$ their distribution approaches a discrete distribution.

 Our approximate evaluation of the bound in \cref{eq:variational_bound} during optimization has multiple sources of stochasticity, all of which are unbiased.
 First, we approximate the expectations using Monte Carlo samples $\mat{\hat{f}_n^{\pix{k}}}$, $\mat{\hat{\alpha}_n^{\pix{k}}}$, and $\mat{\hat{a}_n}$.
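
The Concrete relaxation touched in this hunk can be illustrated with a short sketch. The snippet below is a minimal NumPy illustration, not code from this repository: it draws a relaxed one-hot sample for a_n from a Concrete (Gumbel-Softmax) distribution parameterized by the current belief q(a_n) and a temperature lambda, and uses it to combine hypothetical per-process samples f_n^(k); all names and values are made up for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_concrete(q_a, temperature):
        # One relaxed one-hot sample from a Concrete distribution with
        # class probabilities q_a and temperature lambda (Maddison et al., 2016).
        gumbel = -np.log(-np.log(rng.uniform(size=q_a.shape)))   # Gumbel(0, 1) noise
        logits = (np.log(q_a) + gumbel) / temperature
        logits -= logits.max()                                   # numerical stability
        return np.exp(logits) / np.exp(logits).sum()             # softmax

    q_a = np.array([0.7, 0.2, 0.1])      # hypothetical belief q(a_n) over K = 3 processes
    f_hat = np.array([0.3, -1.2, 2.0])   # hypothetical samples of f_n^(k), one per process

    a_hat = sample_concrete(q_a, temperature=0.1)  # close to one-hot for small temperatures
    f_n = a_hat @ f_hat                  # relaxed combination instead of a hard assignment

For small temperatures the sample a_hat concentrates almost all of its mass on one entry, which keeps the relaxation close to the exactly-one-function assumption while still providing gradients with respect to q(a_n).
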
@@ -294,8 +294,8 @@ Predictions for a test location $\mat{x_\ast}$ are mixtures of $K$ independent G
     \approx \sum_{k=1}^K \hat{a}_\ast^{\pix{k}} \mat{\hat{f}_\ast^{\pix{k}}}.
 \end{split}
 \end{align}
-The predictive posteriors of the $K$ functions $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow Gaussian processes and can be calculated analytically~\parencite{hensman_gaussian_2013}.
-Samples from the predictive density over $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ can be obtained by sampling from the Gaussian process posteriors $\Variat*{\mat{\alpha_\ast^{\pix{k}}} \given \mat{x_\ast}}$ and renormalizing the resulting vector $\mat{\alpha_\ast}$ using the $\softmax$-function.
+The predictive posteriors of the $K$ functions $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow GPs and can be calculated analytically~\parencite{hensman_gaussian_2013}.
+Samples from the predictive density over $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ can be obtained by sampling from the GP posteriors $\Variat*{\mat{\alpha_\ast^{\pix{k}}} \given \mat{x_\ast}}$ and renormalizing the resulting vector $\mat{\alpha_\ast}$ using the $\softmax$-function.
 The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model's belief about how many and which of the $K$ generative processes are relevant at the test location $\mat{x_\ast}$ and their relative probability.

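The prediction path changed in this hunk can likewise be sketched in a few lines. This is an illustrative sketch with made-up numbers, not the GPflow-based implementation referenced later in the paper: it renormalizes samples of the assignment weights alpha_* with a softmax, draws a hard assignment, and selects the matching per-process prediction at a test input x_*.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical posterior samples at a test input x_*: one alpha and one f per process.
    alpha_star = rng.normal(loc=[1.5, -0.5, -2.0], scale=0.3)  # samples from q(alpha_*^(k) | x_*)
    f_star = rng.normal(loc=[0.4, -1.1, 2.2], scale=0.1)       # samples from q(f_*^(k) | x_*)

    # Renormalize alpha_* with a softmax to obtain assignment probabilities.
    weights = np.exp(alpha_star - alpha_star.max())
    weights /= weights.sum()

    # Draw one assignment and select the corresponding process prediction;
    # repeating this and averaging recovers the multimodal predictive mixture.
    k = rng.choice(len(weights), p=weights)
    y_star = f_star[k]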
 
@@ -303,11 +303,11 @@ The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model
 \label{subsec:deep_gp}
 For clarity, we have described the variational bound in terms of a shallow GP.
 However, as long as their variational bound can be efficiently sampled, any model can be used in place of shallow GPs for the $f^{\pix{k}}$.
-Since our approximation is based on DSVI for deep Gaussian processes, an extension to deep GPs is straightforward.
+Since our approximation is based on DSVI, an extension to deep GPs is straightforward.
 Analogously to~\parencite{salimbeni_doubly_2017}, our new prior assumption about the $\nth{k}$ latent function values $\Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}}$ is given by,
 \begin{align}
 \begin{split}
-    \Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}} = \prod_{l=1}^L \Prob*{\mat{F_l^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}} \mat{F_{l-1}^{\prime\pix{k}}}, \mat{Z_l^{\prime\pix{k}}}}
+    \Prob*{\mat{F^{\prime\pix{k}}} \given \mat{X}} = \prod_{l=1}^L \Prob*{\mat{F_l^{\prime\pix{k}}} \given \mat{u_l^{\prime\pix{k}}} \mat{F_{l-1}^{\prime\pix{k}}}, \mat{Z_l^{\prime\pix{k}}}},
 \end{split}
 \end{align}
 for an $L$-layer deep GP and with $\mat{F_0^{\prime\pix{k}}} \coloneqq \mat{X}$.
@@ -323,7 +323,7 @@ We collect the latent multi-layer function values as $\mat{F^\prime} = \Set{\mat
 \end{split}
 \end{align}
 where we identify $\mat{f_n^{\prime\pix{k}}} = \mat{f_{n,L}^{\prime\pix{k}}}$.
-As the $\nth{n}$ marginal of the $\nth{L}$ layer depends only on depends only on the $\nth{n}$ marginal of all layers above sampling from them remains straightforward~\parencite{salimbeni_doubly_2017}.
+As the $\nth{n}$ marginal of the $\nth{L}$ layer depends only on the $\nth{n}$ marginal of all layers above sampling from them remains straightforward~\parencite{salimbeni_doubly_2017}.
 The marginal is given by,
 \begin{align}
 \begin{split}
@@ -385,11 +385,11 @@ This extended bound thus has complexity $\Fun*{\Oh}{NM^2LK}$ to evaluate in the
         \bottomrule
     \end{tabularx}
 \end{table}
-In this section we investigate the behavior of the DAGP model.
+In this section, we investigate the behavior of the DAGP model.
 We use an implementation of DAGP in TensorFlow~\parencite{tensorflow2015-whitepaper} based on GPflow~\parencite{matthews_gpflow_2017} and the implementation of DSVI~\parencite{salimbeni_doubly_2017}.
 \Cref{tab:model_capabilities} compares qualitative properties of DAGP and related work.
 All models can solve standard regression problems and yield unimodal predictive distributions or, in case of multi-layer perceptrons (MLP), a single point estimate.
-Both standard Gaussian process regression (GPR) and MLP do not impose structure which enables the model to handle multi-modal data.
+Both standard Gaussian process regression (GPR) and MLP do not impose structure which enables the models to handle multi-modal data.
 Mixture density networks (MDN)~\parencite{bishop_mixture_1994} and the infinite mixtures of Gaussian processes (RGPR)~\parencite{rasmussen_infinite_2002} model yield multi-modal posteriors through mixtures with many components but do not solve an association problem.
 Similarly, Bayesian neural networks with added latent variables (BNN+LV)~\parencite{depeweg_learning_2016} represent such a mixture through a continuous latent variable.
 Both the overlapping mixtures of Gaussian processes (OMGP)~\parencite{lazaro-gredilla_overlapping_2012} model and DAGP explicitly model the data association problem and yield independent models for the different generating processes.
@@ -398,17 +398,17 @@ In contrast, DAGP infers a spacial posterior of this relevance.
 We evaluate our model on three problems to highlight the following advantages of the explicit structure of DAGP:

 \emph{Interpretable priors give structure to ill-posed data association problems.}
-In \cref{subsec:choicenet} we consider a noise separation problem, where a signal of interest is disturbed with uniform noise.
+In \cref{subsec:choicenet}, we consider a noise separation problem, where a signal of interest is disturbed with uniform noise.
 To solve this problem, assumptions about what constitutes a signal are needed.
 The hierarchical structure of DAGP allows us to formulate independent and interpretable priors on the noise and signal processes.

 \emph{Predictive associations represent knowledge about the relevance of generative processes.}
-In \cref{subsec:semi_bimodal} we investigate the implicit incentive of DAGP to explain data using as few processes as possible.
+In \cref{subsec:semi_bimodal}, we investigate the implicit incentive of DAGP to explain data using as few processes as possible.
 Additional to a joint posterior explaining the data, DAGP also gives insight into the relative importance of the different processes in different parts of the input space.
 DAGP is able to explicitly recover the changing number of modes in a data set.

 \emph{Separate models for independent generating processes avoid model pollution.}
-In \cref{subsec:cartpole} we simulate a system with multiple operational regimes via mixed observations of two different cart-pole systems.
+In \cref{subsec:cartpole}, we simulate a system with multiple operational regimes via mixed observations of two different cart-pole systems.
 DAGP successfully learns an informative joint posterior by solving the underlying association problem.
 We show that the DAGP posterior contains two separate models for the two original operational regimes.

@@ -526,7 +526,7 @@ We uniformly sample 350 data points in the interval $x \in [-2\pi, 2\pi]$ and ob
 The resulting data set $\D = \Set{\left( x, y_1 \right), \left( x, y_2 \right), \left( x, y_3 \right)}$ is trimodal in the interval $[0, 5]$ and is otherwise bimodal with one mode containing double the amount of data than the other.

 We use squared exponential kernels as priors for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and $25$ inducing points in every GP.
-\Cref{fig:semi_bimodal} shows the posterior of an DAGP with $K = 4$ modes applied to the data, which correctly identified the underlying functions.
+\Cref{fig:semi_bimodal} shows the posterior of a DAGP with $K = 4$ modes applied to the data, which correctly identified the underlying functions.
 The figure shows the posterior belief about the assignments $\mat{A}$ and illustrates that DAGP recovered that it needs only three of the four available modes to explain the data.
 One of the modes is only assigned points in the interval $[0, 5]$ where the data is actually trimodal.

@@ -538,7 +538,7 @@ The DAGP is implicitly incentivized to explain the data using as few modes as po
 At $x = -10$ the inferred modes and assignment processes start reverting to their respective priors away from the data.


-\subsection{Mixed Cart-Pole Systems}
+\subsection{Mixed Cart-pole Systems}
 \label{subsec:cartpole}
 \begin{table}[t]
     \centering
@@ -606,12 +606,12 @@ We consider three test sets, one sampled from the default system, one sampled fr
 They are generated by sampling trajectories with an aggregated size of 5000 points from each system for the first two sets and their concatenation for the mixed set.

 For this data set, we use squared exponential kernels for both the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and 100 inducing points in every GP.
-We evaluate the performance of deep GPs with up to 3 layers and squared exponential kernels as models for the different functions.
+We evaluate the performance of deep GPs with up to three layers and squared exponential kernels as models for the different functions.
 As described in~\parencite{salimbeni_doubly_2017,kaiser_bayesian_2018}, we use identity mean functions for all but the last layers and initialize the variational distributions with low covariances.
 We compare our models with OMGP and three-layer relu-activated Bayesian neural networks with added latent variables (BNN+LV).
 The latent variables can be used to effectively model multimodalities and stochasticity in dynamical systems for model-based reinforcement learning~\parencite{depeweg_decomposition_2018}.
-We also compare to three kinds of sparse GPs (GPR)~\parencite{hensman_scalable_2015}.
-They are trained on the mixed data set, the default system and the short-pole system respectively and serve as a baseline-comparison as these models cannot handle multi-modal data.
+We also compare DAGP to three kinds of sparse GPs (GPR)~\parencite{hensman_scalable_2015}.
+They are trained on the mixed data set, the default system and the short-pole system respectively and serve as a baseline comparison as these models cannot handle multi-modal data.

 \Cref{tab:cartpole} shows results for ten runs of these models.
 The GPR model predicts a unimodal posterior for the mixed data set which covers both systems.
@@ -636,7 +636,7 @@ The more expressive deep GPs model the default system slightly better while sacr
 We have presented a fully Bayesian model for the data association problem.
 Our model factorises the observed data into a set of independent processes and provides a model over both the processes and their association to the observed data.
 The data association problem is inherently ill-constrained and requires significant assumptions to recover a solution.
-In this paper, we make use of interpretable Gaussian process priors allowing global a priori information to be included into the model.
+In this paper, we make use of interpretable GP priors allowing global a priori information to be included into the model.
 Importantly, our model is able to exploit information both about the underlying functions and the association structure.
 We have derived a principled approximation to the marginal likelihood which allows us to perform inference for flexible hierarchical processes.
 In future work, we would like to incorporate the proposed model in a reinforcement learning scenario where we study a dynamical system with different operational regimes.
