You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

377 lines
24 KiB

% We use precompiled images and do not add tikz for speed of compilation.
% We set this for hyperref
\title{Dynamic Multi-Modal Deep Gaussian Processes}
\author{\href{}{Markus Kaiser}}
\aistatstitle{Dynamic Multi-Modal Deep Gaussian Processes}
\aistatsauthor{ Markus Kaiser \And Clemens Otte \And Thomas Runkler \And Carl Henrik Ek }
\aistatsaddress{ Technical University of Munich \\ Siemens AG \And Siemens AG \And Technical University of Munich \\ Siemens AG \And University of Bristol }]
Praedicimus ergo sumus.
\item Introduce the problem of multimodalities
\item Dynamical systems with changing conditions as an example?
\item Discuss other models and related work
\item Contributions
\item Relative mode-weights with $\mat{\alpha}$-GPs
\item Variational bound based on doubly stochastic DGPs
\item Relaxation for the attributions based on Concrete distributions.
\section{The Dynamic Multi-Modal Gaussian Process Model}
Our proposed graphical model.
We assume that our observations $(\mat{x_n}, \mat{y_n})$ are generated using exactly one of the $K$ latent functions $\rv{f^{\pix{k}}}$.
We want to learn a posterior about these functions and the assignments $\mat{a_n}$ which are in turn driven by an input-dependent weighing function $\mat{\alpha}$ which encodes the relative relevance of the different modes at a point in the input space.
We derive maximum likelihood estimations of the yellow hyperparameters and recover an explicit representation of all latent functions using the blue variational parameters.
The dynamic multi-modal Gaussian Process (DMGP) model assumes that there exist $K$ independent functions $\Set*{f^{\pix{k}}}_{k=1}^K$ (or \emph{modes}) which generate the pairs of observations $\D = \Set*{(\mat{x_n}, \mat{y_n})}_{n=1}^N$ in a data set.
Every data point is generated by evaluating one of the $K$ latent functions and adding Gaussian noise from a corresponding likelihood.
The assignment of the $\nth{n}$ data point to one of the modes is specified by the indicator vector $\mat{a_n} \in \Set*{0, 1}^K$ which has exactly one non-zero entry.
Our goal is to formulate simultaneous Bayesian inference on the functions $f^{\pix{k}}$ and the assignments $\mat{a_n}$.
For notational conciseness, we collect all inputs in the $N \times D_{\text{in}}$ matrix $\mat{X} = \left(\mat{x_1}, \ldots, \mat{x_N}\right)$ and all outputs in the $N \times D$ matrix $\mat{Y} = \left(\mat{y_1}, \ldots, \mat{y_N}\right)$.
We further denote the evaluation of the $\nth{k}$ latent function associated with the $\nth{n}$ data point as $\rv{f_n^{\pix{k}}} = \Fun{f^{\pix{k}}}{\mat{x_n}}$ and collect them analogously as $\mat{F^{\pix{k}}} = \left( \rv{f_1^{\pix{k}}}, \ldots, \rv{f_N^{\pix{k}}} \right)$ and $\mat{F} = \left( \mat{F^{\pix{1}}}, \ldots, \mat{F^{\pix{K}}} \right)$.
For consistency, we refer to the $\nth{k}$ entry in $\mat{a_n}$ as $a_n^{\pix{k}}$ and also collect these values as $\mat{A} = \left(\mat{a_1}, \ldots, \mat{a_N}\right)$.
Given the notation above, the marginal likelihood of the DMGP can be separated in the likelihood, the latent function processes and the assignment process and is given by
\Prob*{\mat{Y} \given \mat{X}} &= \\
\Prob*{\mat{Y} \given \mat{F}, \mat{A}}
\Prob*{\mat{F} \given \mat{X}}
\Prob*{\mat{A} \given \mat{X}}
\diff \mat{A} \diff \mat{F}\text{,} \\
\Prob*{\mat{Y} \given \mat{F}, \mat{A}} &= \prod_{k=1}^K \Gaussian*{\mat{Y} \given \given \mat{F^{\pix{k}}}, \left(\sigma^{\pix{k}}\right)^2\Eye}^{\mat{A^{\pix{k}}}},
where $\sigma^{\pix{k}}$ is the noise level of the $\nth{k}$ Gaussian likelihood.
Since we assume the $K$ modes to be independent given the data and assignments, we place independent GP priors on the latent functions,
\Prob*{\mat{F} \given \mat{X}} &= \prod_{k=1}^K \Gaussian*{\mat{F^{\pix{k}}} \given \Fun*{\mu^{\pix{k}}}{\mat{X}}, \Fun*{k^{\pix{k}}}{\mat{X}, \mat{X}}}.
Our prior on the assignment process is composite.
First, we assume that the $\mat{a_n}$ are drawn independently from multinomial distributions with logit parameters $\mat{\alpha_n} = \left( \alpha_n^{\pix{1}}, \ldots, \alpha_n^{\pix{K}} \right)$.
A usual approach to specifying the $\mat{\alpha_n}$ is to assume them to be known a priori and to be equal for all data points~\parencite{lazaro-gredilla_overlapping_2011}.
Instead, we want to infer them from the data.
Additionally, we assume they change dependent on the position in the input space.
This allows us to efficiently model data which, for example, is unimodal in some parts of the input space and bimodal in others.
We place independent GP priors on the $\mat{\alpha^{\pix{k}}}$ with
\Prob*{\mat{\alpha} \given \mat{X}} = \prod_{k=1}^K \Gaussian*{\rv{\alpha^{\pix{k}}} \given \mat{0}, \Fun{k_\alpha^{\pix{k}}}{\mat{X}, \mat{X}}}.
The prior on the assignments $\mat{A}$ is given by marginalizing the $\mat{\alpha^{\pix{k}}}$ which when normalized parametrize a batch of multinomial distributions
\Prob*{\mat{A} \given \mat{X}} &=
\Multinomial*{\mat{A} \given \Fun{\softmax}{\mat{\alpha}}} \Prob*{\mat{\alpha} \given \mat{X}}
\diff \rv{\alpha}.
Since the GPs of the $\mat{\alpha^{\pix{k}}}$ use a zero mean function, our prior assumption is a uniform distribution between the different modes everywhere in the input space.
If inference on the $\mat{a_n}$ reveals that, say, all data points at similar positions in the input space can be explained by the same $\nth{k}$ mode, the belief about $\mat{\alpha}$ can be adjusted to make a non-uniform mode distribution favourable at this position, thereby increasing the likelihood via $\Prob*{\mat{A} \given \mat{X}}$.
This mechanism introduces an incentive for the model to use as few modes as possible to explain the data and allows us to predict a relative importance of the modes when calculating the posterior of new observations $\mat{x^\ast}$.
\todo[inline]{Compare to other models?}
\Cref{fig:dynamic_graphical_model} shows the resulting graphical model which divides the generative process for every data point in the application of the latent functions on the left side and the assignment process on the right side.
Interdependencies between the data points are introduced through the Gaussian process priors on $\rv{f_n^{\pix{k}}}$ and $\rv{\alpha_n^{\pix{k}}}$ and depend on the hyperparameters $\mat{\theta} = \Set*{\mat{\theta^{\pix{k}}}, \mat{\theta_\alpha^{\pix{k}}}, \sigma^{\pix{k}}}_{k=1}^K$.
\section{Variational approximation}
Exact inference is intractable in this model.
Instead, we now formulate a variational approximation following ideas from~\parencite{hensman_gaussian_2013, salimbeni_doubly_2017}.
Because of the rich structure in our model, finding a variational lower bound which is both faithful and can be evaluated analytically is hard.
Instead, we formulate an approximation which factorizes along both the $K$ modes and $N$ data points.
This bound can be sampled efficiently and allows us to optimize both the models for the different modes $\Set*{f^{\pix{k}}}_{k=1}^K$ and our belief about the data assignments $\Set*{\mat{a_n}}_{n=1}^N$ simultaneously using stochastic optimization methods.
\subsection{Variational Lower Bound}
As first introduced by \textcite{titsias_variational_2009}, we augment all Gaussian processes in our model using sets of $M$ inducing points $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ and their corresponding function values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$, the inducing variables.
We collect them as $\mat{Z} = \Set*{\mat{Z^{\pix{k}}}, \mat{Z_\alpha^{\pix{k}}}}_{k=1}^K$ and $\mat{U} = \Set*{\mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K$.
Taking the function $f^{\pix{k}}$ and its corresponding GP as an example, the inducing variables $\mat{u^{\pix{k}}}$ are jointly Gaussian with the latent function values $\mat{F^{\pix{k}}}$ of the observed data by the definition of GPs.
We follow \parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
This formulation introduces the set $\Set*{\mat{Z^{\pix{k}}}, \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$ of variational parameters indicated in~\cref{fig:dynamic_graphical_model}.
To simplify notation we will drop the dependency on the inducing inputs $\mat{Z}$ in the following.
A central assumption of this approximation is that given enough well-placed inducing variables $\mat{u^{\pix{k}}}$, they are a sufficient statistic for the latent function values $\mat{F^{\pix{k}}}$.
This implies conditional independence of the $\mat{f_n^{\pix{k}}}$ given $\mat{u^{\pix{k}}}$ and $\mat{X}$.
With this assumption, the variational posterior of a single GP can be written as
\Variat*{\mat{F^{\pix{k}}} \given \mat{X}}
\int \Variat*{\mat{u^{\pix{k}}}}
\Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}}
\diff \mat{u^{\pix{k}}}
\int \Variat*{\mat{u^{\pix{k}}}}
\prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{x_n}}
\diff \mat{u^{\pix{k}}},
which can be evaluated analytically since it is a convolution of Gaussians.
This formulation simplifies inference within single Gaussian processes.
Next, we discuss how to handle correlations between the different modes and the assignment processes.
Given a set of assignments $\mat{A}$, this factorization along the data points is preserved in our model due to the assumed independence of the different modes in~\cref{eq:true_marginal_likelihood}.
The independence is lost if the assignments are unknown.
In this case, both the (a priori independent) assignment processes and the modes influence each other through data with unclear assignments.
Following the ideas of Doubly Stochastic Variational Inference (DSVI) presented by \textcite{salimbeni_doubly_2017} in the context of deep Gaussian processes, we maintain these correlations between different parts of the model while assuming factorization of the variational distribution.
That is, our variational posterior takes the factorized form
\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}
= &\Variat*{\mat{\alpha}, \Set*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}, \mat{u_\alpha^{\pix{k}}}}_{k=1}^K} \\
= &\prod_{k=1}^K\prod_{n=1}^N \Prob*{\mat{\alpha_n^{\pix{k}}} \given \mat{u_\alpha^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u_\alpha^{\pix{k}}}} \\
&\prod_{k=1}^K \prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{x_n}}\Variat*{\mat{u^{\pix{k}}}}.
Our goal is to recover a posterior for both the different modes and the data assignment.
To achieve this, instead of marginalizing $\mat{A}$, we consider the variational joint of $\mat{Y}$ and $\mat{A}$,
\MoveEqLeft\Variat*{\mat{Y}, \mat{A}} = \\
\Prob*{\mat{Y} \given \mat{F}, \mat{A}}
\Prob*{\mat{A} \given \mat{\alpha}}
\diff \mat{F} \diff \mat{\alpha},
which retains both the Gaussian likelihood of $\mat{Y}$ and the multinomial likelihood of $\mat{A}$ shown in \cref{eq:multinomial_likelihood}.
A lower bound $\Ell_{\text{DMGP}}$ for the log-joint $\log\Prob*{\mat{Y}, \mat{A} \given \mat{X}}$ of the DMGP is given by
\Ell_{\text{DMGP}} &= \Moment*{\E_{\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}}}{\log\frac{\Prob*{\mat{Y}, \mat{A}, \mat{F}, \mat{\alpha}, \mat{U} \given \mat{X}}}{\Variat*{\mat{F}, \mat{\alpha}, \mat{U}}}} \\
&= \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{f_n}}}}{\log \Prob*{\mat{y_n} \given \mat{f_n}, \mat{a_n}}} \\
&\quad + \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{\alpha_n}}}}{\log \Prob*{\mat{a_n} \given \mat{\alpha_n}}} \\
&\quad - \sum_{k=1}^K \KL{\Variat*{\mat{u^{\pix{k}}}}}{\Prob*{\mat{u^{\pix{k}}} \given \mat{Z^{\pix{k}}}}} \\
&\quad - \sum_{k=1}^K \KL{\Variat*{\mat{u_\alpha^{\pix{k}}}}}{\Prob*{\mat{u_\alpha^{\pix{k}}} \given \mat{Z_\alpha^{\pix{k}}}}}.
Due to the structure of~\cref{eq:variational_distribution}, the bound factorizes along the data, enabling stochastic optimization.
This bound has complexity $\Fun*{\Oh}{NM^2K}$ to evaluate.
\subsection{Optimization of the Lower Bound}
An important property of the variational bound for DSVI described in \parencite{salimbeni_doubly_2017} is that taking samples for single data points is straightforward and can be implemented efficiently.
Specifically, for some $k$ and $n$, samples $\mat{\hat{f}_n^{\pix{k}}}$ from $\Variat*{\mat{f_n^{\pix{k}}}}$ are independent of all other parts of the model and can be drawn using samples from univariate unit Gaussians using reparametrizations \parencite{kingma_variational_2015,rezende_stochastic_2014}.
While it would not be necessary to sample from the different modes since $\Variat*{\mat{F^{\pix{k}}}}$ can be computed analytically \parencite{hensman_gaussian_2013}, we apply the idea to the optimization of both the assignment processes $\mat{\alpha}$ and the assignments $\mat{A}$.
For $\mat{\alpha}$, the analytical propagation of uncertainties through the $\softmax$-renormalization and multinomial likelihoods is intractable but can easily be evaluated using sampling.
We optimize $\Ell_{\text{DMGP}}$ to simultaneously recover maximum likelihood estimates of the hyperparameters $\mat{\theta}$, the variational parameters $\Set*{\mat{Z}, \mat{m}, \mat{S}}$, and assignments $\mat{A}$.
For every $n$, we represent the belief about $\mat{a_n}$ as a $K$-dimensional discrete distribution $\Variat*{\mat{a_n}}$.
This distribution models the result of drawing a sample from $\Multinomial*{\mat{a_n} \given \Fun{\softmax}{\mat{\alpha_n}}}$ during the generation of the data point $(\mat{x_n}, \mat{y_n})$.
Since we want to optimize $\Ell_{\text{DMGP}}$ using (stochastic) gradient descent, we need to employ a continuous relaxation to gain informative gradients of the bound with respect to the binary (and discrete) vectors $\mat{a_n}$.
One straightforward way to relax the problem is to use the current belief $\Variat*{\mat{a_n}}$ as parameters for a convex combination of the $\mat{f_n^{\pix{k}}}$, that is, to approximate $\mat{f_n} \approx \sum_{k=1}^K \Variat*{\mat{a_n^{\pix{k}}}}\mat{\hat{f}_n^{\pix{k}}}$.
Using this relaxation causes multiple problems in practice.
Most importantly, explaining data points as mixtures of modes can substantially simplify the learning problem while violating the modelling assumption that every data point was generated using exactly one mode.
Because of this, special care must be taken during optimization to enforce the sparsity of $\Variat*{\mat{a_n}}$.
Predictive posterior.
Posterior assignment probability.
Semi-Bimodal data.
Semi bimodal attrib process.
To avoid this problem, we propose using a different relaxation based on additional stochasticity.
Instead of directly using $\Variat*{\mat{a_n}}$ to combine the $\mat{f_n^{\pix{k}}}$, we first draw a sample $\mat{\hat{a}_n}$ from a Concrete random variable as suggested by \textcite{maddison_concrete_2016} whose parameters are given by $\Variat*{\mat{a_n}}$.
Based on a temperature parameter $\lambda$, a Concrete random variable enforces sparsity but is also continuous and yields informative gradients using automatic differentiation.
Samples from a Concrete random variable are unit vectors and for $\lambda \to 0$ their distribution approaches a discrete distribution.
Our approximate evaluation of the bound in \cref{eq:variational_bound} during optimization has multiple sources of stochasticity, all of which are unbiased.
First, we approximate the expectations using Monte Carlo samples $\mat{\hat{f}_n^{\pix{k}}}$, $\mat{\hat{\alpha}_n^{\pix{k}}}$, and $\mat{\hat{a}_n}$.
And second, the factorization of the bound along the data allows us to use mini-batches for optimization \parencite{salimbeni_doubly_2017, hensman_gaussian_2013}.
\subsection{Approximate Predictions}
Predictions for a test location $\mat{x_\ast}$ are mixtures of $K$ independent Gaussians, given by
\Variat*{\mat{f_\ast} \given \mat{x_\ast}}
&= \int \sum_{k=1}^K \Variat*{a_\ast^{\pix{k}} \given \mat{x_\ast}} \Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}} \diff \mat{a_\ast^{\pix{k}}}\\
&\approx \sum_{k=1}^K \hat{a}_\ast^{\pix{k}} \mat{\hat{f}_\ast^{\pix{k}}}.
The predictive posteriors of the $K$ modes $\Variat*{\mat{f_\ast^{\pix{k}}} \given \mat{x_\ast}}$ are given by $K$ independent shallow Gaussian processes and can be calculated analytically \parencite{hensman_gaussian_2013}.
Samples from the predictive density over $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ can be obtained by sampling from the Gaussian process posteriors $\Variat*{\mat{\alpha_\ast^{\pix{k}}} \given \mat{x_\ast}}$ and renormalizing the resulting vector $\mat{\alpha_\ast}$ using the $\softmax$-function.
The distribution $\Variat*{\mat{a_\ast} \given \mat{x_\ast}}$ reflects the model's belief about how many and which of the $K$ modes are relevant at the test location $\mat{x_\ast}$.
In this section we investigate the behavior of the DMGP model in multiple regression settings.
First, we apply the DMGP to an artificial data set and showcase how the different components of the model interact to identify unimodal and multimodal parts of the input space.
Second, we show how different priors on the different modes can be used to separate a signal from unrelated noise.
\todo{Reformulate in accordance with the introduction}And third, we investigate a data set which contains observations of two independent dynamical systems mixed together and show how the DMGP can recover information about both systems.
We use an implementation of DMGP in TensorFlow \parencite{tensorflow2015-whitepaper} based on GPflow \parencite{matthews_gpflow_2017} and the implementation of doubly stochastic variational inference \parencite{salimbeni_doubly_2017}.
\subsection{Artificial data set}
To demonstrate inference in our model, we begin with an experiment based on an artificial data set.
The data, together with recovered posterior attributions, can be seen in \cref{fig:semi_bimodal:b}.
We uniformly sampled 350 data points in the interval $ x \in [-2\pi, 2\pi]$ and obtain $y$ as $y = \Fun{\sin}{x} - \delta \cdot 2 \Fun{\exp}{-0.5 \cdot (x-2)^2} + \epsilon$ with a Bernoulli-distributed $\delta \sim \Fun{\Ber}{0.5}$ to introduce a multi-modality and additive independent noise $\epsilon \sim \Gaussian*{0, 0.005^2}$.
We use squared exponential kernels as priors for both for the $f^{\pix{k}}$ and $\alpha^{\pix{k}}$ and $M=25$ pseudo-inputs in every GP.
\Cref{fig:semi_bimodal,fig:semi_bimodal:c} show the posterior of a bimodal DMGP applied to the data which correctly identified the underlying functions.
\Cref{fig:semi_bimodal:b} shows the posterior belief about about the assignments $\mat{A}$ and illustrates that DMGP separated the input space in a part which is unimodal where all points have been assigned to the same mode and a bimodal part where both modes are necessary to explain the data.
This distinction is explicitly represented in the model via the assignment processes $\mat{\alpha}$ shown in \cref{fig:semi_bimodal:c}.
The model has learned that the second mode is irrelevant at, say, $x=-5$ and predictions using \cref{eq:predictive_posterior} simplify to a standard GP posterior.
The DMGP is implicitly incentivsed to explain the data using only one mode if possible through the likelihood term of the inferred $\mat{a_n}$ in \cref{eq:variational_bound}.
At $x = -10$ it can be seen that both the two modes and the assignment processes start reverting to their respective priors away from the data.
\subsection{Robust Regression}
Choicenet joint.
Choicenet attrib.
Choicenet data.
Note: For 0\% outliers our model reverts to a single GP and therefore the RMSE is the same.
In the posterior, $\mat{\alpha^{\pix{2}}} = 0$.
Choicenet results.
Outliers & DMGP (MLL) & DMGP (RMSE) & CN & MDN & MLP & GPR & LGPR & RGPR \\
0\,\% & 2.86 & \textbf{0.008} & 0.034 & 0.028 & 0.039 & \textbf{0.008} & 0.022 & 0.017 \\
20\,\% & 2.71 & \textbf{0.008} & 0.022 & 0.087 & 0.413 & 0.280 & 0.206 & 0.013 \\
40\,\% & 2.12 & \textbf{0.005} & 0.018 & 0.565 & 0.452 & 0.447 & 0.439 & 1.322 \\
60\,\% & 0.874 & 0.031 & \textbf{0.023} & 0.645 & 0.636 & 0.602 & 0.579 & 0.738 \\
80\,\% & 0.126 & 0.128 & \textbf{0.084} & 0.778 & 0.829 & 0.779 & 0.777 & 1.523 \\
\item Our results for the choicenet data + comparison
\subsection{Cartpole data}
\item Two different cartpole instances, one with a shorter-than-normal pole.
\item Thrown together in a single data set
\item We can train a model which separates the two and gives very good performance
\item Compare to standard GP (bad), Bayesian NN?
\item Our model ist the best model there ever was or will be.