Add notes for the variational approximation

Markus Kaiser 3 years ago


@@ -30,7 +30,7 @@
\section{The Dynamic Multi-Modal Gaussian Process Model}
@@ -51,14 +51,14 @@
We use $K$ modes, each with an independent (deep) GP and a likelihood.
The modes are unordered.
\item[Mode models]
We denote the $k$-th latent GP function value associated with the $n$-th data pint as $\rv{f_n^{\pix{k}}}$.
We denote the $\nth{k}$ latent GP function value associated with the $\nth{n}$ data point as $\rv{f_n^{\pix{k}}}$.
Together with the likelihood we get mode-specific observations $\rv{y_n^{\pix{k}}}$.
We collect these values as $\mat{F^{\pix{k}}} = \left( \rv{f_1^{\pix{k}}}, \ldots, \rv{f_N^{\pix{k}}} \right) \in \R^{N \times D}$, a matrix containing all function values of mode $k$ for the data, and further collect those as $\mat{F} = \left( \mat{F^{\pix{1}}}, \ldots, \mat{F^{\pix{K}}} \right) \in \R^{K \times N \times D}$, a tensor of all function values of all modes.
We do the same for $\mat{Y}$.
\item[Data assignments]
Assigned to every data point $\mat{x_n}$ we have a mode-assignment vector $\mat{a_n} \in \R^K$.
This vector is a unit vector with exactly one nonzero entry, that is, a one-hot vector: one entry is $1$ and all others are $0$.
For notational consistency, we use $\mat{a_n^{\pix{k}}}$ to reference the $k$-th value in the vector.
For notational consistency, we use $\mat{a_n^{\pix{k}}}$ to reference the $\nth{k}$ value in the vector.
Just as before we denote $\mat{A} = \left( \mat{a_1}, \ldots, \mat{a_N} \right)$.
\item[Assignment probabilities]
Because the $\mat{a_n}$ are latent, we represent our Bayesian belief about them via distributions over such assignment vectors, that is, a batch of multinomials (denoted as $\Multi$, analogous to $\Norm$) with a single trial each, also known as categorical distributions.
@@ -83,6 +83,9 @@
Similarly, as currently formulated, each mode is homoskedastic.
We can easily add the same construct we use for the $\mat{\alpha^{\pix{k}}}$ to the noise variance and get heteroskedastic GPs.
I did not do this here as I wanted to keep the model simple, but maybe we want to change this?
\item[Predictive posterior]
Note that we not only recover the assignments of the observed data to the different modes: through the assignment GPs we can also predict the relative importance of the different modes at every point in the input space.
This gives us a measure of multimodality and a much richer posterior!
Joint marginal likelihood
@@ -104,12 +107,83 @@ with
\Prob*{\mat{A} \given \mat{X}} &=
\Multinomial*{\mat{A} \given \softmax\left(\rv{\alpha^{\pix{1}}}, \ldots, \rv{\alpha^{\pix{K}}}\right)} \\
\Multinomial*{\mat{A} \given \Fun{\softmax}{\rv{\alpha^{\pix{1}}}, \ldots, \rv{\alpha^{\pix{K}}}}} \\
&\qquad \prod_{k=1}^K \GP\Cond{\rv{\alpha^{\pix{k}}} \given 0, k_\alpha^{\pix{k}}}
\diff \rv{\alpha^{\pix{k}}}
\section{Variational approximation}
For the variational approximation, we augment every GP in the model along the lines of standard SVGP.
That is, using $\mat{f^{\pix{k}}}$ as an example, we add a set of $M$ inducing inputs $\mat{Z^{\pix{k}}} = \left(\mat{z_1^{\pix{k}}}, \ldots, \mat{z_M^{\pix{k}}}\right)$ with inducing values $\mat{u^{\pix{k}}} = \Fun*{f^{\pix{k}}}{\mat{Z^{\pix{k}}}}$.
The variational joint is $\Variat*{\mat{f^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{f^{\pix{k}}} \given \mat{u^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with a Gaussian $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
We thus have global parameters $\mat{Z^{\pix{k}}}$, $\mat{m^{\pix{k}}}$ and $\mat{S^{\pix{k}}}$ for every GP in the model.
Simplifying the notation somewhat, for every $k$ and analogously for the $\mat{\alpha^{\pix{k}}}$ we get:
\begin{align}
    \Variat*{\mat{F^{\pix{k}}} \given \mat{X}} &= \int \Variat*{\mat{u^{\pix{k}}}}
    \prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}}
    \diff \mat{u^{\pix{k}}}
\end{align}
Note the factorization along the data within each layer.
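For reference, the integral over $\mat{u^{\pix{k}}}$ is available in closed form.
The following is a sketch of the standard SVGP marginal for a single output dimension, assuming a zero-mean GP prior.
The kernel symbol $k_f^{\pix{k}}$ and the matrices $\mat{K_{nM}}$, $\mat{K_{Mn}}$ and $\mat{K_{MM}}$ are notation introduced only for this sketch: they denote $k_f^{\pix{k}}$ evaluated between $\mat{x_n}$ and $\mat{Z^{\pix{k}}}$, and between $\mat{Z^{\pix{k}}}$ and itself.
\begin{align}
    \Variat*{\mat{f_n^{\pix{k}}} \given \mat{X}}
        &= \Gaussian*{\mat{f_n^{\pix{k}}} \given \mu_n^{\pix{k}}, \left(\sigma_n^{\pix{k}}\right)^2}, \\
    \mu_n^{\pix{k}}
        &= \mat{K_{nM}} \mat{K_{MM}}^{-1} \mat{m^{\pix{k}}}, \\
    \left(\sigma_n^{\pix{k}}\right)^2
        &= k_f^{\pix{k}}(\mat{x_n}, \mat{x_n})
        - \mat{K_{nM}} \mat{K_{MM}}^{-1} \left( \mat{K_{MM}} - \mat{S^{\pix{k}}} \right) \mat{K_{MM}}^{-1} \mat{K_{Mn}}.
\end{align}
Each marginal depends only on $\mat{x_n}$ and the global parameters, which is what makes per-data-point sampling possible.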
\paragraph{Doubly Stochastic Variational Inference}
The difficulty with deep GPs lies in the correlations between the layers.
Nested variational inference handles these by adding further variational distributions for the intermediate layers.
Doubly stochastic variational inference (DSVI) avoids this.
Instead, we formulate a bound which has no analytic solution but which still factorizes along the data, thus enabling us to use sampling approaches.
The main assumption about the joint variational distribution is that it factorizes along the different GPs, that is
\begin{align}
\begin{split}
    \MoveEqLeft[0.5]\Variat*{\mat{F}, \mat{U}, \mat{\alpha}} \\
    &= \Variat*{\Set*{\mat{F^{\pix{k}}}, \mat{U^{\pix{k}}}, \mat{\alpha^{\pix{k}}}, \mat{U_\alpha^{\pix{k}}}}_{k=1}^K} \\
    &= \prod_{k=1}^K \Prob*{\mat{F^{\pix{k}}} \given \mat{U^{\pix{k}}}}\Variat*{\mat{U^{\pix{k}}}}\Prob*{\mat{\alpha^{\pix{k}}} \given \mat{U_\alpha^{\pix{k}}}}\Variat*{\mat{U_\alpha^{\pix{k}}}}.
\end{split}
\end{align}
This assumption is not too restrictive: given correct assignments of the data points, the modes factorize anyway \todo{Should think about that some more}.
Skipping a bit of the math, this factorization preserves the property that the marginal factorizes along the data, that is
\begin{align}
\begin{split}
    \MoveEqLeft\Variat*{\mat{Y} \given \mat{X}} = \\
    &\int \Prob*{\mat{Y} \given \mat{F}, \mat{A}}
    \Variat*{\mat{F} \given \mat{X}}
    \Variat*{\mat{A} \given \mat{X}}
    \diff \mat{A} \diff \mat{F},
\end{split}
\end{align}
with the standard likelihood term and
\begin{align}
\begin{split}
    \MoveEqLeft[1]\Variat*{\mat{F} \given \mat{X}} = \\
    &\prod_{k=1}^K \int \Variat*{\mat{u^{\pix{k}}}}\prod_{n=1}^N \Prob*{\mat{f_n^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}} \diff \mat{u^{\pix{k}}}
\end{split} \\
\begin{split}
    \MoveEqLeft[1]\Variat*{\mat{A} \given \mat{X}} = \\
    &\prod_{n=1}^{N} \Multinomial*{\mat{a_n} \given \Fun*{\softmax}{\rv{\alpha_n^{\pix{1}}}, \ldots, \rv{\alpha_n^{\pix{K}}}}} \\
    &\quad \prod_{k=1}^K \int \Variat*{\mat{u_\alpha^{\pix{k}}}} \Prob*{\mat{\alpha_n^{\pix{k}}} \given \mat{u_\alpha^{\pix{k}}}, \mat{X}} \diff \mat{u_\alpha^{\pix{k}}}\diff \mat{\alpha_n^{\pix{k}}}.
\end{split}
\end{align}
Note that we only need Jensen's inequality to move the logarithm inside the outermost integrals; the product of integrals inside $\Variat*{\mat{A} \given \mat{X}}$ can be left as it is.
We can sample from this by sampling from independent Gaussians for every data point (see DSVI paper).
In particular, we can sample from the assignment process in the same way.\todo{We should explain this in the paper}
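As a sketch of what this sampling step could look like for a single data point, following the reparameterization idea from DSVI: assume the marginal of each assignment GP at $\mat{x_n}$ is Gaussian with mean $\mu_{\alpha,n}^{\pix{k}}$ and standard deviation $\sigma_{\alpha,n}^{\pix{k}}$.
These two symbols, the noise variable $\epsilon_n^{\pix{k}}$ and the probability vector $\mat{\pi_n}$ are notation introduced only for this illustration.
\begin{align}
    \epsilon_n^{\pix{k}} &\sim \Norm\left(0, 1\right), \\
    \rv{\alpha_n^{\pix{k}}} &= \mu_{\alpha,n}^{\pix{k}} + \sigma_{\alpha,n}^{\pix{k}} \epsilon_n^{\pix{k}}, \\
    \mat{\pi_n} &= \Fun*{\softmax}{\rv{\alpha_n^{\pix{1}}}, \ldots, \rv{\alpha_n^{\pix{K}}}}, \\
    \mat{a_n} &\sim \Multi\left(\mat{\pi_n}\right).
\end{align}
The first two lines are differentiable with respect to the variational parameters; the last line draws a discrete sample and would need separate treatment if gradients through $\mat{a_n}$ are required.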
\paragraph{Variational bound}
The final variational bound is quite close to the version from DSVI.
\begin{align}
\begin{split}
    \Ell &= \Moment*{\E_{\Variat*{\mat{F}, \mat{A}, \mat{U}}}}{\log\frac{\Prob*{\mat{Y}, \mat{F}, \mat{U}, \mat{A}}}{\Variat*{\mat{F}, \mat{A}, \mat{U}}}} \\
    &= \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{f_n}, \mat{a_n}}}}{\log \Prob*{\mat{y_n} \given \mat{f_n}, \mat{a_n}}} \\
    &\quad - \sum_{k=1}^K \KL{\Variat*{\mat{U^{\pix{k}}}}}{\Prob*{\mat{U^{\pix{k}}}}} \\
    &\quad - \sum_{k=1}^K \KL{\Variat*{\mat{U_\alpha^{\pix{k}}}}}{\Prob*{\mat{U_\alpha^{\pix{k}}}}}.
\end{split}
\end{align}
Keep in mind that the expectation in this bound cannot be solved analytically, but it is easy to estimate by sampling.
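To make the split between analytic and sampled terms concrete, here is a hedged sketch.
The KL terms are between Gaussians and have the usual closed form; for a zero-mean prior $\Prob*{\mat{U^{\pix{k}}}}$ with covariance $\mat{K_{MM}}$ (notation assumed here) and a single output dimension,
\begin{align}
    \KL{\Variat*{\mat{U^{\pix{k}}}}}{\Prob*{\mat{U^{\pix{k}}}}}
    = \frac{1}{2} \left(
        \operatorname{tr}\left( \mat{K_{MM}}^{-1} \mat{S^{\pix{k}}} \right)
        + \left(\mat{m^{\pix{k}}}\right)^\top \mat{K_{MM}}^{-1} \mat{m^{\pix{k}}}
        - M
        + \log \frac{\det \mat{K_{MM}}}{\det \mat{S^{\pix{k}}}}
    \right).
\end{align}
The expected log-likelihood can be approximated by Monte Carlo with $T$ joint samples $\left(\mat{f_n}^{(t)}, \mat{a_n}^{(t)}\right)$ drawn from $\Variat*{\mat{f_n}, \mat{a_n}}$, optionally on a minibatch $B \subseteq \Set*{1, \ldots, N}$; both $T$ and $B$ are illustrative choices, not part of the model:
\begin{align}
    \sum_{n=1}^N \Moment*{\E_{\Variat*{\mat{f_n}, \mat{a_n}}}}{\log \Prob*{\mat{y_n} \given \mat{f_n}, \mat{a_n}}}
    \approx \frac{N}{|B|} \sum_{n \in B} \frac{1}{T} \sum_{t=1}^T
        \log \Prob*{\mat{y_n} \given \mat{f_n}^{(t)}, \mat{a_n}^{(t)}}.
\end{align}
This estimator is unbiased when $B$ is drawn uniformly at random, because the sum factorizes along the data.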


@@ -39,6 +39,7 @@
% Probabilities