
@@ 28,10 +28,10 @@
 \aistatsaddress{ Siemens AG \\ Technical University of Munich \And Siemens AG \And Siemens AG \\ Technical University of Munich \And University of Bristol }]
 
 \begin{abstract}
- We propose a novel Bayesian approach to modelling multimodal data generated by multiple independent processes, simultaneously solving the data assoiciation and induced supervised learning problems.
+ We propose a novel Bayesian approach to modelling multimodal data generated by multiple independent processes, simultaneously solving the data association and induced supervised learning problems.
 Underpinning our approach is the use of Gaussian process priors to encode structure both on the functions and the associations themselves.
 The association of samples and functions is determined taking both inputs and outputs into account, while also obtaining a posterior belief about the relevance of the global components throughout the input space.
- We present an efficient learning scheme based on doubly stochastic variational inference and discuss how the model can be extended to deep Gaussian process priors.
+ We present an efficient learning scheme based on doubly stochastic variational inference and discuss how it can be applied to deep Gaussian process priors.
 We show results for an artificial data set, a noise separation problem and a multimodal regression problem based on the cartpole benchmark.
 \end{abstract}
 



@@ 42,7 +42,7 @@ Estimating a function from data is a central aspect of machine learning and a ho
 Fundamentally, a function is an object relating an input value to a single output value, often represented as elements of finite-dimensional vector spaces.
 However, for some tasks, not all relevant input dimensions can be observed, meaning that for each input location, multiple outputs are possible due to changes in the missing information.
 One class of problems with these characteristics is dynamic systems, which can often be in multiple states of operation where the state itself is not observed.
-Examples include faulty sensors which, at any time, might emit a correct reading or uninformative noise\todo{noisy data reference?}, or industrial systems with accumulating latent effects which can induce bifurcation or hysteresis \parencite{hein_benchmark_2017}.
+Examples include faulty sensors which, at any time, might emit a correct reading or uninformative noise, or industrial systems with accumulating latent effects which can induce bifurcation or hysteresis \parencite{hein_benchmark_2017}.
 In this work, we will investigate a data set derived from the cartpole benchmark which contains trajectories of two instances of the benchmark with different pole lengths.
 This setup emulates an industrial system in which, for example due to wear or defective parts, the underlying dynamics change over time.
 In this setting, we want to recover both joint predictions, marginalizing over the current state of operation, and informative models for the separate states.


@@ 50,7 +50,7 @@ In this setting, we want to recover both joint predictions marginalizing the cur
 Estimating a model in this scenario is often referred to as a \emph{data association problem} \parencite{BarShalom:1987, Cox93areview} where both the different functions and the associations of the observations to a function need to be estimated.
 A simple example of this can be seen in \cref{fig:semi_bimodal:b}, where no single function could have generated the data.
 A slightly different view of the same problem is to consider the data to have been generated by a mixture of processes where we are interested in factorising the data into these components~\parencite{choi_choicenet_2018}.
-The separation of underlying signal and a noise process is an application of the latter, where we consider certain observations to be noise and others to be signal.
+The separation of underlying signal and a noise process is an application of the latter, where we consider certain observations to be noise and others to be signal\todo{noisy data reference?}.
 
 Early approaches to explaining data using multiple generative processes are based on separating the input space and training local expert models explaining easier subtasks~\parencite{jacobs_adaptive_1991,tresp_mixtures_2001, rasmussen_infinite_2002}.
 The assignment of data points to local experts is handled by a gating network, which learns a function from the inputs to assignment probabilities.


@@ 147,7 +147,7 @@ The prior on the assignments $\mat{A}$ is given by marginalizing the $\mat{\alph
 Modelling the relationship between the input and the associations allows us to efficiently model data which, for example, is unimodal in some parts of the input space and bimodal in others.
 A simple smoothness prior encodes a belief about how quickly the components switch across the input domain.
 
-Since the GPs of the $\mat{\alpha^{\pix{k}}}$ use a zero mean function, our prior assumption is a uniform distribution between the different modes everywhere in the input space.
+Since the GPs of the $\mat{\alpha^{\pix{k}}}$ use a zero mean function, our prior assumption is a uniform distribution of the different modes everywhere in the input space.
 If inference on the $\mat{a_n}$ reveals that, say, all data points at similar positions in the input space can be explained by the same $\nth{k}$ mode, the belief about $\mat{\alpha}$ can be adjusted to make a nonuniform mode distribution favorable at this position, thereby increasing the likelihood via $\Prob*{\mat{A} \given \mat{X}}$.
 This mechanism introduces an incentive for the model to use as few modes as possible to explain the data and allows us to predict a relative importance of the modes when calculating the posterior of new observations $\mat{x^\ast}$.
 
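The effect of the zero-mean $\mat{\alpha^{\pix{k}}}$ priors on the mode distribution can be illustrated numerically. The sketch below assumes a squared exponential kernel and a softmax link from the latent $\mat{\alpha^{\pix{k}}}$ values to assignment probabilities; neither the kernel choice nor the link function is specified in this excerpt, and all function names are hypothetical.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0, variance=1.0):
    # Squared exponential kernel: a standard smoothness prior (assumed here).
    sqdist = (x[:, None] - y[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def sample_assignment_probs(x, n_modes=2, seed=0):
    """Sample assignment probabilities implied by independent zero-mean
    GP priors alpha^(k) and a softmax link (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    # Gram matrix with jitter for a numerically stable Cholesky factor.
    K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    # One latent alpha-function per mode, drawn from the zero-mean prior.
    alphas = L @ rng.standard_normal((len(x), n_modes))
    # Softmax across modes yields per-point assignment probabilities;
    # with zero-mean priors, no mode is preferred on average.
    e = np.exp(alphas - alphas.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.linspace(0, 10, 50)
probs = sample_assignment_probs(x, n_modes=2)
```

Because the $\mat{\alpha^{\pix{k}}}$ are exchangeable under the prior, the sampled probabilities fluctuate smoothly around the uniform value $1/K$; inference then shifts them wherever one mode explains nearby data better.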



@@ 166,7 +166,7 @@ Exact inference is intractable in this model.
 Instead, we now formulate a variational approximation following ideas from~\parencite{hensman_gaussian_2013, salimbeni_doubly_2017}.
 Because of the rich structure in our model, finding a variational lower bound which is both faithful and can be evaluated analytically is hard.
 We therefore formulate an approximation which factorizes along both the $K$ modes and $N$ data points.
-This bound can be sampled efficiently and allows us to optimize both the models for the different modes $\Set*{f^{\pix{k}}}_{k=1}^K$ and our belief about the data assignments $\Set*{\mat{a_n}}_{n=1}^N$ simultaneously using stochastic optimization methods.
+This bound can be sampled efficiently and allows us to optimize both the models for the different modes $\Set*{f^{\pix{k}}}_{k=1}^K$ and our belief about the data assignments $\Set*{\mat{a_n}}_{n=1}^N$ simultaneously using stochastic optimization.
 
 \subsection{Variational Lower Bound}
 \label{subsec:lower_bound}
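A doubly stochastic evaluation of such a bound can be sketched as follows: one source of stochasticity is a data minibatch, the other is Monte Carlo sampling of the variational marginals. The sketch assumes a Gaussian likelihood and a single mode for simplicity; the function and parameter names (`elbo_estimate`, `q_mu`, `q_var`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def elbo_estimate(y_batch, q_mu, q_var, noise_var, n_total, kl, n_samples=10, seed=0):
    """Doubly stochastic ELBO estimate for one minibatch.

    q_mu, q_var are the (assumed precomputed) marginal means and variances
    of the variational posterior at the batch inputs."""
    rng = np.random.default_rng(seed)
    # Monte Carlo samples f_n ~ q(f_n) = N(q_mu, q_var).
    f = q_mu + np.sqrt(q_var) * rng.standard_normal((n_samples, len(y_batch)))
    # Gaussian log-likelihood log p(y_n | f_n), averaged over the samples.
    log_lik = -0.5 * (np.log(2 * np.pi * noise_var) + (y_batch - f) ** 2 / noise_var)
    expected_log_lik = log_lik.mean(axis=0).sum()
    # Rescale the batch term to the full data set and subtract the KL term
    # between the variational and prior distributions over inducing variables.
    return n_total / len(y_batch) * expected_log_lik - kl
```

The rescaling by `n_total / len(y_batch)` keeps the minibatch estimate unbiased for the full-data expected log-likelihood, which is what allows plain stochastic gradient methods to optimize the bound.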


@@ 175,11 +175,11 @@ We collect them as $\mat{Z} = \Set*{\mat{Z^{\pix{k}}}, \mat{Z_\alpha^{\pix{k}}}}
 Taking the function $f^{\pix{k}}$ and its corresponding GP as an example, the inducing variables $\mat{u^{\pix{k}}}$ are jointly Gaussian with the latent function values $\mat{F^{\pix{k}}}$ of the observed data by the definition of GPs.
 We follow \parencite{hensman_gaussian_2013} and choose the variational approximation $\Variat*{\mat{F^{\pix{k}}}, \mat{u^{\pix{k}}}} = \Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}, \mat{Z^{\pix{k}}}}\Variat*{\mat{u^{\pix{k}}}}$ with $\Variat*{\mat{u^{\pix{k}}}} = \Gaussian*{\mat{u^{\pix{k}}} \given \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$.
 This formulation introduces the set $\Set*{\mat{Z^{\pix{k}}}, \mat{m^{\pix{k}}}, \mat{S^{\pix{k}}}}$ of variational parameters indicated in~\cref{fig:dynamic_graphical_model}.
-To simplify notation we will drop the dependency on the inducing inputs $\mat{Z}$ in the following.
+To simplify notation we drop the dependency on $\mat{Z}$ in the following.
 
 A central assumption of this approximation is that, given enough well-placed inducing variables $\mat{u^{\pix{k}}}$, they are a sufficient statistic for the latent function values $\mat{F^{\pix{k}}}$.
 This implies conditional independence of the $\mat{f_n^{\pix{k}}}$ given $\mat{u^{\pix{k}}}$ and $\mat{X}$.
-With this assumption, the variational posterior of a single GP can be written as,
+The variational posterior of a single GP can then be written as,
 \begin{align}
 \begin{split}
 \Variat*{\mat{F^{\pix{k}}} \given \mat{X}}
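The hunk ends just as this marginal is introduced. For reference, marginalizing $\mat{u^{\pix{k}}}$ out of $\Prob*{\mat{F^{\pix{k}}} \given \mat{u^{\pix{k}}}, \mat{X}}\Variat*{\mat{u^{\pix{k}}}}$ yields the standard sparse GP marginal of \parencite{hensman_gaussian_2013}; the following is a sketch in the excerpt's notation, where the kernel matrices $\mat{K_{nn}}$, $\mat{K_{nm}}$, $\mat{K_{mm}}$ (between data and inducing inputs) are assumptions not defined in this excerpt:

```latex
\Variat*{\mat{F^{\pix{k}}} \given \mat{X}}
  = \Gaussian*{\mat{F^{\pix{k}}} \given \mat{\mu^{\pix{k}}}, \mat{\Sigma^{\pix{k}}}},
\quad \text{with} \quad
\mat{\mu^{\pix{k}}} = \mat{K_{nm}} \mat{K_{mm}^{-1}} \mat{m^{\pix{k}}},
\quad
\mat{\Sigma^{\pix{k}}} = \mat{K_{nn}} - \mat{K_{nm}} \mat{K_{mm}^{-1}} \left( \mat{K_{mm}} - \mat{S^{\pix{k}}} \right) \mat{K_{mm}^{-1}} \mat{K_{mn}}.
```

Both terms depend only on the variational parameters $\mat{m^{\pix{k}}}$ and $\mat{S^{\pix{k}}}$, so the marginal can be evaluated per data point, which is what enables the factorized, sampled bound described above.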


@@ 430,42 +430,27 @@ At $x = 10$ both the two modes and the assignment processes start reverting to
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_joint_40}
- \caption{
- \label{fig:choicenet:a}
- Joint posterior with 40\,\% outliers.
- }
 \end{subfigure}
 \hfill
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_attrib_40}
- \caption{
- \label{fig:choicenet:b}
- Assignment probabilities with 40\,\% outliers.
- }
 \end{subfigure}
- \\[.5\baselineskip]
+ \\
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_joint}
- \caption{
- \label{fig:choicenet:c}
- Joint posterior with 60\,\% outliers.
- }
 \end{subfigure}
 \hfill
 \begin{subfigure}{.495\linewidth}
 \centering
 \includestandalone{figures/choicenet_attrib}
- \caption{
- \label{fig:choicenet:d}
- Assignment probabilities with 60\,\% outliers.
- }
 \end{subfigure}
 \captionof{figure}{
 \label{fig:choicenet}
- The MDGP posterior on the ChoiceNet data set with 40\,\% outliers (upper row) and 60\,\% outliers (lower row).
- The bimodal MDGP identifies the underlying signal perfectly up to 40\,\% outliers.
+ MDGP on the ChoiceNet data set with 40\,\% outliers (upper row) and 60\,\% outliers (lower row).
+ We show the joint posterior (left) and assignment probabilities (right).
+ The bimodal MDGP identifies the signal perfectly up to 40\,\% outliers.
 For 60\,\% outliers, some of the noise is interpreted as signal, but the latent function is still recovered.
 }
 \end{figure*}
