
@@ 43,12 +43,11 @@

43

43

\maketitle

44

44


45

45

\begin{abstract}

46


 \todo{Reformulate abstract?}

47


 We present a Bayesian extension to convolution processes which defines a representation between multiple functions via an embedding in a shared latent space.


46

+ We propose a novel and Bayesian approach to finding nonlinear alignments of time series based on latent shared information.

48

47

The proposed model allows for both arbitrary alignments of the inputs and nonparametric output warpings to transform the observations.

49

48

This gives rise to multiple deep Gaussian process models connected via latent generating processes.

50


 We derive an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.

51


 The method is applied to both an artificial data set and measurements of a pair of wind turbines.


49

+ We present an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.


50

+ The method is applied to an artificial data set and realworld data of two wind turbines.

52

51

\end{abstract}

53

52


54

53



@@ 57,9 +56,9 @@ Many realworld systems are inherently hierarchical and connected.

57

56

Ideally, a machine learning method should model and recognize such dependencies.

58

57

Take wind power production, which is one of the major providers for renewable energy today, as an example:

59

58

To optimize the efficiency of a wind turbine the speed and pitch have to be controlled according to the local wind conditions (speed and direction).

60


In a wind farm typically one or several turbines are equipped with sensors for wind speed and direction.


59

+In a wind farm turbines are typically equipped with sensors for wind speed and direction.

61

60

The goal is to use these sensor data to produce accurate estimates and forecasts of the wind conditions at every turbine in the farm.

62


For the ideal case of a homogeneous and very slowly changing wind field, the wind conditions at each geometrical position in a wind farm can be estimated using the propagation times (time warps) computed from geometry, wind speed, and direction.


61

+For the ideal case of a homogeneous and very slowly changing wind field, the wind conditions at each geometrical position in a wind farm can be estimated using the propagation times (time warps) computed from geometry, wind speed, and direction \parencite{soleimanzadeh_controller_2011,bitar_coordinated_2013,schepers_improved_2007}.

63

62

In the real world, however, wind fields are not homogeneous, exhibit global and local turbulences, and interfere with the turbines and the terrain inside and outside the farm.

64

63

This makes it extremely difficult to construct accurate analytical models of wind propagation in a farm.

65

64

Also, standard approaches for extracting such information from data, e.g.\ generalized time warping \parencite{zhou_generalized_2012}, fail at this task because they rely on a high signal to noise ratio.


@@ 76,7 +75,7 @@ This MOGP acts as an interface to share information between the different GPs w

76

75

The paper has the following contributions:

77

76

In \cref{sec:model}, we propose a hierarchical, warped and aligned multioutput Gaussian process (AMOGP).

78

77

In \cref{sec:variational_approximation}, we present an efficient learning scheme via an approximation to the marginal likelihood which allows us to fully exploit the regularization provided by our structure, yielding highly interpretable results.

79


We show these properties for an artificial data set and observations for a pair of wind turbines in \cref{sec:experiments}.


78

+We show these properties for an artificial data set and for realworld data of two wind turbines in \cref{sec:experiments}.

80

79


81

80


82

81

\section{Model Definition}


@@ 84,7 +83,7 @@ We show these properties for an artificial data set and observations for a pair

84

83

We are interested in formulating shared priors over a set of functions $\Set{f_d}_{d=1}^D$ using GPs, thereby directly parameterizing their interdependencies.

85

84

In a traditional GP setting, multiple outputs are considered conditionally independent given the inputs, which significantly reduces the computational cost but also prevents the utilization of shared information.

86

85

Such interdependencies can be formulated via \emph{convolution processes (CPs)} as proposed by \textcite{boyle_dependent_2004}, a generalization of the \emph{linear model of coregionalization (LMC)} \parencite{journel_mining_1978,coburn_geostatistics_2000}.

87


In the CP framework, the output functions are the result of a convolution of the latent processes with smoothing kernel functions $T_{d,r}$ for each output $f_d$ and latent process $w_r$, it holds that $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x}  \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.


86

+In the CP framework, the output functions are the result of a convolution of the latent processes with smoothing kernel functions $T_{d,r}$ for each output $f_d$ and latent process $w_r$, defined as $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x}  \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.

88

87


89

88

In this model, the convolutions of the latent processes generating the different outputs are all performed around the same point $\mat{x}$.

90

89

We generalize this by allowing different \emph{alignments} of the observations which depend on the position in the input space.


@@ 101,7 +100,7 @@ The final model is then given by

101

100

\end{align}

102

101

where $f_d$ is equal to the CP model above and $a_d$ and $g_d$ are the respective alignment and warping functions and $\mat{\epsilon_d} \sim \Gaussian{0, \sigma_{y, d}^2\Eye}$ is a noise term.

103

102

This encodes the generative process described above:

104


For every turbine $\rv{y_d}$, observations at times $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the latent wind front in $f_d$, imposing turbinespecific components $g_d$ and adding noise through $\rv{\epsilon_d}$.


103

+For every turbine $\rv{y_d}$, observations at positions $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the latent wind front in $f_d$, imposing turbinespecific components $g_d$ and adding noise through $\rv{\epsilon_d}$.

105

104


106

105

Because we assume independence between $a_d$ and $g_d$ across outputs, we use GP priors of the form $a_d \sim \GP(\id, k_{a, d})$ and $g_d \sim \GP(\id, k_{g, d})$.

107

106

By setting the prior mean to the identity function $\id(x) = x$, we use the standard CP model as our default assumption.


@@ 122,7 +121,7 @@ With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyper pa

122

121

\MoveEqLeft[1] \Moment{\cov}{f_d(\mat{x}), f_{d^\prime}(\mat{x^\prime})} = \sum_{r=1}^R \frac{(2\pi)^{\frac{K}{2}}\sigma_{d, r}\sigma_{d^\prime, r}}{\prod_{k=1}^K \hat{\ell}_{d, d^\prime, r, k}\inv} \Fun*{\exp}{\frac{1}{2} \sum_{k=1}^K \frac{(x_k  x^\prime_k)^2}{\hat{\ell}_{d, d^\prime, r, k}^2}},

123

122

\end{split}

124

123

\end{align}

125


with $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\ell_{d, r, k}^2 + \ell_{d^\prime, r, k}^2}$.


124

+where $\mat{x}$ is $K$dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\ell_{d, r, k}^2 + \ell_{d^\prime, r, k}^2}$.

126

125


127

126

\begin{figure}[t]

128

127

\centering


@@ 214,11 +213,13 @@ For notational simplicity, we only consider the case of one single latent proces

214

213

Since the latent processes are independent, the results can easily be generalized to multiple processes.

215

214

Then, $\psi_f$ is given by

216

215

\begin{align}


216

+ \label{eq:psi0}

217

217

\psi_f = \Moment*{\E_{\Variat{\rv{a}}}}{\Fun*{\tr}{\mat{K_{ff}}}} = \sum_{n=1}^N \hat{\sigma}_{nn}^2.

218

218

\end{align}

219

219

Similar to the notation $\Fun{\hat{f}}{\cdot}$, we use the notation $\hat{\sigma}_{nn^\prime}$ to mean the variance term associated with the covariance function $\Moment{\cov}{\Fun{\hat{f}}{\mat{a_n}}, \Fun{\hat{f}}{\mat{a_{n^\prime}}}}$ as shown in \cref{eq:dependent_kernel}.

220

220

The expectation $\mat{\Psi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{fu}}}$ connecting the alignments and the pseudo inputs is given by

221

221

\begin{align}


222

+ \label{eq:psi1}

222

223

\begin{split}

223

224

\left( \mat{\Psi_f} \right)_{ni} &= \hat{\sigma}_{ni}^2 \sqrt{\frac{(\mat{\Sigma_a})_{nn}\inv}{\hat{\ell}_{ni} + (\mat{\Sigma_a})_{nn}\inv}}

224

225

\exp\left(\frac{1}{2} \frac{(\mat{\Sigma_a})_{nn}\inv\hat{\ell}_{ni}}{(\mat{\Sigma_a})_{nn}\inv + \hat{\ell}_{ni}} \left((\mat{\mu_a})_n  \mat{u_i}\right)^2\right),


@@ 227,6 +228,7 @@ The expectation $\mat{\Psi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{fu}}}$ co

227

228

where $\hat{\ell}_{ni}$ is the combined length scale corresponding to the same kernel as $\hat{\sigma}_{ni}$.

228

229

Lastly, $\mat{\Phi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}$ connects alignments and pairs of pseudo inputs with the closed form

229

230

\begin{align}


231

+ \label{eq:psi2}

230

232

\begin{split}

231

233

\left( \mat{\Phi_f} \right)_{ij} &= \sum_{n=1}^N \hat{\sigma}_{ni}^2 \hat{\sigma}_{nj}^2 \sqrt{\frac{(\mat{\Sigma_a})_{nn}\inv}{\hat{\ell}_{ni} + \hat{\ell}_{nj} + (\mat{\Sigma_a})_{nn}\inv}}

232

234

\exp\left( \frac{1}{2} \vphantom{\left(\frac{\hat{\ell}}{\hat{\ell}}\right)^2} \frac{\hat{\ell}_{ni}\hat{\ell}_{nj}}{\hat{\ell}_{ni} + \hat{\ell}_{nj}} (\mat{u_i}  \mat{u_j})^2 \right. \\


@@ 246,7 +248,7 @@ The graphical model shown in \cref{fig:graphical_model_supervised} illustrates t

246

248

This CP acts as an interface to share information between the different GPs which are otherwise conditionally independent.

247

249

This modellingchoice introduces a new quality to the model when compared to standard deep GPs with multiple output dimensions, since the latter are not able in principle to learn dependencies between the different outputs.

248

250

Compared to standard multioutput GPs, the AMOGP introduces more flexibility with respect to the shared information.

249


CPs as introduced by \textcite{boyle_dependent_2004} make strong assumptions about the relative alignments of the different outputs, that is, they assume constant timeoffsets.


251

+CPs make strong assumptions about the relative alignments of the different outputs, that is, they assume constant timeoffsets.

250

252

The AMOGP extends on this by introducing a principled Bayesian treatment of general nonlinear alignments $a_d$ on which we can place informative priors derived from the problem at hand.

251

253

Together with the warping layers $g_d$, our model can learn to share knowledge in an informative latent space learnt from the data.

252

254



@@ 254,7 +256,7 @@ In order to derive an inference scheme, we need the ability to propagate uncerta

254

256

We adapted the approach of nested variational compression by \textcite{hensman_nested_2014}, which is originally concerned with a single deep GP.

255

257

The approximation is expanded to handle multiple GPs at once, yielding the bound in \cref{eq:full_bound}.

256

258

The bound reflects the dependencies of the different outputs as the sharing of information between the different deep GPs is approximated through the shared inducing variables $\rv{u_{f,d}}$.

257


Our main contribution for the inference scheme is the derivation of a closedform solution for the $\Psi$statistics of the convolution kernel.


259

+Our main contribution for the inference scheme is the derivation of a closedform solution for the $\Psi$statistics of the convolution kernel in \cref{eq:psi0,eq:psi1,eq:psi2}.

258

260


259

261


260

262

\section{Experiments}


@@ 292,7 +294,7 @@ Our main contribution for the inference scheme is the derivation of a closedfor

292

294

\centering

293

295

\includestandalonewithpath{figures/toy_decomposition_ours}

294

296

\caption{

295


 AMOGP with (dependent) RBF kernels.


297

+ AMOGP (Ours) with (dependent) RBF kernels.

296

298

\label{fig:toy_model_decomposition:d}

297

299

}

298

300

\end{subfigure}


@@ 309,7 +311,7 @@ In this setting, we observe multiple time series $\T_i = (\mat{X_i}, \mat{y_i})$

309

311


310

312

We will first apply the AMOGP to an artificial data set in which we define a decomposed system of dependent time series by specifying a shared latent function generating the observations together with relative alignments and warpings for the different time series.

311

313

We will show that our model is able to recover this decomposition from the training data and compare the results to other approaches of modeling the data.

312


In the next section we focus on a real world data set of a neighbouring pair of wind turbines in a wind farm, where the model is able to recover a representation of the latent prevailing wind condition and the relative timings of wind fronts hitting the two turbines.


314

+Then we focus on a real world data set of a neighbouring pair of wind turbines in a wind farm, where the model is able to recover a representation of the latent prevailing wind condition and the relative timings of wind fronts hitting the two turbines.

313

315


314

316

\subsection{Artificial data set}

315

317

\label{subsec:artificial_example}


@@ 334,7 +336,7 @@ Otherwise, the model could easily get stuck in local minima like choosing the up

334

336

Additionally, our assumption of identity mean functions prevents pathological cases in which the complete model collapses to a constant function.

335

337


336

338

\Cref{fig:toy_model_decomposition:d} shows the AMOGP's recovered function decomposition and joint predictions.

337


The model succesfully recovered a shared latent dampened sine function, a sigmoid warping for the first time series and an approximate quadratic alignment function for the second time series.


339

+The model successfully recovered a shared latent dampened sine function, a sigmoid warping for the first time series and an approximate quadratic alignment function for the second time series.

338

340

In \cref{fig:toy_model_decomposition:a,fig:toy_model_decomposition:b,fig:toy_model_decomposition:c}, we show the training results of a standard GP, a multioutput GP and a threelayer deep GP on the same data.

339

341

For all of these models, we used RBF kernels and, in the case of the deep GP, applied priors similar to our model in order to avoid pathological cases.

340

342

In \cref{tab:toy_model_log_likelihoods} we report test loglikelihoods for the presented models, which illustrate the qualitative differences between the models.


@@ 358,7 +360,6 @@ Because of this, the model cannot recover the latent sine function and can only

358

360

The joint posterior for two time series $\rv{y_1}$ and $\rv{y_2}$ of power production for a pair of wind turbines.

359

361

The training data and dashed missing data are shown in grey.

360

362

The AMOGP recovers an uncertain relative alignment of the two time series.

361


 High uncertainty about the alignment is placed in areas where multiple explanations are plausible.

362

363

}

363

364

\end{figure}

364

365

\begin{figure}[t]


@@ 382,9 +383,8 @@ Because of this, the model cannot recover the latent sine function and can only

382

383

\end{subfigure}

383

384

\caption{

384

385

\label{fig:wind_samples}

385


 A comparison of noiseless samples drawn from a MOGP and the AMOGP, which achieve comparable test loglikelihoods.


386

+ A comparison of noiseless samples drawn from a MOGP and the AMOGP.

386

387

The MOGP smoothens the data and models local variation as noise, while our model picks up on local structure and predicts the maximum in the center.

387


 Samples drawn from our model show that it is certain about the maximum but uncertain about the timing due to multiple plausible alignments.

388

388

}

389

389

\end{figure}

390

390

\begin{table}[t]


@@ 394,7 +394,7 @@ Because of this, the model cannot recover the latent sine function and can only

394

394

Testloglikelihoods for the models presented in \cref{sec:experiments}.

395

395

}

396

396

\newcolumntype{Y}{>{\centering\arraybackslash}X}

397


 \begin{tabularx}{.95\linewidth}{rrYYYY}


397

+ \begin{tabularx}{\linewidth}{rrYYYY}

398

398

\toprule

399

399

Experiment & Test set & GP & MOGP & DGP & AMOGP (Ours) \\

400

400

\midrule


@@ 422,11 +422,10 @@ Modelling the turbinespecific parts of the signals is not the objective, so the

422

422

We use a squared exponential kernel as a prior for the alignment functions $\rv{a_i}$ and as smoothing kernels in $\rv{f}$.

423

423

For the given data set we can assume the output warpings $\rv{g_i}$ to be linear functions because there is only one dimension, the power generation, which in this data set is of similar shape for both turbines.

424

424

Again we encode a preference for alignments with slow dynamics with a prior on the length scales of $\rv{a_i}$.

425




426

425

As the signal has turbinespecific autoregressive components, plausible alignments are not unique.

427

426

To constrain the AMOGP, we want it to prefer alignments close to the identity function which we chose as a prior mean function.

428


In nonhierarchical models, this can be achieved by placing a prior on the kernel's variance preferring smaller $\sigma_a^2$.

429

427



428

+% In nonhierarchical models, this can be achieved by placing a prior on the kernel's variance preferring smaller $\sigma_a^2$.

430

429

% In our case, the posterior space of the alignment process $\rv{a}$ is a latent space we have not placed any prior on.

431

430

% The model can therefore choose the posterior distribution of $\rv{u_a}$ in a way to counteract the constrained scale of $\mat{K_{au}}$ and $\mat{K_{uu}}\inv$ in \cref{eq:augmented_joint}\todo{This equation does not exist anymore} and thereby circumvent the prior.

432

431

% To prevent this, we also place a prior on the mean of $\rv{u_a}$ to remove this degree of freedom.


@@ 434,7 +433,6 @@ In nonhierarchical models, this can be achieved by placing a prior on the kerne

434

433

\Cref{fig:wind_joint_model} shows the joint model learned from the data in which $a_1$ is chosen to be the identity function.

435

434

The possible alignments identified match the physical conditions of the wind farm.

436

435

For the given turbines, time offsets of up to six minutes are plausible and for most wind conditions, the offset is expected to be close to zero.

437


In order to calculate the posteriors, the uncertainty about the correct alignment is only propagated through the shared function for $\T_2$, leading to smoother predictions.

438

436

For areas where the alignment is quite certain however, the two time series are explained with comparable detail.

439

437

The model is able to recover unambiguous associations well and successfully places high uncertainty on the alignment in areas where multiple explanations are plausible due to the noisy signal.

440

438



@@ 443,19 +441,21 @@ This uncertainty is propagated through the shared function and results in higher

443

441

Because of the factorization in the model however, we can recover the uncertainties about the alignment and the shared latent function separately.

444

442

\Cref{fig:wind_samples} compares samples drawn from our model with samples drawn from a standard MOGP.

445

443

The MOGP does not handle shortterm dynamics and smooths the signal enough such that the nonlinear alignment can be approximated as constant.

446


This can lead to comparable testloglikelihoods, but since shortterm dynamics are explained as noise, the model cannot be used for predictions in this time scale.

447

444

AMOGP shows richer structure:

448


Samples show that it has learned that a maximum which is missing from the training data has to exist somewhere, but the uncertainty about the correct alignment due to the local turbulence means that different samples place the maximum at different locations in $\mat{X}$direction.


445

+Samples show that it has learned that a maximum which is missing in the training data has to exist somewhere, but the uncertainty about the correct alignment due to the local turbulence means that different samples place the maximum at different locations in $\mat{X}$direction.

449

446


450

447


451

448

\section{Conclusion}

452

449

\label{sec:conclusion}

453

450

\todo{Reformulate this}

454


We have presented how Gaussian processes with multiple outputs can be embedded in a hierarchy to find shared structure in latent spaces.


451

+We have proposed the warped and aligned multioutput Gaussian process (AMOGP), in which MOGPs are embedded in a hierarchy to find shared structure in latent spaces.

455

452

We extended convolution processes \parencite{boyle_dependent_2004} with conditionally independent Gaussian processes on both the input and output sides, giving rise to a highly structured deep GP model.

456

453

This structure can be used to both regularize the model and encode expert knowledge about specific parts of the system.

457

454

By applying nested variational compression \parencite{hensman_nested_2014} to inference in these models, we derived a variational lower bound which combines Bayesian treatment of all parts of the model with scalability via stochastic optimization.

458

455



456

+We compared the model with deep GPs and multioutput GPs on an artificial data set and showed how the richer modelstructure allows the AMOGP to pick up on latent structure the other approaches cannot model.


457

+We then applied the AMOGP to realworld data of two wind turbines and showed how the proposed hierarchy can be used to encode expert knowledge, recover humaninterpretable structure from the data and yield an informative model with uncertainties decomposed along the hierarchy.


458

+

459

459


460

460

\nocite{*}

461

461

\printbibliography
