
@@ -20,34 +20,37 @@
 \author{\href{mailto:markus.kaiser@siemens.com}{Markus Kaiser}}

 \author{
- Markus Kaiser\\
- Siemens AG\\
- Technical University of Munich\\
- \texttt{markus.kaiser@siemens.com}\\
- \And
- Clemens Otte\\
- Siemens AG\\
- \texttt{clemens.otte@siemens.com}\\
- \And
- Thomas Runkler\\
- Siemens AG\\
- Technical University of Munich\\
- \texttt{thomas.runkler@siemens.com}\\
- \And
- Carl Henrik Ek\\
- University of Bristol\\
- \texttt{carlhenrik.ek@bristol.ac.uk}\\
+ % Markus Kaiser\\
+ % Siemens AG\\
+ % Technical University of Munich\\
+ % \texttt{markus.kaiser@siemens.com}\\
+ % \And
+ % Clemens Otte\\
+ % Siemens AG\\
+ % \texttt{clemens.otte@siemens.com}\\
+ % \And
+ % Thomas Runkler\\
+ % Siemens AG\\
+ % Technical University of Munich\\
+ % \texttt{thomas.runkler@siemens.com}\\
+ % \And
+ % Carl Henrik Ek\\
+ % University of Bristol\\
+ % \texttt{carlhenrik.ek@bristol.ac.uk}\\
+ % NOTE: Fix metadata.
+ Anonymous\\
 }

 \begin{document}
 \maketitle

 \begin{abstract}
- We propose a novel and Bayesian approach to finding nonlinear alignments of time series based on latent shared information.
+ We propose a novel Bayesian approach to modelling nonlinear alignments of time series based on latent shared information.
+ We apply the method to the real-world problem of finding common structure in the sensor data of wind turbines, which is introduced by the underlying latent and turbulent wind field.
 The proposed model allows for both arbitrary alignments of the inputs and nonparametric output warpings to transform the observations.
 This gives rise to multiple deep Gaussian process models connected via latent generating processes.
 We present an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.
- The method is applied to an artificial data set and real-world data of two wind turbines.
+ We show results for an artificial data set and real-world data of two wind turbines.
 \end{abstract}

@@ -59,12 +62,12 @@ To optimize the efficiency of a wind turbine the speed and pitch have to be cont
 In a wind farm turbines are typically equipped with sensors for wind speed and direction.
 The goal is to use these sensor data to produce accurate estimates and forecasts of the wind conditions at every turbine in the farm.
 For the ideal case of a homogeneous and very slowly changing wind field, the wind conditions at each geometrical position in a wind farm can be estimated using the propagation times (time warps) computed from geometry, wind speed, and direction \parencite{soleimanzadeh_controller_2011,bitar_coordinated_2013,schepers_improved_2007}.
-In the real world, however, wind fields are not homogeneous, exhibit global and local turbulences, and interfere with the turbines and the terrain inside and outside the farm.
+In the real world, however, wind fields are not homogeneous, exhibit global and local turbulences, and interfere with the turbines and the terrain inside and outside the farm; further, failing sensors can lead to data loss.
 This makes it extremely difficult to construct accurate analytical models of wind propagation in a farm.
 Also, standard approaches for extracting such information from data, e.g.\ generalized time warping \parencite{zhou_generalized_2012}, fail at this task because they rely on a high signal-to-noise ratio.
 Instead, we want to construct Bayesian nonlinear dynamic data-based models for wind conditions and warpings which handle the stochastic nature of the system in a principled manner.

-In this paper, we look at a generalization of this type of problem and propose a novel and Bayesian approach to finding nonlinear alignments of time series based on latent shared information.
+In this paper, we look at a generalization of this type of problem and propose a novel Bayesian approach to finding nonlinear alignments of time series based on latent shared information.
 We view the power production of different wind turbines as the outputs of a multi-output Gaussian process (MOGP) \parencite{alvarez_kernels_2011} which models the latent wind fronts.
 We embed this model in a hierarchy, adding a layer of nonlinear alignments on top and a layer of nonlinear warpings \parencites{NIPS2003_2481,lazarogredilla_bayesian_2012} below, which increases flexibility and encodes the original generative process.
 We show how the resulting model can be interpreted as a group of deep Gaussian processes with the added benefit of covariances between different outputs.


@@ -83,7 +86,7 @@ We show these properties for an artificial data set and for real-world data of t
 We are interested in formulating shared priors over a set of functions $\Set{f_d}_{d=1}^D$ using GPs, thereby directly parameterizing their interdependencies.
 In a traditional GP setting, multiple outputs are considered conditionally independent given the inputs, which significantly reduces the computational cost but also prevents the utilization of shared information.
 Such interdependencies can be formulated via \emph{convolution processes (CPs)} as proposed by \textcite{boyle_dependent_2004}, a generalization of the \emph{linear model of coregionalization (LMC)} \parencite{journel_mining_1978,coburn_geostatistics_2000}.
-In the CP framework, the output functions are the result of a convolution of the latent processes with smoothing kernel functions $T_{d,r}$ for each output $f_d$ and latent process $w_r$, defined as $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.
+In the CP framework, the output functions are the result of a convolution of the latent processes $w_r$ with smoothing kernel functions $T_{d,r}$ for each output $f_d$, defined as $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.

 In this model, the convolutions of the latent processes generating the different outputs are all performed around the same point $\mat{x}$.
 We generalize this by allowing different \emph{alignments} of the observations which depend on the position in the input space.
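For intuition, the CP construction above is easy to simulate on a grid: convolving one shared white-noise process with output-specific smoothing kernels yields dependent outputs. Below is a minimal numpy sketch with a single latent process and squared exponential smoothing kernels; all names and parameter values are illustrative assumptions, not taken from the paper's code.

import numpy as np

# Discretized convolution process: two outputs share one latent
# white-noise process w and differ only in their smoothing kernels.
rng = np.random.default_rng(0)
z = np.linspace(-5.0, 5.0, 2000)               # integration grid
dz = z[1] - z[0]
w = rng.standard_normal(z.size) / np.sqrt(dz)  # discretized white noise

def T(tau, sigma, ell):
    # Squared exponential smoothing kernel T_{d,r}(tau).
    return sigma * np.exp(-0.5 * (tau / ell) ** 2)

def f_d(x, sigma, ell):
    # f_d(x) = int T_{d,r}(x - z) w(z) dz, approximated by a Riemann sum.
    return np.array([np.sum(T(xi - z, sigma, ell) * w) * dz for xi in x])

x = np.linspace(-3.0, 3.0, 200)
f1 = f_d(x, sigma=1.0, ell=0.3)  # both outputs are driven by the same w,
f2 = f_d(x, sigma=0.8, ell=0.6)  # so f1 and f2 are correlated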


@@ -98,23 +101,18 @@ The final model is then given by
 \begin{align}
 \mat{y_d} &= g_d(f_d(a_d(\mat{X}))) + \mat{\epsilon_d},
 \end{align}
-where $f_d$ is equal to the CP model above and $a_d$ and $g_d$ are the respective alignment and warping functions and $\mat{\epsilon_d} \sim \Gaussian{0, \sigma_{y, d}^2\Eye}$ is a noise term.
+where $\mat{\epsilon_d} \sim \Gaussian{0, \sigma_{y, d}^2\Eye}$ is a noise term.
 This encodes the generative process described above:
-For every turbine $\rv{y_d}$, observations at positions $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the latent wind front in $f_d$, imposing turbine-specific components $g_d$ and adding noise through $\rv{\epsilon_d}$.
+For every turbine $\rv{y_d}$, observations at positions $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the front in $f_d$, imposing turbine-specific components $g_d$, and adding noise in $\rv{\epsilon_d}$.

-Because we assume independence between $a_d$ and $g_d$ across outputs, we use GP priors of the form $a_d \sim \GP(\id, k_{a, d})$ and $g_d \sim \GP(\id, k_{g, d})$.
-By setting the prior mean to the identity function $\id(x) = x$, we use the standard CP model as our default assumption.
+We assume independence between $a_d$ and $g_d$ across outputs and apply GP priors of the form $a_d \sim \GP(\id, k_{a, d})$ and $g_d \sim \GP(\id, k_{g, d})$.
+By setting the prior mean to the identity function $\id(x) = x$, the standard CP model is our default assumption.
 During learning, the model can choose the different $a_d$ and $g_d$ in a way that reveals the independent shared latent processes $\Set{w_r}_{r=1}^R$, on which we also place GP priors $w_r \sim \GP(0, k_{u, r})$.
 Similar to \textcite{boyle_dependent_2004}, we assume the latent processes to be independent white noise processes by setting $\Moment{\cov}{w_r(\mat{z}), w_{r^\prime}(\mat{z^\prime})} = \delta_{rr^\prime}\delta_{\mat{z}\mat{z^\prime}}$.
-Under this prior, the $f_d$ are also GPs with zero mean and
-\begin{align}
-\begin{split}
- \MoveEqLeft \Moment{\cov}{f_d(\mat{x}), f_{d^\prime}(\mat{x^\prime})} = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) T_{d^\prime,r}(\mat{x^\prime} - \mat{z}) \diff \mat{z}.
-\end{split}
-\end{align}
+Under this prior, the $f_d$ are also GPs with zero mean and $\Moment{\cov}{f_d(\mat{x}), f_{d^\prime}(\mat{x^\prime})} = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) T_{d^\prime,r}(\mat{x^\prime} - \mat{z}) \diff \mat{z}$.

-Using the squared exponential kernel as the smoothing kernel function for all $T_{d, r}$, the integral can be shown to have a closed form solution.
-With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyper parameters associated with $T_{d,r}$, it is given by
+Using the squared exponential kernel for all $T_{d, r}$, the integral can be shown to have a closed-form solution.
+With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the kernel hyperparameters associated with $T_{d,r}$, it is given by
 \begin{align}
 \label{eq:dependent_kernel}
 \begin{split}
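Read generatively, the hierarchy above can be sampled layer by layer. The following numpy sketch does this for a single output and a single latent process, using a plain GP in place of the full CP for f; kernel choices and all parameter values are illustrative assumptions, not the paper's code.

import numpy as np

rng = np.random.default_rng(1)

def se_kernel(a, b, sigma, ell):
    # Squared exponential kernel matrix k(a, b).
    d = a[:, None] - b[None, :]
    return sigma**2 * np.exp(-0.5 * (d / ell) ** 2)

def sample_gp(mean, cov, jitter=1e-8):
    # Draw one sample from N(mean, cov).
    L = np.linalg.cholesky(cov + jitter * np.eye(mean.size))
    return mean + L @ rng.standard_normal(mean.size)

X = np.linspace(0.0, 10.0, 100)

# Alignment a_d ~ GP(id, k_a): the prior mean is the identity,
# so the sample is drawn around X itself.
a = sample_gp(X, se_kernel(X, X, sigma=0.3, ell=2.0))

# Shared latent function f evaluated at the aligned inputs.
f = sample_gp(np.zeros_like(a), se_kernel(a, a, sigma=1.0, ell=1.0))

# Warping g_d ~ GP(id, k_g) applied to f, plus observation noise.
g = sample_gp(f, se_kernel(f, f, sigma=0.2, ell=1.0))
y = g + 0.05 * rng.standard_normal(g.size)

The identity prior means appear as the mean arguments X and f, matching the priors a_d ~ GP(id, k_{a,d}) and g_d ~ GP(id, k_{g,d}) above.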


@@ -122,7 +120,6 @@ With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyper pa
 \end{split}
 \end{align}
 where $\mat{x}$ is $K$-dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\ell_{d, r, k}^2 + \ell_{d^\prime, r, k}^2}$.
-
 \begin{figure}[t]
 \centering
 \begin{minipage}[t]{.45\textwidth}
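The closed form follows from the fact that the convolution of two Gaussians is again a Gaussian. The numpy sketch below computes the cross-covariance for a single latent process, assuming the smoothing kernels are normalized Gaussian densities weighted by sigma_{d,r}; the prefactor depends on how T_{d,r} is scaled, so treat the normalization here as an assumption rather than the paper's exact convention.

import numpy as np

def cross_cov(x, xp, sigma, ell, sigma_p, ell_p):
    # cov[f_d(x), f_d'(x')] for one latent process, with
    # T_{d,r}(tau) = sigma_{d,r} * prod_k N(tau_k; 0, ell_{d,r,k}^2).
    # The identity int N(x-z; l^2) N(x'-z; l'^2) dz = N(x-x'; l^2+l'^2)
    # yields an SE kernel with the combined length-scale l_hat.
    ell_hat_sq = ell**2 + ell_p**2   # \hat{ell}^2 per input dimension
    diff = x - xp
    norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * ell_hat_sq))
    return sigma * sigma_p * norm * np.exp(-0.5 * np.sum(diff**2 / ell_hat_sq))

# Covariance between output d at x and output d' at x' for K = 2;
# summing such terms over r gives the full kernel.
x, xp = np.array([0.3, 1.0]), np.array([0.5, 0.2])
k = cross_cov(x, xp, sigma=1.0, ell=np.array([0.5, 0.8]),
              sigma_p=0.7, ell_p=np.array([0.4, 1.0]))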


@@ -131,7 +128,7 @@ where $\mat{x}$ is $K$-dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\
 \caption{
 \label{fig:graphical_model_supervised}
 Our proposed graphical model with variational parameters (blue).
- A shared explanation for multiple data sets is learned through a shared CP together with independent alignments and warpings.
+ A CP models shared information between multiple data sets with nonlinear alignments and warpings.
 }
 \end{minipage}
 \hfill


@@ -149,21 +146,27 @@ where $\mat{x}$ is $K$-dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\

 \section{Variational Approximation}
 \label{sec:variational_approximation}
-Since exact inference in this model is intractable, we present a variational approximation to the model's true marginal likelihood and posterior in this section.
+Since exact inference in this model is intractable, we present a variational approximation to the model's marginal likelihood in this section.
 A detailed derivation of the variational bound can be found in the supplementary material.
 Analogously to $\mat{y}$, we denote the random vectors which contain the function values of the respective functions and outputs as $\rv{a}$ and $\rv{f}$.
 The joint probability distribution of the data can then be written as
 \begin{align}
 \begin{split}
 \label{eq:full_model}
- \Prob{\rv{y}, \rv{f}, \rv{a} \given \mat{X}} &= \Prob{\rv{f} \given \rv{a}} \prod_{d=1}^D \Prob{\rv{y_d} \given \rv{f_d}}\Prob{\rv{a_d} \given \rv{X}}, \\
- \rv{a_d} \mid \mat{X} &\sim \Gaussian{\mat{X}, \mat{K_{a, d}} + \sigma^2_{a, d}\Eye}, \\
- \rv{f} \mid \mat{a} &\sim \Gaussian{\mat{0}, \mat{K_f} + \sigma^2_f\Eye}, \\
- \rv{y_d} \mid \mat{f_d} &\sim \Gaussian{\mat{f_d}, \mat{K_{g, d}} + \sigma^2_{y, d}\Eye}.
+ \begin{aligned}
+ &\Prob{\rv{y}, \rv{f}, \rv{a} \given \mat{X}} = \\
+ &\qquad\Prob{\rv{f} \given \rv{a}} \prod_{d=1}^D \Prob{\rv{y_d} \given \rv{f_d}}\Prob{\rv{a_d} \given \rv{X}},
+ \end{aligned}
+ \qquad\qquad
+ \begin{aligned}
+ \rv{a_d} \mid \mat{X} &\sim \Gaussian{\mat{X}, \mat{K_{a, d}} + \sigma^2_{a, d}\Eye}, \\
+ \rv{f} \mid \mat{a} &\sim \Gaussian{\mat{0}, \mat{K_f} + \sigma^2_f\Eye}, \\
+ \rv{y_d} \mid \mat{f_d} &\sim \Gaussian{\mat{f_d}, \mat{K_{g, d}} + \sigma^2_{y, d}\Eye}.
+ \end{aligned}
 \end{split}
 \end{align}
-Here, we use $\mat{K}$ to refer to the Gram matrix corresponding to the kernel of the respective GP.
-All but the convolution processes factorize over both the different levels of the model as well as the different outputs.
+Here, we use $\mat{K}$ to refer to the Gram matrices corresponding to the respective GPs.
+All but the convolution processes factorize over the different levels of the model as well as the different outputs.


 \subsection{Variational Lower Bound}
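The factorization above can be evaluated directly, which is useful for testing an implementation. A scipy sketch with illustrative names; the Gram matrices are assumed to be precomputed from the respective kernels, with K_f built from the dependent kernel above over all aligned inputs.

import numpy as np
from scipy.stats import multivariate_normal as mvn

def joint_log_density(X, a, f, y, K_a, K_f, K_g, s_a, s_f, s_y):
    # log p(y, f, a | X): p(f | a) couples all outputs through the CP
    # Gram matrix K_f, while p(a_d | X) and p(y_d | f_d) factorize
    # over the outputs d.
    # X, a, f, y: per-output lists of vectors; K_a, K_g: per-output
    # Gram matrices; K_f: joint Gram matrix over all aligned inputs.
    f_all = np.concatenate(f)
    lp = mvn.logpdf(f_all, np.zeros(f_all.size),
                    K_f + s_f**2 * np.eye(f_all.size))
    for d in range(len(X)):
        I = np.eye(X[d].size)
        lp += mvn.logpdf(a[d], X[d], K_a[d] + s_a[d]**2 * I)
        lp += mvn.logpdf(y[d], f[d], K_g[d] + s_y[d]**2 * I)
    return lp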


@@ -175,7 +178,6 @@ In this approximation, the original model is augmented with \emph{inducing varia
 In contrast to \cite{titsias_variational_2009}, the distribution $\Variat*{\rv{u}}$ is not chosen optimally but optimized using the closed form $\Variat*{\rv{u}} \sim \Gaussian*{\rv{u} \given \mat{m}, \mat{S}}$.
 This gives rise to the Scalable Variational GP presented in \cite{hensman_gaussian_2013}.
 Second, in order to apply this variational bound to the individual GPs recursively, uncertainties have to be propagated through subsequent layers and inter-layer cross-dependencies are avoided using another variational approximation.
-
 The variational lower bound for the AMOGP is given by
 \begin{align}
 \label{eq:full_bound}
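For reference, the single-layer bound of Hensman et al. (2013) with a Gaussian likelihood and q(u) = N(m, S) fits in a few lines of numpy. This is a sketch of one building block of the nested construction only, not the model's full bound referenced above; variable names are illustrative.

import numpy as np

def svgp_bound(y, K_nm, K_mm, k_nn_diag, m, S, sigma_n):
    # A = K_nm K_mm^{-1} via a Cholesky factorization of K_mm.
    L = np.linalg.cholesky(K_mm + 1e-8 * np.eye(K_mm.shape[0]))
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_nm.T)).T
    mu = A @ m                                    # marginal means of q(f)
    var = (k_nn_diag - np.sum(A * K_nm, axis=1)
           + np.sum((A @ S) * A, axis=1))         # marginal variances
    # Expected Gaussian log-likelihood under q(f).
    ell = -0.5 * np.sum(np.log(2.0 * np.pi * sigma_n**2)
                        + ((y - mu)**2 + var) / sigma_n**2)
    # KL[q(u) || p(u)] between N(m, S) and N(0, K_mm).
    M = m.size
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, m))
    kl = 0.5 * (np.trace(np.linalg.solve(L.T, np.linalg.solve(L, S)))
                + m @ alpha - M
                + 2.0 * np.sum(np.log(np.diag(L)))
                - np.linalg.slogdet(S)[1])
    return ell - kl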


@@ -366,25 +368,41 @@ Because of this, the model cannot recover the latent sine function and can only
 \centering
 \begin{subfigure}{.475\linewidth}
 \centering
- \includestandalonewithpath{figures/wind_alignment_samples_left}
+ \includestandalonewithpath{figures/wind_shallow_gp_samples_left}
 \caption{
 \label{fig:wind_samples:a}
- Samples drawn from the AMOGP.
+ Samples from a GP.
 }
 \end{subfigure}
 \hfill
 \begin{subfigure}{.475\linewidth}
 \centering
- \includestandalonewithpath{figures/wind_shallow_gp_samples_left}
+ \includestandalonewithpath{figures/wind_mo_gp_samples_left}
 \caption{
 \label{fig:wind_samples:b}
- Samples drawn from a MOGP.
+ Samples from a MOGP.
+ }
+ \end{subfigure}\\[\baselineskip]
+ \begin{subfigure}{.475\linewidth}
+ \centering
+ \includestandalonewithpath{figures/wind_dgp_samples_left}
+ \caption{
+ \label{fig:wind_samples:c}
+ Samples from a DGP.
+ }
+ \end{subfigure}
+ \hfill
+ \begin{subfigure}{.475\linewidth}
+ \centering
+ \includestandalonewithpath{figures/wind_alignment_samples_left}
+ \caption{
+ \label{fig:wind_samples:d}
+ Samples from the AMOGP.
 }
 \end{subfigure}
 \caption{
 \label{fig:wind_samples}
- A comparison of noiseless samples drawn from a MOGP and the AMOGP.
- The MOGP smoothens the data and models local variation as noise, while our model picks up on local structure and predicts the maximum in the center.
+ A comparison of noiseless samples drawn from a GP, a MOGP, a DGP and the AMOGP.
 }
 \end{figure}
 \begin{table}[t]


@@ -410,6 +428,7 @@ This experiment is based on real data recorded from a pair of neighbouring wind
 The two time series $\T_1$ and $\T_2$ shown in gray in \cref{fig:wind_joint_model} record the respective power generation of the two turbines over the course of one and a half hours, which was smoothed slightly using a rolling average over 60 seconds.
 There are 5400 data points for the first blue turbine and 4622 data points for the second green turbine.
 We removed two intervals (drawn as dashed lines) from the second turbine's data set to inspect the behaviour of the model with missing data.
+This allows us to evaluate and compare the generative properties of our model in \cref{fig:wind_samples}.

 The amount of power generated by a wind turbine is mainly dependent on the speed of the wind fronts interacting with the turbine.
 For system identification tasks concerned with the behaviour of multiple wind turbines, associating the observations on different turbines that are due to the same wind fronts is an important task.


@@ -439,8 +458,8 @@ The model is able to recover unambiguous associations well and successfully plac
 As expected, the uncertainty about the alignment also grows where data for the second time series is missing.
 This uncertainty is propagated through the shared function and results in higher predictive variances for the second time series.
 Because of the factorization in the model, however, we can recover the uncertainties about the alignment and the shared latent function separately.
-\Cref{fig:wind_samples} compares samples drawn from our model with samples drawn from a standard MOGP.
-The MOGP does not handle short-term dynamics and smooths the signal enough such that the nonlinear alignment can be approximated as constant.
+\Cref{fig:wind_samples} compares samples drawn from our model with samples drawn from a GP, a MOGP and a DGP.
+The GP and deep GP revert to their respective priors when data is missing, while the MOGP does not handle short-term dynamics and smooths the signal enough that the nonlinear alignment can be approximated as constant.
 The AMOGP shows richer structure:
 Samples show that it has learned that a maximum which is missing in the training data has to exist somewhere, but the uncertainty about the correct alignment due to the local turbulence means that different samples place the maximum at different locations in the $\mat{X}$-direction.



@@ -450,10 +469,12 @@ Samples show that it has learned that a maximum which is missing in the training
 We have proposed the warped and aligned multi-output Gaussian process (AMOGP), in which MOGPs are embedded in a hierarchy to find shared structure in latent spaces.
 We extended convolution processes \parencite{boyle_dependent_2004} with conditionally independent Gaussian processes on both the input and output sides, giving rise to a highly structured deep GP model.
 This structure can be used to both regularize the model and encode expert knowledge about specific parts of the system.
-By applying nested variational compression \parencite{hensman_nested_2014} to inference in these models, we derived a variational lower bound which combines Bayesian treatment of all parts of the model with scalability via stochastic optimization.
+By applying nested variational compression \parencite{hensman_nested_2014} to inference in these models, we presented a variational lower bound which combines Bayesian treatment of all parts of the model with scalability via stochastic optimization.

-We compared the model with deep GPs and multi-output GPs on an artificial data set and showed how the richer model structure allows the AMOGP to pick up on latent structure the other approaches cannot model.
-We then applied the AMOGP to real-world data of two wind turbines and showed how the proposed hierarchy can be used to encode expert knowledge, recover human-interpretable structure from the data and yield an informative model with uncertainties decomposed along the hierarchy.
+We compared the model with GPs, deep GPs and multi-output GPs on an artificial data set and showed how the richer model structure allows the AMOGP to pick up on latent structure which the other approaches cannot model.
+We then applied the AMOGP to real-world data of two wind turbines and used the proposed hierarchy to model wind propagation in a wind farm and recover information about the latent, non-homogeneous wind field.
+With uncertainties decomposed along the hierarchy, our approach handles ambiguities introduced by the stochasticity of the wind in a principled manner.
+This indicates that the AMOGP is well suited to this kind of dynamical system, where multiple misaligned sensors measure the same latent effect.


 \nocite{*}
