
Add last feedback; Squeeze bound to page 3

Markus Kaiser, 8 months ago
parent / commit 1b8cb7d362

3 changed files with 77 additions and 56 deletions
  1. BIN      bayesian_warped_dependent_gp.pdf
  2. +76 -55  bayesian_warped_dependent_gp.tex
  3. +1  -1   preamble/packages_paper.tex

bayesian_warped_dependent_gp.pdf  (BIN)


bayesian_warped_dependent_gp.tex  (+76 -55)

@@ -20,34 +20,37 @@
 \author{\href{mailto:markus.kaiser@siemens.com}{Markus Kaiser}}
 
 \author{
-    Markus Kaiser\\
-    Siemens AG\\
-    Technical University of Munich\\
-    \texttt{markus.kaiser@siemens.com}\\
-    \And
-    Clemens Otte\\
-    Siemens AG\\
-    \texttt{clemens.otte@siemens.com}\\
-    \And
-    Thomas Runkler\\
-    Siemens AG\\
-    Technical University of Munich\\
-    \texttt{thomas.runkler@siemens.com}\\
-    \And
-    Carl Henrik Ek\\
-    University of Bristol\\
-    \texttt{carlhenrik.ek@bristol.ac.uk}\\
+    % Markus Kaiser\\
+    % Siemens AG\\
+    % Technical University of Munich\\
+    % \texttt{markus.kaiser@siemens.com}\\
+    % \And
+    % Clemens Otte\\
+    % Siemens AG\\
+    % \texttt{clemens.otte@siemens.com}\\
+    % \And
+    % Thomas Runkler\\
+    % Siemens AG\\
+    % Technical University of Munich\\
+    % \texttt{thomas.runkler@siemens.com}\\
+    % \And
+    % Carl Henrik Ek\\
+    % University of Bristol\\
+    % \texttt{carlhenrik.ek@bristol.ac.uk}\\
+    % NOTE: Fix metadata.
+    Anonymous\\
 }
 
 \begin{document}
 \maketitle
 
 \begin{abstract}
-    We propose a novel and Bayesian approach to finding nonlinear alignments of time series based on latent shared information.
+    We propose a novel Bayesian approach to modelling nonlinear alignments of time series based on latent shared information.
+    We apply the method to the real-world problem of finding common structure in the sensor data of wind turbines, introduced by the underlying latent and turbulent wind field.
     The proposed model allows for both arbitrary alignments of the inputs and non-parametric output warpings to transform the observations.
     This gives rise to multiple deep Gaussian process models connected via latent generating processes.
     We present an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.
-    The method is applied to an artificial data set and real-world data of two wind turbines.
+    We show results for an artificial data set and real-world data of two wind turbines.
 \end{abstract}
 
 
@@ -59,12 +62,12 @@ To optimize the efficiency of a wind turbine the speed and pitch have to be cont
 In a wind farm turbines are typically equipped with sensors for wind speed and direction.
 The goal is to use these sensor data to produce accurate estimates and forecasts of the wind conditions at every turbine in the farm.
 For the ideal case of a homogeneous and very slowly changing wind field, the wind conditions at each geometrical position in a wind farm can be estimated using the propagation times (time warps) computed from geometry, wind speed, and direction \parencite{soleimanzadeh_controller_2011,bitar_coordinated_2013,schepers_improved_2007}.
-In the real world, however, wind fields are not homogeneous, exhibit global and local turbulences, and interfere with the turbines and the terrain inside and outside the farm.
+In the real world, however, wind fields are not homogeneous, exhibit global and local turbulences, and interfere with the turbines and the terrain inside and outside the farm. Further, failing sensors can lead to data loss.
 This makes it extremely difficult to construct accurate analytical models of wind propagation in a farm.
 Also, standard approaches for extracting such information from data, e.g.\ generalized time warping \parencite{zhou_generalized_2012}, fail at this task because they rely on a high signal-to-noise ratio.
 Instead, we want to construct Bayesian nonlinear dynamic data-based models for wind conditions and warpings which handle the stochastic nature of the system in a principled manner.
 
-In this paper, we look at a generalization of this type of problem and propose a novel and Bayesian approach to finding nonlinear alignments of time series based on latent shared information.
+In this paper, we look at a generalization of this type of problem and propose a novel Bayesian approach to finding nonlinear alignments of time series based on latent shared information.
 We view the power production of different wind turbines as the outputs of a multi-output Gaussian process (MO-GP) \parencite{alvarez_kernels_2011} which models the latent wind fronts.
 We embed this model in a hierarchy, adding a layer of non-linear alignments on top and a layer of non-linear warpings \parencites{NIPS2003_2481,lazaro-gredilla_bayesian_2012} below, which increases flexibility and encodes the original generative process.
 We show how the resulting model can be interpreted as a group of deep Gaussian processes with the added benefit of covariances between different outputs.
@@ -83,7 +86,7 @@ We show these properties for an artificial data set and for real-world data of t
 We are interested in formulating shared priors over a set of functions $\Set{f_d}_{d=1}^D$ using GPs, thereby directly parameterizing their interdependencies.
 In a traditional GP setting, multiple outputs are considered conditionally independent given the inputs, which significantly reduces the computational cost but also prevents the utilization of shared information.
 Such interdependencies can be formulated via \emph{convolution processes (CPs)} as proposed by \textcite{boyle_dependent_2004}, a generalization of the \emph{linear model of coregionalization (LMC)} \parencite{journel_mining_1978,coburn_geostatistics_2000}.
-In the CP framework, the output functions are the result of a convolution of the latent processes with smoothing kernel functions $T_{d,r}$ for each output $f_d$ and latent process $w_r$, defined as $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.
+In the CP framework, the output functions are the result of a convolution of the latent processes $w_r$ with smoothing kernel functions $T_{d,r}$ for each output $f_d$, defined as $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.
 
 In this model, the convolutions of the latent processes generating the different outputs are all performed around the same point $\mat{x}$.
 We generalize this by allowing different \emph{alignments} of the observations which depend on the position in the input space.
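
For illustration, the CP construction can be sketched numerically. The following Python snippet (all names, the grid resolution and the choice R = 1 are illustrative assumptions, not from the paper) discretizes f_d(x) = sum_r int T_{d,r}(x - z) w_r(z) dz, so that two outputs driven by the same white-noise draw come out correlated:

    import numpy as np

    # Sketch: one latent white-noise process (R = 1) convolved with two
    # SE smoothing kernels to produce two dependent outputs.
    rng = np.random.default_rng(0)

    z = np.linspace(-5.0, 5.0, 500)                 # integration grid
    dz = z[1] - z[0]
    w = rng.standard_normal(z.size) / np.sqrt(dz)   # discretized white noise w_r

    def T(tau, sigma, ell):
        # SE smoothing kernel T_{d,r}(tau).
        return sigma * np.exp(-tau**2 / (2.0 * ell**2))

    def f_d(x, sigma, ell):
        # f_d(x) = int T_{d,r}(x - z) w_r(z) dz, approximated on the grid.
        return np.array([np.sum(T(xi - z, sigma, ell) * w) * dz for xi in x])

    x = np.linspace(-3.0, 3.0, 100)
    f_1 = f_d(x, sigma=1.0, ell=0.3)   # both outputs share the same draw of w_r,
    f_2 = f_d(x, sigma=0.8, ell=0.6)   # so they are correlated but not identical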
@@ -98,23 +101,18 @@ The final model is then given by
 \begin{align}
     \mat{y_d} &= g_d(f_d(a_d(\mat{X}))) + \mat{\epsilon_d},
 \end{align}
-where $f_d$ is equal to the CP model above and $a_d$ and $g_d$ are the respective alignment and warping functions and $\mat{\epsilon_d} \sim \Gaussian{0, \sigma_{y, d}^2\Eye}$ is a noise term.
+where $\mat{\epsilon_d} \sim \Gaussian{0, \sigma_{y, d}^2\Eye}$ is a noise term.
 This encodes the generative process described above:
-For every turbine $\rv{y_d}$, observations at positions $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the latent wind front in $f_d$, imposing turbine-specific components $g_d$ and adding noise through $\rv{\epsilon_d}$.
+For every turbine $\rv{y_d}$, observations at positions $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the front in $f_d$, imposing turbine-specific components $g_d$ and adding noise in $\rv{\epsilon_d}$.
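
As a minimal generative sketch of this hierarchy (one-dimensional inputs, SE kernels and hand-picked hyperparameters are illustrative assumptions; for brevity a single SE GP stands in for the shared CP layer f_d):

    import numpy as np

    rng = np.random.default_rng(1)

    def se_kernel(a, b, ell, var):
        return var * np.exp(-(a[:, None] - b[None, :])**2 / (2.0 * ell**2))

    def sample_gp(x, mean, ell, var, jitter=1e-6):
        # Draw one sample from GP(mean, k) evaluated at the inputs x.
        K = se_kernel(x, x, ell, var) + jitter * np.eye(x.size)
        return mean + np.linalg.cholesky(K) @ rng.standard_normal(x.size)

    X = np.linspace(0.0, 10.0, 200)
    a_d = sample_gp(X, mean=X, ell=3.0, var=0.05)      # alignment, identity mean
    f_d = sample_gp(a_d, mean=0.0, ell=1.0, var=1.0)   # shared function at aligned inputs
    g_d = sample_gp(f_d, mean=f_d, ell=1.0, var=0.01)  # warping, identity mean
    y_d = g_d + 0.1 * rng.standard_normal(X.size)      # observation noise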
 
-Because we assume independence between $a_d$ and $g_d$ across outputs, we use GP priors of the form $a_d \sim \GP(\id, k_{a, d})$ and $g_d \sim \GP(\id, k_{g, d})$.
-By setting the prior mean to the identity function $\id(x) = x$, we use the standard CP model as our default assumption.
+We assume independence between $a_d$ and $g_d$ across outputs and apply GP priors of the form $a_d \sim \GP(\id, k_{a, d})$ and $g_d \sim \GP(\id, k_{g, d})$.
+By setting the prior mean to the identity function $\id(x) = x$, the standard CP model is our default assumption.
 During learning, the model can choose the different $a_d$ and $g_d$ in a way to reveal the independent shared latent processes $\Set{w_r}_{r=1}^R$ on which we also place GP priors $w_r \sim \GP(0, k_{u, r})$.
 Similar to \textcite{boyle_dependent_2004}, we assume the latent processes to be independent white noise processes by setting $\Moment{\cov}{w_r(\mat{z}), w_{r^\prime}(\mat{z^\prime})} = \delta_{rr^\prime}\delta_{\mat{z}\mat{z^\prime}}$.
-Under this prior, the $f_d$ are also GPs  with zero mean and
-\begin{align}
-\begin{split}
-    \MoveEqLeft \Moment{\cov}{f_d(\mat{x}), f_{d^\prime}(\mat{x^\prime})} = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) T_{d^\prime,r}(\mat{x^\prime} - \mat{z}) \diff \mat{z}.
-\end{split}
-\end{align}
+Under this prior, the $f_d$ are also GPs with zero mean and $\Moment{\cov}{f_d(\mat{x}), f_{d^\prime}(\mat{x^\prime})} = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) T_{d^\prime,r}(\mat{x^\prime} - \mat{z}) \diff \mat{z}$.
 
-Using the squared exponential kernel as the smoothing kernel function for all $T_{d, r}$, the integral can be shown to have a closed form solution.
-With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyper parameters associated with $T_{d,r}$, it is given by
+Using the squared exponential kernel for all $T_{d, r}$, the integral can be shown to have a closed form solution.
+With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the kernel hyper parameters associated with $T_{d,r}$, it is given by
 \begin{align}
 \label{eq:dependent_kernel}
 \begin{split}
@@ -122,7 +120,6 @@ With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyper pa
 \end{split}
 \end{align}
 where $\mat{x}$ is $K$-dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\ell_{d, r, k}^2 + \ell_{d^\prime, r, k}^2}$.
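
The body of \cref{eq:dependent_kernel} is unchanged and hence elided from this diff. As a sketch only, the standard convolved-SE cross-covariance (in the spirit of \textcite{boyle_dependent_2004}) with the combined lengthscale given above can be computed as follows; the normalization constant may differ from the paper's convention, and all names are illustrative:

    import numpy as np

    def k_cross(x, x_p, sigma_d, ell_d, sigma_dp, ell_dp):
        # Sketch of cov(f_d(x), f_d'(x')) for SE smoothing kernels and one
        # latent process r; sum over r for R > 1. ell_* may be K-vectors.
        x, x_p = np.atleast_1d(x), np.atleast_1d(x_p)
        ell_hat_sq = ell_d**2 + ell_dp**2   # \hat{ell}^2 per input dimension
        norm = np.prod(np.sqrt(2.0 * np.pi * ell_d**2 * ell_dp**2 / ell_hat_sq))
        return sigma_d * sigma_dp * norm * np.exp(
            -0.5 * np.sum((x - x_p)**2 / ell_hat_sq))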
-
 \begin{figure}[t]
     \centering
     \begin{minipage}[t]{.45\textwidth}
@@ -131,7 +128,7 @@ where $\mat{x}$ is $K$-dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\
         \caption{
             \label{fig:graphical_model_supervised}
             Our proposed graphical model with variational parameters (blue).
-            A shared explanation for multiple data sets is learned through a shared CP together with independent alignments and warpings.
+            A CP models shared information between multiple data sets with nonlinear alignments and warpings.
         }
     \end{minipage}
     \hfill
@@ -149,21 +146,27 @@ where $\mat{x}$ is $K$-dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\
 
 \section{Variational Approximation}
 \label{sec:variational_approximation}
-Since exact inference in this model is intractable, we present a variational approximation to the model's true marginal likelihood and posterior in this section.
+Since exact inference in this model is intractable, we present a variational approximation to the model's marginal likelihood in this section.
 A detailed derivation of the variational bound can be found in the supplementary material.
 Analogously to $\mat{y}$, we denote the random vectors which contain the function values of the respective functions and outputs as $\rv{a}$ and $\rv{f}$.
 The joint probability distribution of the data can then be written as
 \begin{align}
 \begin{split}
     \label{eq:full_model}
-    \Prob{\rv{y}, \rv{f}, \rv{a} \given \mat{X}} &= \Prob{\rv{f} \given \rv{a}} \prod_{d=1}^D \Prob{\rv{y_d} \given \rv{f_d}}\Prob{\rv{a_d} \given \rv{X}}, \\
-    \rv{a_d} \mid \mat{X} &\sim \Gaussian{\mat{X}, \mat{K_{a, d}} + \sigma^2_{a, d}\Eye}, \\
-    \rv{f} \mid \mat{a} &\sim \Gaussian{\mat{0}, \mat{K_f} + \sigma^2_f\Eye}, \\
-    \rv{y_d} \mid \mat{f_d} &\sim \Gaussian{\mat{f_d}, \mat{K_{g, d}} + \sigma^2_{y, d}\Eye}.
+    \begin{aligned}
+        &\Prob{\rv{y}, \rv{f}, \rv{a} \given \mat{X}} = \\
+        &\qquad\Prob{\rv{f} \given \rv{a}} \prod_{d=1}^D \Prob{\rv{y_d} \given \rv{f_d}}\Prob{\rv{a_d} \given \rv{X}},
+    \end{aligned}
+    \qquad\qquad
+    \begin{aligned}
+        \rv{a_d} \mid \mat{X} &\sim \Gaussian{\mat{X}, \mat{K_{a, d}} + \sigma^2_{a, d}\Eye}, \\
+        \rv{f} \mid \mat{a} &\sim \Gaussian{\mat{0}, \mat{K_f} + \sigma^2_f\Eye}, \\
+        \rv{y_d} \mid \mat{f_d} &\sim \Gaussian{\mat{f_d}, \mat{K_{g, d}} + \sigma^2_{y, d}\Eye}.
+    \end{aligned}
 \end{split}
 \end{align}
-Here, we use $\mat{K}$ to refer to the Gram matrix corresponding to the kernel of the respective GP.
-All but the convolution processes factorize over both the different levels of the model as well as the different outputs.
+Here, we use $\mat{K}$ to refer to the Gram matrices corresponding to the respective GPs.
+All but the convolution processes factorize over the different levels of the model as well as the different outputs.
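
A direct way to read \cref{eq:full_model} is as a sum of Gaussian log-densities. The sketch below evaluates it for given draws; the names are illustrative assumptions, K_f must be the CP Gram matrix built from the aligned inputs, and K_a[d], K_g[d] are the per-output Gram matrices:

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def log_joint(y, f, a, X, K_a, K_f, K_g, sig_a, sig_f, sig_y):
        # log p(y, f, a | X): y, f, a and X are lists over the D outputs;
        # only p(f | a) couples the outputs, so f is evaluated stacked.
        f_all = np.concatenate(f)
        lp = mvn.logpdf(f_all, np.zeros(f_all.size),
                        K_f + sig_f**2 * np.eye(f_all.size))
        for d in range(len(y)):
            n = len(y[d])
            lp += mvn.logpdf(a[d], X[d], K_a[d] + sig_a[d]**2 * np.eye(n))
            lp += mvn.logpdf(y[d], f[d], K_g[d] + sig_y[d]**2 * np.eye(n))
        return lp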
 
 
 \subsection{Variational Lower Bound}
@@ -175,7 +178,6 @@ In this approximation, the original model is augmented with \emph{inducing varia
 In contrast to \cite{titsias_variational_2009}, the distribution $\Variat*{\rv{u}}$ is not chosen optimally but optimized using the closed form $\Variat*{\rv{u}} \sim \Gaussian*{\rv{u} \given \mat{m}, \mat{S}}$.
 This gives rise to the Scalable Variational GP presented in \cite{hensman_gaussian_2013}.
 Second, in order to apply this variational bound for the individual GPs recursively, uncertainties have to be propagated through subsequent layers and inter-layer cross-dependencies are avoided using another variational approximation.
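
The first step can be sketched concretely: with q(u) = N(m, S), integrating u out of the standard SVGP construction of \cite{hensman_gaussian_2013} yields the Gaussian marginal below (array names are illustrative assumptions):

    import numpy as np

    def svgp_marginal(K_xx, K_xu, K_uu, m, S, jitter=1e-8):
        # q(f) = N(mu, Sigma) after marginalizing q(u) = N(m, S):
        #   mu    = K_xu K_uu^{-1} m
        #   Sigma = K_xx - K_xu K_uu^{-1} K_ux + K_xu K_uu^{-1} S K_uu^{-1} K_ux
        L = np.linalg.cholesky(K_uu + jitter * np.eye(K_uu.shape[0]))
        A = np.linalg.solve(L.T, np.linalg.solve(L, K_xu.T))  # K_uu^{-1} K_ux
        return A.T @ m, K_xx - K_xu @ A + A.T @ S @ A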
-
 The variational lower bound for the AMO-GP is given by
 \begin{align}
 \label{eq:full_bound}
@@ -366,25 +368,41 @@ Because of this, the model cannot recover the latent sine function and can only
     \centering
     \begin{subfigure}{.475\linewidth}
         \centering
-        \includestandalonewithpath{figures/wind_alignment_samples_left}
+        \includestandalonewithpath{figures/wind_shallow_gp_samples_left}
         \caption{
             \label{fig:wind_samples:a}
-            Samples drawn from the AMO-GP.
+            Samples from a GP.
         }
     \end{subfigure}
     \hfill
     \begin{subfigure}{.475\linewidth}
         \centering
-        \includestandalonewithpath{figures/wind_shallow_gp_samples_left}
+        \includestandalonewithpath{figures/wind_mo_gp_samples_left}
         \caption{
             \label{fig:wind_samples:b}
-            Samples drawn from a MO-GP.
+            Samples from a MO-GP.
+        }
+    \end{subfigure}\\[\baselineskip]
+    \begin{subfigure}{.475\linewidth}
+        \centering
+        \includestandalonewithpath{figures/wind_dgp_samples_left}
+        \caption{
+            \label{fig:wind_samples:c}
+            Samples from a DGP.
+        }
+    \end{subfigure}
+    \hfill
+    \begin{subfigure}{.475\linewidth}
+        \centering
+        \includestandalonewithpath{figures/wind_alignment_samples_left}
+        \caption{
+            \label{fig:wind_samples:d}
+            Samples from the AMO-GP.
         }
     \end{subfigure}
     \caption{
         \label{fig:wind_samples}
-        A comparison of noiseless samples drawn from a MO-GP and the AMO-GP.
-        The MO-GP smoothens the data and models local variation as noise, while our model picks up on local structure and predicts the maximum in the center.
+        A comparison of noiseless samples drawn from a GP, a MO-GP, a DGP and the AMO-GP.
     }
 \end{figure}
 \begin{table}[t]
@@ -410,6 +428,7 @@ This experiment is based on real data recorded from a pair of neighbouring wind
 The two time series $\T_1$ and $\T_2$ shown in gray in \cref{fig:wind_joint_model} record the respective power generation of the two turbines over the course of one and a half hours, which was smoothed slightly using a rolling average over 60 seconds.
 There are 5400 data points for the first blue turbine and 4622 data points for the second green turbine.
 We removed two intervals (drawn as dashed lines) from the second turbine's data set to inspect the behaviour of the model with missing data.
+This allows us to evaluate and compare the generative properties of our model in \cref{fig:wind_samples}.
 
 The amount of power generated by a wind turbine is mainly dependent on the speed of the wind fronts interacting with the turbine.
 For system identification tasks concerned with the behaviour of multiple wind turbines, associating the observations on different turbines due to the same wind fronts is an important task.
@@ -439,8 +458,8 @@ The model is able to recover unambiguous associations well and successfully plac
 As expected, the uncertainty about the alignment also grows where data for the second time series is missing.
 This uncertainty is propagated through the shared function and results in higher predictive variances for the second time series.
 Because of the factorization in the model, however, we can recover the uncertainties about the alignment and the shared latent function separately.
-\Cref{fig:wind_samples} compares samples drawn from our model with samples drawn from a standard MO-GP.
-The MO-GP does not handle short-term dynamics and smooths the signal enough such that the nonlinear alignment can be approximated as constant.
+\Cref{fig:wind_samples} compares samples drawn from our model with samples drawn from a GP, a MO-GP and a DGP.
+The GP and deep GP revert to their respective priors when data is missing, while the MO-GP does not handle short-term dynamics and smooths the signal enough such that the nonlinear alignment can be approximated as constant.
 The AMO-GP shows richer structure:
 Samples show that it has learned that a maximum which is missing in the training data has to exist somewhere, but the uncertainty about the correct alignment due to the local turbulence means that different samples place the maximum at different locations in the $\mat{X}$-direction.
 
@@ -450,10 +469,12 @@ Samples show that it has learned that a maximum which is missing in the training
 We have proposed the warped and aligned multi-output Gaussian process (AMO-GP), in which MO-GPs are embedded in a hierarchy to find shared structure in latent spaces.
 We extended convolution processes \parencite{boyle_dependent_2004} with conditionally independent Gaussian processes on both the input and output sides, giving rise to a highly structured deep GP model.
 This structure can be used to both regularize the model and encode expert knowledge about specific parts of the system.
-By applying nested variational compression \parencite{hensman_nested_2014} to inference in these models, we derived a variational lower bound which combines Bayesian treatment of all parts of the model with scalability via stochastic optimization.
+By applying nested variational compression \parencite{hensman_nested_2014} to inference in these models, we presented a variational lower bound which combines Bayesian treatment of all parts of the model with scalability via stochastic optimization.
 
-We compared the model with deep GPs and multi-output GPs on an artificial data set and showed how the richer model-structure allows the AMO-GP to pick up on latent structure the other approaches cannot model.
-We then applied the AMO-GP to real-world data of two wind turbines and showed how the proposed hierarchy can be used to encode expert knowledge, recover human-interpretable structure from the data and yield an informative model with uncertainties decomposed along the hierarchy.
+We compared the model with GPs, deep GPs and multi-output GPs on an artificial data set and showed how the richer model structure allows the AMO-GP to pick up on latent structure which the other approaches cannot model.
+We then applied the AMO-GP to real-world data of two wind turbines and used the proposed hierarchy to model wind propagation in a wind farm and recover information about the latent non-homogeneous wind field.
+With uncertainties decomposed along the hierarchy, our approach handles ambiguities introduced by the stochasticity of the wind in a principled manner.
+This indicates that the AMO-GP is a good approach for these kinds of dynamical systems, where multiple misaligned sensors measure the same latent effect.
 
 
 \nocite{*}

preamble/packages_paper.tex  (+1 -1)

@@ -22,7 +22,7 @@
 \usepackage{microtype}
 
 % Layout
-\usepackage[labelfont=bf, format=plain, indention=1em]{caption}
+\usepackage[labelfont=bf, format=plain]{caption}
 \usepackage[skip=3pt]{subcaption}
 
 % Tables
