
Reformulate the abstract, intro and conclusion

arxiv-v3
Markus Kaiser 1 year ago
parent commit 8511783837
3 changed files with 57 additions and 25 deletions
  1. BIN      bayesian_warped_dependent_gp.pdf
  2. +25 −25  bayesian_warped_dependent_gp.tex
  3. +32 −0   zotero_export.bib

BIN  bayesian_warped_dependent_gp.pdf

+25 −25  bayesian_warped_dependent_gp.tex

@@ -43,12 +43,11 @@
 \maketitle

 \begin{abstract}
-    \todo{Reformulate abstract?}
-    We present a Bayesian extension to convolution processes which defines a representation between multiple functions via an embedding in a shared latent space.
+    We propose a novel Bayesian approach to finding nonlinear alignments of time series based on latent shared information.
     The proposed model allows for both arbitrary alignments of the inputs and non-parametric output warpings to transform the observations.
     This gives rise to multiple deep Gaussian process models connected via latent generating processes.
-    We derive an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.
-    The method is applied to both an artificial data set and measurements of a pair of wind turbines.
+    We present an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem.
+    The method is applied to an artificial data set and real-world data of two wind turbines.
 \end{abstract}


@@ -57,9 +56,9 @@ Many real-world systems are inherently hierarchical and connected.
 Ideally, a machine learning method should model and recognize such dependencies.
 Take wind power production, which is one of the major providers for renewable energy today, as an example:
 To optimize the efficiency of a wind turbine the speed and pitch have to be controlled according to the local wind conditions (speed and direction).
-In a wind farm typically one or several turbines are equipped with sensors for wind speed and direction.
+In a wind farm, turbines are typically equipped with sensors for wind speed and direction.
 The goal is to use these sensor data to produce accurate estimates and forecasts of the wind conditions at every turbine in the farm.
-For the ideal case of a homogeneous and very slowly changing wind field, the wind conditions at each geometrical position in a wind farm can be estimated using the propagation times (time warps) computed from geometry, wind speed, and direction.
+For the ideal case of a homogeneous and very slowly changing wind field, the wind conditions at each geometrical position in a wind farm can be estimated using the propagation times (time warps) computed from geometry, wind speed, and direction \parencite{soleimanzadeh_controller_2011,bitar_coordinated_2013,schepers_improved_2007}.
 In the real world, however, wind fields are not homogeneous, exhibit global and local turbulences, and interfere with the turbines and the terrain inside and outside the farm.
 This makes it extremely difficult to construct accurate analytical models of wind propagation in a farm.
 Also, standard approaches for extracting such information from data, e.g.\ generalized time warping \parencite{zhou_generalized_2012}, fail at this task because they rely on a high signal to noise ratio.
@@ -76,7 +75,7 @@ This MO-GP acts as an interface to share information between the different GPs w
 The paper has the following contributions:
 In \cref{sec:model}, we propose a hierarchical, warped and aligned multi-output Gaussian process (AMO-GP).
 In \cref{sec:variational_approximation}, we present an efficient learning scheme via an approximation to the marginal likelihood which allows us to fully exploit the regularization provided by our structure, yielding highly interpretable results.
-We show these properties for an artificial data set and observations for a pair of wind turbines in \cref{sec:experiments}.
+We show these properties for an artificial data set and for real-world data of two wind turbines in \cref{sec:experiments}.


 \section{Model Definition}
@@ -84,7 +83,7 @@ We show these properties for an artificial data set and observations for a pair
 We are interested in formulating shared priors over a set of functions $\Set{f_d}_{d=1}^D$ using GPs, thereby directly parameterizing their interdependencies.
 In a traditional GP setting, multiple outputs are considered conditionally independent given the inputs, which significantly reduces the computational cost but also prevents the utilization of shared information.
 Such interdependencies can be formulated via \emph{convolution processes (CPs)} as proposed by \textcite{boyle_dependent_2004}, a generalization of the \emph{linear model of coregionalization (LMC)} \parencite{journel_mining_1978,coburn_geostatistics_2000}.
-In the CP framework, the output functions are the result of a convolution of the latent processes with smoothing kernel functions $T_{d,r}$ for each output $f_d$ and latent process $w_r$, it holds that $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.
+In the CP framework, the output functions are the result of a convolution of the latent processes with smoothing kernel functions $T_{d,r}$ for each output $f_d$ and latent process $w_r$, defined as $f_d(\mat{x}) = \sum_{r=1}^R \int T_{d,r}(\mat{x} - \mat{z}) \cdot w_r(\mat{z}) \diff \mat{z}$.

 In this model, the convolutions of the latent processes generating the different outputs are all performed around the same point $\mat{x}$.
 We generalize this by allowing different \emph{alignments} of the observations which depend on the position in the input space.
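For intuition, the CP construction in this hunk can be sketched numerically. A minimal numpy sketch (illustrative names and values, not the paper's code) that discretizes the convolution integral; the two outputs become dependent because both are smoothed versions of the same latent draw:

import numpy as np

rng = np.random.default_rng(0)
z = np.linspace(0.0, 10.0, 500)              # grid discretizing the integral over z
dz = z[1] - z[0]
w = rng.standard_normal(z.size)              # draw of the shared latent (white-noise) process w_r

def T(x, z, sigma, ell):
    # Gaussian smoothing kernel T_d(x - z) evaluated on a grid.
    return sigma * np.exp(-0.5 * (x[:, None] - z[None, :]) ** 2 / ell ** 2)

x = np.linspace(0.0, 10.0, 200)
f_1 = T(x, z, sigma=1.0, ell=0.3) @ w * dz   # f_1(x) approximates the integral of T_1(x - z) w(z)
f_2 = T(x, z, sigma=0.8, ell=0.8) @ w * dz   # f_2 shares w and is therefore dependent on f_1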
@@ -101,7 +100,7 @@ The final model is then given by
 \end{align}
 where $f_d$ is equal to the CP model above and $a_d$ and $g_d$ are the respective alignment and warping functions and $\mat{\epsilon_d} \sim \Gaussian{0, \sigma_{y, d}^2\Eye}$ is a noise term.
 This encodes the generative process described above:
-For every turbine $\rv{y_d}$, observations at times $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the latent wind front in $f_d$, imposing turbine-specific components $g_d$ and adding noise through $\rv{\epsilon_d}$.
+For every turbine $\rv{y_d}$, observations at positions $\rv{X}$ are generated by first aligning to the latent wind fronts using $a_d$, applying the latent wind front in $f_d$, imposing turbine-specific components $g_d$ and adding noise through $\rv{\epsilon_d}$.

 Because we assume independence between $a_d$ and $g_d$ across outputs, we use GP priors of the form $a_d \sim \GP(\id, k_{a, d})$ and $g_d \sim \GP(\id, k_{g, d})$.
 By setting the prior mean to the identity function $\id(x) = x$, we use the standard CP model as our default assumption.
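The generative hierarchy $\rv{y_d} = g_d(f_d(a_d(\rv{X}))) + \rv{\epsilon_d}$ described here can also be sketched directly. In the following toy sketch, the fixed nonlinearities merely stand in for draws from the identity-mean GP priors on $a_d$ and $g_d$; all values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(0.0, 10.0, 200)

def a_2(x):                         # alignment for output 2; a_1 is the identity
    return x + 0.5 * np.sin(0.4 * x)

def f(x):                           # stand-in for the shared (CP) layer
    return np.exp(-0.1 * x) * np.sin(2.0 * x)

def g_2(v):                         # output warping for output 2; g_1 is the identity
    return np.tanh(v)

sigma_y = 0.05                      # observation noise standard deviation
y_1 = f(X) + sigma_y * rng.standard_normal(X.size)
y_2 = g_2(f(a_2(X))) + sigma_y * rng.standard_normal(X.size)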
@@ -122,7 +121,7 @@ With $\Set{\sigma_{d,r}, \mat{\ell_{d, r}}}$ denoting the set of kernel hyper pa
     \MoveEqLeft[1] \Moment{\cov}{f_d(\mat{x}), f_{d^\prime}(\mat{x^\prime})} = \sum_{r=1}^R \frac{(2\pi)^{\frac{K}{2}}\sigma_{d, r}\sigma_{d^\prime, r}}{\prod_{k=1}^K \hat{\ell}_{d, d^\prime, r, k}\inv} \Fun*{\exp}{-\frac{1}{2} \sum_{k=1}^K \frac{(x_k - x^\prime_k)^2}{\hat{\ell}_{d, d^\prime, r, k}^2}},
 \end{split}
 \end{align}
-with $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\ell_{d, r, k}^2 + \ell_{d^\prime, r, k}^2}$.
+where $\mat{x}$ is $K$-dimensional and $\hat{\ell}_{d, d^\prime, r, k} = \sqrt{\ell_{d, r, k}^2 + \ell_{d^\prime, r, k}^2}$.

 \begin{figure}[t]
     \centering
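The closed-form cross-covariance above translates directly into code. A numpy sketch evaluating it for a pair of outputs $d, d^\prime$ (argument names and hyperparameter shapes are assumptions for illustration, not the paper's implementation):

import numpy as np

def cross_cov(x, xp, sigma, ell, d, dp):
    # Dependent-RBF covariance cov(f_d(x), f_d'(x')) following the displayed formula.
    # x, xp: points in R^K; sigma: array of shape (D, R); ell: array of shape (D, R, K).
    K = x.size
    total = 0.0
    for r in range(sigma.shape[1]):
        ell_hat = np.sqrt(ell[d, r] ** 2 + ell[dp, r] ** 2)   # combined length scales
        coeff = ((2.0 * np.pi) ** (K / 2) * sigma[d, r] * sigma[dp, r]
                 / np.prod(1.0 / ell_hat))                    # denominator is the product of inverse length scales
        total += coeff * np.exp(-0.5 * np.sum((x - xp) ** 2 / ell_hat ** 2))
    return total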
@@ -214,11 +213,13 @@ For notational simplicity, we only consider the case of one single latent proces
 Since the latent processes are independent, the results can easily be generalized to multiple processes.
 Then, $\psi_f$ is given by
 \begin{align}
+    \label{eq:psi0}
     \psi_f = \Moment*{\E_{\Variat{\rv{a}}}}{\Fun*{\tr}{\mat{K_{ff}}}} = \sum_{n=1}^N \hat{\sigma}_{nn}^2.
 \end{align}
 Similar to the notation $\Fun{\hat{f}}{\cdot}$, we use the notation $\hat{\sigma}_{nn^\prime}$ to mean the variance term associated with the covariance function $\Moment{\cov}{\Fun{\hat{f}}{\mat{a_n}}, \Fun{\hat{f}}{\mat{a_{n^\prime}}}}$ as shown in \cref{eq:dependent_kernel}.
 The expectation $\mat{\Psi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{fu}}}$ connecting the alignments and the pseudo inputs is given by
 \begin{align}
+    \label{eq:psi1}
 \begin{split}
     \left( \mat{\Psi_f} \right)_{ni} &= \hat{\sigma}_{ni}^2 \sqrt{\frac{(\mat{\Sigma_a})_{nn}\inv}{\hat{\ell}_{ni} + (\mat{\Sigma_a})_{nn}\inv}}
     \exp\left(-\frac{1}{2} \frac{(\mat{\Sigma_a})_{nn}\inv\hat{\ell}_{ni}}{(\mat{\Sigma_a})_{nn}\inv + \hat{\ell}_{ni}} \left((\mat{\mu_a})_n - \mat{u_i}\right)^2\right),
@@ -227,6 +228,7 @@ The expectation $\mat{\Psi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{fu}}}$ co
 where $\hat{\ell}_{ni}$ is the combined length scale corresponding to the same kernel as $\hat{\sigma}_{ni}$.
 Lastly, $\mat{\Phi_f} = \Moment*{\E_{\Variat{\rv{a}}}}{\mat{K_{uf}}\mat{K_{fu}}}$ connects alignments and pairs of pseudo inputs with the closed form
 \begin{align}
+    \label{eq:psi2}
 \begin{split}
     \left( \mat{\Phi_f} \right)_{ij} &= \sum_{n=1}^N \hat{\sigma}_{ni}^2 \hat{\sigma}_{nj}^2 \sqrt{\frac{(\mat{\Sigma_a})_{nn}\inv}{\hat{\ell}_{ni} + \hat{\ell}_{nj} + (\mat{\Sigma_a})_{nn}\inv}}
     \exp\left( -\frac{1}{2} \vphantom{\left(\frac{\hat{\ell}}{\hat{\ell}}\right)^2} \frac{\hat{\ell}_{ni}\hat{\ell}_{nj}}{\hat{\ell}_{ni} + \hat{\ell}_{nj}} (\mat{u_i} - \mat{u_j})^2 \right. \\
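For the one-dimensional, single-latent-process case considered here, the first two $\Psi$-statistics can be evaluated directly from the newly labelled closed forms. A numpy sketch under a Gaussian variational posterior on the alignments ($\mat{\Phi_f}$, whose expression continues beyond this hunk, is analogous; the array names are assumptions, not the paper's code):

import numpy as np

def psi_statistics(mu_a, var_a, u, sig_hat_nn, sig_hat, ell_hat):
    # psi_f and Psi_f under q(a) = N(mu_a, diag(var_a)).
    # mu_a, var_a: shape (N,); u: shape (M,) pseudo inputs;
    # sig_hat_nn: shape (N,); sig_hat, ell_hat: shape (N, M) pairwise variance/length-scale terms.
    psi0 = np.sum(sig_hat_nn ** 2)                          # eq:psi0
    prec = (1.0 / var_a)[:, None]                           # (Sigma_a)_nn^{-1}
    Psi1 = (sig_hat ** 2
            * np.sqrt(prec / (ell_hat + prec))
            * np.exp(-0.5 * prec * ell_hat / (prec + ell_hat)
                     * (mu_a[:, None] - u[None, :]) ** 2))  # eq:psi1, shape (N, M)
    return psi0, Psi1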
@@ -246,7 +248,7 @@ The graphical model shown in \cref{fig:graphical_model_supervised} illustrates t
 This CP acts as an interface to share information between the different GPs which are otherwise conditionally independent.
 This modelling-choice introduces a new quality to the model when compared to standard deep GPs with multiple output dimensions, since the latter are not able in principle to learn dependencies between the different outputs.
 Compared to standard multi-output GPs, the AMO-GP introduces more flexibility with respect to the shared information.
-CPs as introduced by \textcite{boyle_dependent_2004} make strong assumptions about the relative alignments of the different outputs, that is, they assume constant time-offsets.
+CPs make strong assumptions about the relative alignments of the different outputs, that is, they assume constant time-offsets.
 The AMO-GP extends on this by introducing a principled Bayesian treatment of general nonlinear alignments $a_d$ on which we can place informative priors derived from the problem at hand.
 Together with the warping layers $g_d$, our model can learn to share knowledge in an informative latent space learnt from the data.

@@ -254,7 +256,7 @@ In order to derive an inference scheme, we need the ability to propagate uncerta
 We adapted the approach of nested variational compression by \textcite{hensman_nested_2014}, which is originally concerned with a single deep GP.
 The approximation is expanded to handle multiple GPs at once, yielding the bound in \cref{eq:full_bound}.
 The bound reflects the dependencies of the different outputs as the sharing of information between the different deep GPs is approximated through the shared inducing variables $\rv{u_{f,d}}$.
-Our main contribution for the inference scheme is the derivation of a closed-form solution for the $\Psi$-statistics of the convolution kernel.
+Our main contribution for the inference scheme is the derivation of a closed-form solution for the $\Psi$-statistics of the convolution kernel in \cref{eq:psi0,eq:psi1,eq:psi2}.


 \section{Experiments}
@@ -292,7 +294,7 @@ Our main contribution for the inference scheme is the derivation of a closed-for
         \centering
         \includestandalonewithpath{figures/toy_decomposition_ours}
         \caption{
-            AMO-GP with (dependent) RBF kernels.
+            AMO-GP (Ours) with (dependent) RBF kernels.
             \label{fig:toy_model_decomposition:d}
         }
     \end{subfigure}
@@ -309,7 +311,7 @@ In this setting, we observe multiple time series $\T_i = (\mat{X_i}, \mat{y_i})$

 We will first apply the AMO-GP to an artificial data set in which we define a decomposed system of dependent time series by specifying a shared latent function generating the observations together with relative alignments and warpings for the different time series.
 We will show that our model is able to recover this decomposition from the training data and compare the results to other approaches of modeling the data.
-In the next section we focus on a real world data set of a neighbouring pair of wind turbines in a wind farm, where the model is able to recover a representation of the latent prevailing wind condition and the relative timings of wind fronts hitting the two turbines.
+Then we focus on a real-world data set of a neighbouring pair of wind turbines in a wind farm, where the model is able to recover a representation of the latent prevailing wind condition and the relative timings of wind fronts hitting the two turbines.

 \subsection{Artificial data set}
 \label{subsec:artificial_example}
@@ -334,7 +336,7 @@ Otherwise, the model could easily get stuck in local minima like choosing the up
 Additionally, our assumption of identity mean functions prevents pathological cases in which the complete model collapses to a constant function.

 \Cref{fig:toy_model_decomposition:d} shows the AMO-GP's recovered function decomposition and joint predictions.
-The model succesfully recovered a shared latent dampened sine function, a sigmoid warping for the first time series and an approximate quadratic alignment function for the second time series.
+The model successfully recovered a shared latent damped sine function, a sigmoid warping for the first time series and an approximate quadratic alignment function for the second time series.
 In \cref{fig:toy_model_decomposition:a,fig:toy_model_decomposition:b,fig:toy_model_decomposition:c}, we show the training results of a standard GP, a multi-output GP and a three-layer deep GP on the same data.
 For all of these models, we used RBF kernels and, in the case of the deep GP, applied priors similar to our model in order to avoid pathological cases.
 In \cref{tab:toy_model_log_likelihoods} we report test log-likelihoods for the presented models, which illustrate the qualitative differences between the models.
@@ -358,7 +360,6 @@ Because of this, the model cannot recover the latent sine function and can only
         The joint posterior for two time series $\rv{y_1}$ and $\rv{y_2}$ of power production for a pair of wind turbines.
         The training data and dashed missing data are shown in grey.
         The AMO-GP recovers an uncertain relative alignment of the two time series.
-        High uncertainty about the alignment is placed in areas where multiple explanations are plausible.
     }
 \end{figure}
 \begin{figure}[t]
@@ -382,9 +383,8 @@
     \end{subfigure}
     \caption{
         \label{fig:wind_samples}
-        A comparison of noiseless samples drawn from a MO-GP and the AMO-GP, which achieve comparable test log-likelihoods.
+        A comparison of noiseless samples drawn from a MO-GP and the AMO-GP.
         The MO-GP smoothens the data and models local variation as noise, while our model picks up on local structure and predicts the maximum in the center.
-        Samples drawn from our model show that it is certain about the maximum but uncertain about the timing due to multiple plausible alignments.
     }
 \end{figure}
 \begin{table}[t]
@@ -394,7 +394,7 @@
         Test-log-likelihoods for the models presented in \cref{sec:experiments}.
     }
     \newcolumntype{Y}{>{\centering\arraybackslash}X}
-    \begin{tabularx}{.95\linewidth}{rrYYYY}
+    \begin{tabularx}{\linewidth}{rrYYYY}
         \toprule
         Experiment & Test set & GP & MO-GP & DGP & AMO-GP (Ours) \\
         \midrule
@@ -422,11 +422,10 @@ Modelling the turbine-specific parts of the signals is not the objective, so the
 We use a squared exponential kernel as a prior for the alignment functions $\rv{a_i}$ and as smoothing kernels in $\rv{f}$.
 For the given data set we can assume the output warpings $\rv{g_i}$ to be linear functions because there is only one dimension, the power generation, which in this data set is of similar shape for both turbines.
 Again we encode a preference for alignments with slow dynamics with a prior on the length scales of $\rv{a_i}$.
-
 As the signal has turbine-specific autoregressive components, plausible alignments are not unique.
 To constrain the AMO-GP, we want it to prefer alignments close to the identity function which we chose as a prior mean function.
-In non-hierarchical models, this can be achieved by placing a prior on the kernel's variance preferring smaller $\sigma_a^2$.

+% In non-hierarchical models, this can be achieved by placing a prior on the kernel's variance preferring smaller $\sigma_a^2$.
 % In our case, the posterior space of the alignment process $\rv{a}$ is a latent space we have not placed any prior on.
 % The model can therefore choose the posterior distribution of $\rv{u_a}$ in a way to counteract the constrained scale of $\mat{K_{au}}$ and $\mat{K_{uu}}\inv$ in \cref{eq:augmented_joint}\todo{This equation does not exist anymore} and thereby circumvent the prior.
 % To prevent this, we also place a prior on the mean of $\rv{u_a}$ to remove this degree of freedom.
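The length-scale preference described in this hunk could, for instance, be encoded as a hyperparameter prior. A hedged sketch assuming a GPflow 2-style API; the specific distribution and its parameters are illustrative assumptions, not taken from the paper:

import gpflow
import tensorflow_probability as tfp

# Squared exponential prior kernel for the alignment a_i.
kern_a = gpflow.kernels.SquaredExponential()

# Favour long length scales, i.e. slowly varying alignments.
kern_a.lengthscales.prior = tfp.distributions.Gamma(
    concentration=gpflow.utilities.to_default_float(5.0),
    rate=gpflow.utilities.to_default_float(1.0),
)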
@@ -434,7 +433,6 @@ In non-hierarchical models, this can be achieved by placing a prior on the kerne
 \Cref{fig:wind_joint_model} shows the joint model learned from the data in which $a_1$ is chosen to be the identity function.
 The possible alignments identified match the physical conditions of the wind farm.
 For the given turbines, time offsets of up to six minutes are plausible and for most wind conditions, the offset is expected to be close to zero.
-In order to calculate the posteriors, the uncertainty about the correct alignment is only propagated through the shared function for $\T_2$, leading to smoother predictions.
 For areas where the alignment is quite certain however, the two time series are explained with comparable detail.
 The model is able to recover unambiguous associations well and successfully places high uncertainty on the alignment in areas where multiple explanations are plausible due to the noisy signal.

@@ -443,19 +441,21 @@ This uncertainty is propagated through the shared function and results in higher
 Because of the factorization in the model however, we can recover the uncertainties about the alignment and the shared latent function separately.
 \Cref{fig:wind_samples} compares samples drawn from our model with samples drawn from a standard MO-GP.
 The MO-GP does not handle short-term dynamics and smooths the signal enough such that the nonlinear alignment can be approximated as constant.
-This can lead to comparable test-log-likelihoods, but since short-term dynamics are explained as noise, the model cannot be used for predictions in this time scale.
 AMO-GP shows richer structure:
-Samples show that it has learned that a maximum which is missing from the training data has to exist somewhere, but the uncertainty about the correct alignment due to the local turbulence means that different samples place the maximum at different locations in $\mat{X}$-direction.
+Samples show that it has learned that a maximum which is missing in the training data has to exist somewhere, but the uncertainty about the correct alignment due to the local turbulence means that different samples place the maximum at different locations in $\mat{X}$-direction.


 \section{Conclusion}
 \label{sec:conclusion}
 \todo{Reformulate this}
-We have presented how Gaussian processes with multiple outputs can be embedded in a hierarchy to find shared structure in latent spaces.
+We have proposed the warped and aligned multi-output Gaussian process (AMO-GP), in which MO-GPs are embedded in a hierarchy to find shared structure in latent spaces.
 We extended convolution processes \parencite{boyle_dependent_2004} with conditionally independent Gaussian processes on both the input and output sides, giving rise to a highly structured deep GP model.
 This structure can be used to both regularize the model and encode expert knowledge about specific parts of the system.
 By applying nested variational compression \parencite{hensman_nested_2014} to inference in these models, we derived a variational lower bound which combines Bayesian treatment of all parts of the model with scalability via stochastic optimization.

+We compared the model with deep GPs and multi-output GPs on an artificial data set and showed how the richer model structure allows the AMO-GP to pick up on latent structure the other approaches cannot model.
+We then applied the AMO-GP to real-world data of two wind turbines and showed how the proposed hierarchy can be used to encode expert knowledge, recover human-interpretable structure from the data and yield an informative model with uncertainties decomposed along the hierarchy.
+

 \nocite{*}
 \printbibliography

+32 −0  zotero_export.bib

@@ -212,4 +212,36 @@
 	author = {Journel, Andre G. and Huijbregts, Ch J.},
 	year = {1978},
 	file = {Snapshot:C\:\\Users\\z003jvyc\\Zotero\\storage\\T8DUJGCJ\\27687.pdf:application/pdf}
+}
+
+@article{soleimanzadeh_controller_2011,
+	title = {Controller design for a wind farm, considering both power and load aspects},
+	volume = {21},
+	number = {4},
+	journal = {Mechatronics},
+	author = {Soleimanzadeh, Maryam and Wisniewski, Rafael},
+	year = {2011},
+	pages = {720--727},
+	file = {Snapshot:C\:\\Users\\z003jvyc\\Zotero\\storage\\78N9YN8C\\S0957415811000328.html:text/html}
+}
+
+@inproceedings{bitar_coordinated_2013,
+	title = {Coordinated control of a wind turbine array for power maximization},
+	booktitle = {American {Control} {Conference} ({ACC}), 2013},
+	publisher = {IEEE},
+	author = {Bitar, Eilyan and Seiler, Pete},
+	year = {2013},
+	pages = {2898--2904},
+	file = {Fulltext:C\:\\Users\\z003jvyc\\Zotero\\storage\\TQRZDYMQ\\coordinated-control-of-a-wind-turbine-array-for-power-maximizatio.html:text/html;Snapshot:C\:\\Users\\z003jvyc\\Zotero\\storage\\N8MHP7CD\\auD.html:text/html}
+}
+
+@inproceedings{schepers_improved_2007,
+	title = {Improved modelling of wake aerodynamics and assessment of new farm control strategies},
+	volume = {75},
+	booktitle = {Journal of {Physics}: {Conference} {Series}},
+	publisher = {IOP Publishing},
+	author = {Schepers, J. G. and Van der Pijl, S. P.},
+	year = {2007},
+	pages = {012039},
+	file = {Snapshot:C\:\\Users\\z003jvyc\\Zotero\\storage\\AUAMVYLD\\auD.html:text/html}
+}
