Start formulating policy search chapter

Markus Kaiser, 4 years ago (commit cd2ca8f892, parent 6c1cb2b07c)

\section{The Wet-Chicken Benchmark}
In this work, we consider the Wet-Chicken benchmark \cite{tresp_wet_1994,hans_efficient_2009}, a challenging problem for model-based reinforcement learning with bimodal and heteroskedastic dynamics.
Here, we focus on the two-dimensional variant with continuous state and action spaces.
In the Wet-Chicken problem, a canoeist is paddling in a two-dimensional river.
The canoeist's position at time $t$ is given by $\mat{s}_t = (x_t, y_t) \in \R^2$ with the initial position $(0, 0)$.
This position is non-negative and bounded by the river's length $l = 5$ and the river's width $w = 5$.
There is a waterfall at the end of the river at $x = l$.
The canoeist wants to get as close to the waterfall as possible to maximize the reward $\Fun*{r}{\mat{s}_t} = r_t = x_t$.
However, if the canoeist falls down the waterfall, they have to start over at the initial position.
The transition dynamics in this system are governed by the flow of water in the river.
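The qualitative behaviour described above can be sketched in a few lines. This is a minimal illustrative sketch, not the exact benchmark equations: the `drift` and `turbulence` expressions below are placeholder assumptions chosen only to mimic position-dependent flow and heteroskedastic noise.

```python
import numpy as np

L, W = 5.0, 5.0  # river length and width

def step(s, a, rng):
    """One Wet-Chicken-style transition. s = (x, y), action a = (ax, ay).

    The drift and turbulence terms are illustrative placeholders, not the
    exact benchmark dynamics.
    """
    x, y = s
    drift = 3.0 * y / W           # assumption: flow grows towards one bank
    turbulence = 3.5 - drift      # assumption: more turbulence where flow is weak
    x_new = x + a[0] + drift + turbulence * rng.uniform(-1.0, 1.0)
    y_new = np.clip(y + a[1], 0.0, W)
    if x_new >= L:                # fell down the waterfall: restart at (0, 0)
        return np.array([0.0, 0.0])
    return np.array([max(x_new, 0.0), y_new])

def reward(s):
    return s[0]                   # r(s_t) = x_t
```

The reset branch makes the dynamics bimodal close to the waterfall: from the same state, the canoeist either advances or is returned to the origin.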
Generative process for $J^\pi$.
Transition model based on MDGP.
Graphical models.
We consider batch reinforcement learning problems with continuous state and action spaces~\cite{lange_batch_2012,sutton_reinforcement_1998}.
In batch RL, we are presented with a batch of state transitions $\D = \Set{(\mat{s}_n, \mat{a}_n, \mat{s}_n^\prime)}_{n=1}^N$ consisting of triples of state, action and following state.
Additionally, we are given a reward function $r : \mat{s} \to \R$ which specifies the task to be solved.
For notational simplicity, we assume that the reward depends only on the current state rather than on the full triple of previous state, action and current state.
An extension of our method to this more general setup is straightforward.
The task is to obtain a policy $\pi : \mat{s} \mapsto \mat{a}$ maximizing the reward obtained in the original system.
More specifically, we are interested in finding a policy specified by the parameters $\mat{\theta}_\pi$ which maximizes the discounted return
\begin{equation}
\label{eq:discounted_return}
\Fun*{J^\pi}{\mat{\theta}_\pi} = \sum_{t=0}^T \gamma^t \Fun*{r}{\mat{s}_t} = \sum_{t=0}^T \gamma^t r_t
\end{equation}
obtained by applying the policy to the original system starting from an initial state $\mat{s}_0$, thereby generating a trajectory of states $\mat{s}_0, \ldots, \mat{s}_T$, where the action $\mat{a}_t = \Fun*{\pi}{\mat{s}_t}$ is applied at every time step $t$.
The next state is generated using the (latent) transition function $f$, yielding $\mat{s}_{t+1} = \Fun*{f}{\mat{s}_t, \mat{a}_t}$.
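The rollout just described can be sketched as follows, where `pi`, `f` and `r` stand in for the policy, the (latent) transition function and the reward function; all three are passed in as plain callables for illustration.

```python
def rollout(s0, pi, f, r, T):
    """Generate s_0..s_T via a_t = pi(s_t), s_{t+1} = f(s_t, a_t); return the rewards."""
    s, rewards = s0, [r(s0)]
    for _ in range(T):
        s = f(s, pi(s))       # apply the policy's action, then the transition function
        rewards.append(r(s))
    return rewards
```

With a toy deterministic system `f(s, a) = s + a`, constant policy `pi(s) = 1` and reward `r(s) = s`, a rollout of length `T = 2` yields the rewards `[0, 1, 2]`.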
Many environments have stochastic elements, such as the random drift in the Wet-Chicken benchmark from \cref{sec:wetchicken}.
We take this stochasticity into account by interpreting the problem from a Bayesian perspective where \cref{eq:discounted_return} specifies a generative model whose graphical model is shown in \cref{fig:graphical_model:rl}.
Because of the Markov property assumed in reinforcement learning, conditional independences between the state transitions yield a recursive definition of the state probabilities given by
\begin{align}
\Prob{\mat{s}_{t+1} \given \mat{\theta}_\pi} &= \int \Prob{\Fun{f}{\mat{s}_t, \mat{a}_t} \given \mat{s}_t, \mat{a}_t} \Prob{\mat{a}_t \given \mat{s}_t, \mat{\theta}_\pi} \Prob{\mat{s}_t} \diff \mat{a}_t \diff \mat{s}_t, \\
\Prob{r_t \given \mat{\theta}_\pi} &= \int \Prob{\Fun*{r}{\mat{s}_t} \given \mat{s}_t} \Prob{\mat{s}_t \given \mat{\theta}_\pi} \diff \mat{s}_t.
\end{align}
The discounted return is now itself uncertain and the goal can be reformulated as maximizing the expected return $\Moment*{\E}{\Fun*{J^\pi}{\mat{\theta}_\pi}} = \sum_{t=0}^T \gamma^t \Moment*{\E_{\Prob{\mat{s}_t \given \mat{\theta}_\pi}}}{r_t}$.
In the following, we will apply a model-based approach to reinforcement learning, where we first train a probabilistic model of the transition function $f$ and then use this model for policy search.
In this context, the Bayesian approach has been shown to reduce model bias and increase data-efficiency by also explicitly taking model uncertainties into account \cite{deisenroth_pilco_2011}.
\todo[inline]{Fix with intro.}
\subsection{An Interpretable Transition Model}
We denote $\mat{\hat{s}}_t = \left( \mat{s}_t, \mat{a}_t \right)$ and collect $\mat{f}_t = \left( \mat{f}_t^{\pix{1}}, \dots, \mat{f}_t^{\pix{K}} \right)$.
\begin{align}
\Variat{\mat{s}^\prime_\ast \given \mat{\hat{s}}_\ast}
&= \int \sum_{k=1}^K \Variat{l_\ast^{\pix{k}} \given \mat{\hat{s}}_\ast} \Variat{\mat{s}_\ast^{\prime\pix{k}} \given \mat{\hat{s}}_\ast} \diff \mat{l}_\ast \\
&\approx \sum_{k=1}^K \tilde{l}_\ast^{\pix{k}} \mat{\tilde{s}}_\ast^{\prime\pix{k}}
\end{align}
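As an illustration of this weighted combination, here is a small numpy sketch that assumes the per-mode predictions and their assignment probabilities have already been computed and are available as arrays; the function names and shapes are assumptions for the example, not part of the model definition.

```python
import numpy as np

def mixture_prediction(weights, mode_means):
    """Approximate predictive next state: sum_k l~^(k) * s'~^(k).

    weights:    (K,) assignment probabilities, summing to one.
    mode_means: (K, D) per-mode predicted next states.
    """
    weights = np.asarray(weights)
    mode_means = np.asarray(mode_means)
    assert np.isclose(weights.sum(), 1.0), "assignment probabilities must sum to 1"
    return weights @ mode_means   # (D,) convex combination of the K modes
```

For instance, two modes `[0, 0]` and `[4, 2]` with weights `0.25` and `0.75` combine to the prediction `[3.0, 1.5]`.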
\subsection{Policy Improvement}
\subsection{Policy Learning}
\begin{align}
\Moment*{\E}{\Fun*{J^\pi}{\mat{\theta}_\pi}}
&= \sum_{t=0}^T \gamma^t \Moment*{\E_{\Prob{\mat{s}_t \given \mat{\theta}_\pi}}}{r_t} \\
&\approx \sum_{t=0}^T \gamma^t \Moment*{\E_{\Variat{\mat{s}_t \given \mat{\theta}_\pi}}}{r_t} \\
&= \int \sum_{t=0}^T \Bigg[ \gamma^t \Moment*{\E_{\Variat{\mat{s}_t \given \mat{\theta}_\pi}}}{r_t} \Bigg] \Prob{\mat{s}_0} \prod_{t=0}^{T-1} \Variat{\mat{s}_{t+1} \given \mat{s}_t, \mat{\theta}_\pi} \diff \mat{s}_0 \dots \diff \mat{s}_T \\
&\approx \frac{1}{P} \sum_{p=1}^P \sum_{t=0}^T \gamma^t r_t^p
\end{align}
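The final Monte Carlo approximation can be sketched as follows; `sample_f` is a hypothetical stand-in for drawing a successor state from the (variational) transition model, and the remaining callables mirror the rollout setup above.

```python
def expected_return_mc(s0, pi, sample_f, r, T, gamma, P, rng):
    """Estimate E[J] as (1/P) * sum_p sum_t gamma^t r_t^p over P sampled trajectories."""
    total = 0.0
    for _ in range(P):
        s = s0
        for t in range(T + 1):
            total += gamma**t * r(s)
            s = sample_f(s, pi(s), rng)   # draw s_{t+1} from the transition model
    return total / P
```

In a deterministic toy system (identity transitions, constant reward `1`, `gamma = 0.5`, `T = 2`) every sampled trajectory has return `1.75`, so the estimate is exact regardless of `P`.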


urldate = {2018-10-02}
author = {Deisenroth, Marc Peter and Fox, Dieter and Rasmussen, Carl Edward},
date = {2015},
journaltitle = {Pattern Analysis and Machine Intelligence, IEEE Transactions on},
number = {2},
pages = {408--423},
title = {Gaussian Processes for Data-Efficient Learning in Robotics and Control},
urldate = {2016-02-01},
volume = {37}
author = {Deisenroth, Marc and Rasmussen, Carl E.},
booktitle = {Proceedings of the 28th {{International Conference}} on Machine Learning ({{ICML}}-11)},
urldate = {2018-11-09}
abstract = {The reparameterization trick enables optimizing large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low variance unbiased estimators of the gradients of the expected loss. While many continuous random variables have such reparameterizations, discrete random variables lack useful reparameterizations due to the discontinuous nature of discrete states. In this work we introduce Concrete random variables---continuous relaxations of discrete random variables. The Concrete distribution is a new family of distributions with closed form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, Concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-probability of latent stochastic nodes) on the corresponding discrete graph. We demonstrate the effectiveness of Concrete relaxations on density estimation and structured prediction tasks using neural networks.},
archivePrefix = {arXiv},
author = {Maddison, Chris J. and Mnih, Andriy and Teh, Yee Whye},
date = {2016-11-02},
eprint = {1611.00712},
eprinttype = {arxiv},
keywords = {Computer Science - Machine Learning,Statistics - Machine Learning},
primaryClass = {cs, stat},
shorttitle = {The {{Concrete Distribution}}},
title = {The {{Concrete Distribution}}: {{A Continuous Relaxation}} of {{Discrete Random Variables}}},
urldate = {2018-09-12}
author = {Lange, Sascha and Gabel, Thomas and Riedmiller, Martin},
booktitle = {Reinforcement Learning},
date = {2012},
pages = {45--73},
publisher = {{Springer}},
title = {Batch Reinforcement Learning}
urldate = {2018-10-02}
author = {Sutton, Richard S. and Barto, Andrew G.},
date = {1998},
isbn = {978-0-262-19398-6},