\input{preamble/packages.tex}
\input{preamble/abbreviations.tex}
\input{figures/preamble/tikz_colors.tex}
% We use precompiled images and do not add tikz for speed of compilation.
\newcommand{\includestandalonewithpath}[2][]{%
\begingroup%
\StrCount{#2}{/}[\matches]%
\StrBefore[\matches]{#2}{/}[\figurepath]%
\includestandalone[#1]{#2}%
\endgroup%
}
\addbibresource{zotero_export.bib}
\voffset 0 cm \hoffset 0 cm \addtolength{\textwidth}{0cm}
\addtolength{\textheight}{0cm}\addtolength{\leftmargin}{0cm}
\begin{document}
\title{Interpretable Dynamics Models for Data-Efficient Reinforcement Learning}
\author{Markus Kaiser$^{1,2}$%
\thanks{The project this report is based on was supported with funds from the German Federal Ministry of Education and Research under project number 01\,IS\,18049\,A.}
\ and Clemens Otte$^1$
and Thomas Runkler$^{1,2}$
and Carl Henrik Ek$^3$
%
\vspace{.3cm}\\
%
1- Siemens AG, Germany.\quad 2- Technical University of Munich, Germany.\\
3- University of Bristol, United Kingdom.
}
\maketitle
\begin{abstract}
In this paper, we present a Bayesian view on model-based reinforcement learning.
We use expert knowledge to impose structure on the transition model and present an efficient learning scheme based on variational inference.
This scheme is applied to a heteroscedastic and bimodal benchmark problem on which we compare our results to NFQ and show how our approach yields human-interpretable insight about the underlying dynamics while also increasing data-efficiency.
\end{abstract}
\section{Introduction}
\label{sec:introduction}
In reinforcement learning (RL)~\cite{sutton_reinforcement_1998}, an agent's task is to learn a policy $\pi$ which, given the current state $\mat{s}$ of an environment, chooses an action $\mat{a}$ to achieve the goal specified by a reward function $r$ mapping states to numerical rewards.
The next state $\mat{s}^\prime = \Fun*{f}{\mat{s}, \mat{a}}$ is determined by the latent and possibly stochastic transition function $f$.
We consider batch RL problems~\cite{lange_batch_2012}, where we are presented with a set of state transitions $\D = \Set{(\mat{s}_n, \mat{a}_n, \mat{s}_n^\prime)}_{n=1}^N$ and are unable to interact with the original system to find a policy.
This setup is common in industrial applications of RL, where deploying an untrusted policy can lead to safety issues.
Similarly, gathering exploration data can be costly, calling for data-efficient methods.
\Textcite{deisenroth_pilco_2011} showed how data-efficiency in model-based RL can be increased with probabilistic models for the transition dynamics $f$.
They provide a principled way of taking model uncertainty into account when evaluating the performance of a policy, thereby reducing the impact of model-bias.
A shortcoming of this approach is that the inference scheme restricts the modelling choices:
transition dynamics are modelled as standard Gaussian processes (GPs), and policies and rewards must be of specific forms.
In this work, we extend this approach by allowing and imposing additional structure.
In many environments, experts can describe abstract properties of the system even if no closed form models are available.
Incorporating this knowledge facilitates learning and allows us to precisely state what we want to learn from data.
The paper is organized as follows.
After introducing the heteroscedastic and bimodal Wet-Chicken benchmark, we show how high-level knowledge about this system can be used to impose Bayesian structure.
We derive an efficient inference scheme for both the dynamics model and for probabilistic policy search based on variational inference.
We show that this approach yields interpretable models and policies and is significantly more data-efficient than less interpretable alternatives.
\section{The Wet-Chicken Benchmark}
\label{sec:wetchicken}
In the Wet-Chicken problem~\cite{hans_efficient_2009}, a canoeist is paddling in a two-dimensional river.
The canoeist's position at time $t$ is given by $\mat{s}_t = (x_t, y_t)$, where $x_t$ denotes the position along the river and $y_t$ the position across it.
The river is bounded by its length $l = 5$ and width $w = 5$.
There is a waterfall at the end of the river at $x = l$.
The canoeist wants to get close to the waterfall to maximize the reward $\Fun*{r}{\mat{s}_t} = x_t$.
However, if the canoeist falls down the waterfall he has to start over at the initial position $(0, 0)$.
The river's flow consists of a deterministic velocity $v_t = y_t \cdot \sfrac{3}{w}$ and stochastic turbulence $b_t = 3.5 - v_t$, both of which depend on the position on the $y$-axis.
The higher $y_t$, the faster the river flows, but also the less turbulent it becomes.
The canoeist chooses his paddle direction and intensity via an action $\mat{a}_t = (a_{t,x}, a_{t,y}) \in [-1, 1]^2$.
The transition function $f : (\mat{s}_t, \mat{a}_t) \mapsto \mat{s}_{t+1} = (x_{t+1}, y_{t+1})$ is given by
\begin{align}
x_{t+1} &= \begin{cases}
0 & \text{if } \hat{x}_{t+1} > l \\
0 & \text{if } \hat{x}_{t+1} < 0 \\
\hat{x}_{t+1} & \text{otherwise}
\end{cases} &
y_{t+1} &= \begin{cases}
0 & \text{if } \hat{x}_{t+1} > l \text{ or } \hat{y}_{t+1} < 0 \\
w & \text{if } \hat{y}_{t+1} > w \\
\hat{y}_{t+1} & \text{otherwise}
\end{cases}
\end{align}
where $\hat{x}_{t+1} = x_t + (1.5 \cdot a_{t, x} - 0.5) + v_t + b_t \cdot \tau_t$ and $\hat{y}_{t+1} = y_t + a_{t, y}$ and $\tau_t \sim \Fun*{\Uni}{-1, 1}$ is a uniform random variable that represents the turbulence.
There is almost no turbulence at $y = w$, but the velocity is too high to paddle back.
Similarly, the velocity is zero at $y = 0$, but the canoeist can fall down the waterfall unpredictably due to the high turbulence.
A successful canoeist must find a trade-off between the stochasticity and uncontrollable velocities in the river to get as close to the waterfall as possible.
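For concreteness, these dynamics can be simulated directly.
The following Python sketch is our own minimal illustration of one transition step as defined above and is not part of any reference implementation of the benchmark.
\begin{verbatim}
import numpy as np

L, W = 5.0, 5.0  # river length and width

def wet_chicken_step(x, y, a_x, a_y):
    # One stochastic transition (s_t, a_t) -> s_{t+1}.
    v = 3.0 * y / W                  # deterministic velocity
    b = 3.5 - v                      # turbulence amplitude
    tau = np.random.uniform(-1.0, 1.0)
    x_hat = x + (1.5 * a_x - 0.5) + v + b * tau
    y_hat = y + a_y
    if x_hat > L:                    # fell down the waterfall
        return 0.0, 0.0
    return max(x_hat, 0.0), min(max(y_hat, 0.0), W)
\end{verbatim}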
% However, as the canoeist moves closer to the waterfall, the distribution over the next states become increasingly more bi-modal as the probability of falling down increases.
% Together with the heteroscedasticity introduced by the turbulence dependent on the current position, these properties make the Wet-Chicken problem especially difficult for model-based reinforcement learning problems.
\section{Probabilistic Policy Search}
\label{sec:probabilistic_policy_search}
\begin{figure}[t]
\centering
\begin{subfigure}[b]{.495\linewidth}
\centering
\includestandalone{figures/graphical_model_rl}
\end{subfigure}
\hfill
\begin{subfigure}[b]{.495\linewidth}
\centering
\includestandalone{figures/graphical_model_mdgp}
\end{subfigure}
\caption{
\label{fig:graphical_model}
The graphical models considered in this work, where violet nodes are observed, parameters are shown in yellow and variational parameters are blue.
The generative process for the return $J^\pi$ (left) shows how starting from $\mat{s}_0$, a trajectory of length $T$ is generated with the policy parameterized by $\mat{\theta}_\pi$.
The return is generated by the rewards which depend on their respective states only.
The transition model (right) separates the flow-behaviour of the river $\mat{f}_t$, the heteroscedastic noise process $\mat{\sigma}_t$ and the possibility of falling down $\mat{\lambda}_t$.
Latent variables $\mat{l}_t$ represent the belief that the $\nth{t}$ data point is a fall-down event.
}
\end{figure}
We are interested in finding a policy specified by the parameters $\mat{\theta}_\pi$ which maximizes the discounted return
$\Fun*{J^\pi}{\mat{\theta}_\pi} = \sum_{t=0}^T \gamma^t \Fun*{r}{\mat{s}_t} = \sum_{t=0}^T \gamma^t r_t$.
Starting from an initial state $\mat{s}_0$ we generate a trajectory of states $\mat{s}_0, \ldots, \mat{s}_T$ obtained by applying the action $\mat{a}_t = \Fun*{\pi}{\mat{s}_t}$ at every time step $t$.
The next state is generated using the (latent) transition function $f$, yielding $\mat{s}_{t+1} = \Fun*{f}{\mat{s}_t, \mat{a}_t}$.
Many environments have stochastic elements, such as the random drift in the Wet-Chicken benchmark from \cref{sec:wetchicken}.
We take this stochasticity into account by interpreting the problem from a Bayesian perspective where the discounted return specifies a generative model whose graphical model is shown in \cref{fig:graphical_model}.
Because of the Markov property assumed in RL, conditional independences between the states yield a recursive definition of the state probabilities given by
\begin{align}
\begin{split}
\Prob{\mat{s}_{t+1} \given f, \mat{\theta}_\pi} &= \int \Prob{\Fun{f}{\mat{s}_t, \mat{a}_t} \given \mat{s}_t, \mat{a}_t} \Prob{\mat{a}_t \given \mat{s}_t, \mat{\theta}_\pi} \Prob{\mat{s}_t} \diff \mat{a}_t \diff \mat{s}_t, \\
\Prob{r_t \given \mat{\theta}_\pi} &= \int \Prob{\Fun*{r}{\mat{s}_t} \given \mat{s}_t} \Prob{\mat{s}_t \given \mat{\theta}_\pi} \diff \mat{s}_t.
\end{split}
\end{align}
With stochasticity or an uncertain transition model, the discounted return becomes uncertain and the goal can be reformulated to optimizing the expected return $\Moment*{\E}{\Fun*{J^\pi}{\mat{\theta}_\pi}} = \sum_{t=0}^T \gamma^t \Moment*{\E_{\Prob{\mat{s}_t \given \mat{\theta}_\pi}}}{r_t}$.
A model-based policy search method consists of two key parts~\cite{deisenroth_pilco_2011}:
First, a dynamics model is learned from state transition data.
Second, this dynamics model is used to learn the parameters $\mat{\theta}_\pi$ of the policy $\pi$ which maximize the expected return $\Moment*{\E}{\Fun*{J^\pi}{\mat{\theta}_\pi}}$.
We discuss both steps in the following.
\subsection{An Interpretable Transition Model}
\label{sub:mdgp}
We formulate a probabilistic transition model based on high-level knowledge about the Wet-Chicken benchmark.
Importantly, we do not formulate a specific parametric dynamics model as would be required to derive a controller.
Instead, we make assumptions on a level typically available from domain experts.
We encode that, given the pair of current state and action $\mat{\hat{s}}_t = \left( \mat{s}_t, \mat{a}_t \right)$, the next state $\mat{s}_{t+1}$ is generated by combining three components:
the deterministic flow-behaviour of the river $\mat{f}_t$, a heteroscedastic noise process $\mat{\sigma}_t$ and the possibility of falling down $\mat{\lambda}_t$.
This prior imposes structure which allows us to explicitly state what we want to learn from the data and where we do not assume prior knowledge:
How does the river flow?
What kind of turbulence exists?
When does the canoeist fall down?
How do the actions influence the system?
We formulate a graphical model in \cref{fig:graphical_model} using the data association with GPs (DAGP) model~\cite{kaiser_data_2018}, which allows us to handle the multi-modality introduced by falling down the waterfall.
We specify this separation via the marginal likelihood
\begin{align}
\begin{split}
\label{eq:true_marginal_likelihood}
&\Prob*{\mat{s}_{t+1} \given \mat{\hat{s}}_t} =
\int
\Prob*{\mat{s}_{t+1} \given \mat{\sigma}_t, \mat{f}_t, \mat{l}_t}
\Prob*{\mat{\sigma}_t \given \mat{\hat{s}}_t}
\Prob*{\mat{f}_t \given \mat{\hat{s}}_t}
\Prob*{\mat{l}_t \given \mat{\hat{s}}_t}
\diff \mat{\sigma}_t \diff \mat{l}_t \diff \mat{f}_t, \\
&\Prob*{\mat{s}_{t+1} \given \mat{\sigma}_t, \mat{f}_t, \mat{l}_t} =
\prod_{k=1}^K
\Gaussian*{\mat{s}_{t+1} \given \mat{f}_t^{\pix{k}}, \left(\mat{\sigma}_t^{\pix{k}}\right)^2}_{^{\displaystyle,}}^{\Fun{\Ind}{l_t^{\pix{k}} = 1}}
\end{split}
\end{align}
where $\mat{f}_t = \left( \mat{f}_t^{\pix{1}}, \dots, \mat{f}_t^{\pix{K}} \right)$ and
$\Prob*{\mat{l}_t \given \mat{\hat{s}}_t} = \int \Multinomial*{\mat{l}_t \given \Fun{\softmax}{\mat{\lambda}_t}} \Prob*{\mat{\lambda}_t \given \mat{\hat{s}}_t} \diff \mat{\lambda}_t$
with $\Multi$ denoting a multinomial distribution.
In our case, we use $K = 2$ modes, one for staying in the river and one for falling down the waterfall.
For every data point we infer a posterior belief $\Prob{\mat{l}_t}$ about which mode the data point belongs to, as we assume this separation cannot be predetermined using expert knowledge.
We place independent GP priors on the $\mat{f}^{\pix{k}}$, $\mat{\sigma}^{\pix{k}}$ and $\mat{\lambda}^{\pix{k}}$.
We approximate the exact posterior via a factorized variational distribution
$\Variat*{\mat{f}, \mat{\lambda}, \mat{\sigma}, \mat{U}} = \prod_{k=1}^K\prod_{t=1}^T \Variat{\mat{f}_t^{\pix{k}}, \mat{u}^{\pix{k}}} \Variat{\mat{\lambda}_t^{\pix{k}}, \mat{u_\lambda}^{\pix{k}}} \Variat{\mat{\sigma}_t^{\pix{k}}, \mat{u_\sigma}^{\pix{k}}}$
which introduces variational inducing inputs and outputs $\mat{U}$ as described in~\cite{hensman_scalable_2015,kaiser_data_2018}.
The variational parameters are optimized by maximizing a lower bound to the marginal likelihood which can be efficiently computed via sampling and enables stochastic optimization.
For details we refer to~\cite{kaiser_data_2018}.
We obtain an explicit representation of the GP posteriors during variational inference which allows us to efficiently propagate samples through the model to simulate trajectories used for policy search.
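To illustrate how such samples are generated, the following Python sketch draws one sample of $\mat{s}_{t+1}$ from the variational posterior by first sampling the mode assignment and then the Gaussian of the chosen mode.
The callables \texttt{sample\_f}, \texttt{sample\_sigma} and \texttt{sample\_lambda} are hypothetical stand-ins for samplers of the respective GP posteriors.
\begin{verbatim}
import numpy as np

def sample_next_state(s_hat, sample_f, sample_sigma, sample_lambda):
    # Posterior samples for all K modes at the input s_hat.
    f = sample_f(s_hat)          # shape (K, state_dim)
    sigma = sample_sigma(s_hat)  # shape (K, state_dim)
    lam = sample_lambda(s_hat)   # shape (K,)
    # Sample the assignment l_t ~ Multinomial(softmax(lambda_t)).
    probs = np.exp(lam - lam.max())
    probs /= probs.sum()
    k = np.random.choice(len(probs), p=probs)
    # Sample s_{t+1} from the Gaussian of the chosen mode.
    return np.random.normal(f[k], sigma[k])
\end{verbatim}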
\subsection{Policy Learning}
\label{sub:policy}
After training a transition model, we use the variational posterior $\Variat{\mat{s}_{t+1} \given \mat{\hat{s}}_t}$ to train a policy by sampling roll-outs and optimizing policy parameters via stochastic gradient descent on the expected return $\Moment*{\E}{\Fun*{J^\pi}{\mat{\theta}_\pi}}$.
The expected return is approximated using the variational posterior given by
\begin{align}
\begin{split}
\label{eq:policy_training}
\Moment*{\E}{\Fun*{J^\pi}{\mat{\theta}_\pi}}
&= \sum_{t=0}^T \gamma^t \Moment*{\E_{\Prob{\mat{s}_t \given \mat{\theta}_\pi}}}{r_t}
\approx \sum_{t=0}^T \gamma^t \Moment*{\E_{\Variat{\mat{s}_t \given \mat{\theta}_\pi}}}{r_t} \\
&= \int \Bigg[ \sum_{t=0}^T \gamma^t r_t \Bigg] \Prob{\mat{s}_0} \prod_{t=0}^{T-1} \Variat{\mat{s}_{t+1} \given \mat{s}_t, \mat{\theta}_\pi} \diff \mat{s}_0 \dots \diff \mat{s}_T \\
&\approx \frac{1}{P} \sum_{p=1}^P \sum_{t=0}^T \gamma^t r_t^p.
\end{split}
\end{align}
We expand the expectation to explicitly show the marginalization of the states in the trajectory.
Due to the Markovian property of the transition dynamics, the integral factorizes along $t$.
The integral is approximated by averaging over $P$ samples propagated through the model starting from a known distribution of initial states $\Prob{\mat{s}_0}$.
State transitions can efficiently be sampled from the variational posterior of the dynamics model by repeatedly taking independent samples of the different GPs.
The expected return in \cref{eq:policy_training} can be optimized using stochastic gradient descent via the gradients $\nabla_{\mat{\theta}_\pi} \Fun*{J^\pi}{\mat{\theta}_\pi} \approx \frac{1}{P} \sum_{p=1}^P \sum_{t=0}^T \gamma^t \nabla_{\mat{\theta}_\pi} r_t^p$ of the Monte Carlo approximation as they are an unbiased estimator of the true gradient.
The gradients of the samples can be obtained using automatic differentiation tools such as TensorFlow \cite{tensorflow2015-whitepaper}.
The $P$ roll-outs can trivially be parallelized.
Importantly, we only need a small number of Monte Carlo samples at every iteration, since we use the gradients of the samples directly.
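As an illustration of this procedure, the following TensorFlow sketch performs one gradient step on the Monte Carlo estimate in \cref{eq:policy_training}.
Here, \texttt{policy} and \texttt{transition\_sample} are hypothetical stand-ins for the neural policy and a reparameterized sampler of the learned dynamics posterior; the handling of the discrete mode assignment is omitted for brevity.
\begin{verbatim}
import tensorflow as tf

def policy_gradient_step(policy, transition_sample, s0,
                         optimizer, T=5, gamma=0.9, P=20):
    # One ascent step on the Monte Carlo estimate of E[J].
    with tf.GradientTape() as tape:
        total = 0.0
        for _ in range(P):                  # P independent roll-outs
            s = s0                          # state tensor (x, y)
            for t in range(T + 1):
                total += gamma ** t * s[0]  # reward r(s_t) = x_t
                if t < T:
                    s = transition_sample(s, policy(s))
        loss = -total / P                   # negate for gradient ascent
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return -loss
\end{verbatim}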
\section{Results}
\label{sec:results}
\begin{figure}[t]
\centering
\begin{subfigure}{.495\linewidth}
\centering
\includestandalone{figures/falldown_probabilities}
\caption{
\label{fig:wetchicken:falldown}
Fall-down probability $\mat{\lambda}$
}
\end{subfigure}
\hfill
\begin{subfigure}{.495\linewidth}
\centering
\includestandalone{figures/hetero_noise}
\caption{
\label{fig:wetchicken:hetero}
Heteroscedastic turbulence $\mat{\sigma}^{\pix{1}}$
}
\end{subfigure}
\\[.45\baselineskip]
\begin{subfigure}[b]{.495\linewidth}
\centering
\sisetup{
table-format=-1.2(2),
table-number-alignment=center,
separate-uncertainty,
% table-align-uncertainty,
table-figures-uncertainty=1,
detect-weight,
}
\newcolumntype{H}{>{\setbox0=\hbox\bgroup}c<{\egroup}@{}}
\footnotesize
\setlength{\tabcolsep}{1pt}
\begin{tabular}{cSSS}
\toprule
{N} & {NFQ} & {GP} & {DAGP} \\
\midrule
100 & 0.66 \pm 0.16 & \bfseries 1.41 \pm 0.01 & 1.18 \pm 0.09 \\
250 & 1.71 \pm 0.07 & 1.54 \pm 0.01 & \bfseries 2.33 \pm 0.01 \\
500 & 1.60 \pm 0.10 & 1.56 \pm 0.01 & \bfseries 2.25 \pm 0.01 \\
1000 & 1.99 \pm 0.06 & 2.13 \pm 0.01 & \bfseries 2.32 \pm 0.01 \\
2500 & \bfseries 2.26 \pm 0.02 & 1.91 \pm 0.01 & \bfseries 2.28 \pm 0.01 \\
5000 & \bfseries 2.33 \pm 0.01 & 1.91 \pm 0.01 & 2.28 \pm 0.01 \\
\bottomrule
\end{tabular}
\vspace{7ex}
\caption{
\label{fig:wetchicken:table}
Comparison of expected returns
}
\end{subfigure}
\begin{subfigure}[b]{.495\linewidth}
\centering
\includestandalone{figures/policy_quiver}
\caption{
\label{fig:wetchicken:policy}
A successful Wet-Chicken policy
}
\end{subfigure}
\caption{
\label{fig:wetchicken}
The separation of different aspects of the Wet-Chicken benchmark yields interpretable information about the probability to fall down the waterfall and the turbulence intensity.
Successful policies can be learned based on 250 observations, while about 2500 observations are needed for NFQ.
}
\end{figure}
To solve the Wet-Chicken problem, we first train the dynamics model on batch data sampled from the true dynamics.
The benchmark has two-dimensional state and action spaces from which we sample uniform random transitions with varying $N$ in the range \numrange{100}{5000}.
With $N \geq 250$, our model is able to identify the underlying dynamics.
\Cref{fig:wetchicken:falldown,fig:wetchicken:hetero} show how the model has successfully identified the probabilities of falling down the waterfall and the amplitude of turbulence, both with respect to the action $(0, 0)$.
The model yields easily separable posterior beliefs about the different aspects of the Wet-Chicken benchmark.
These beliefs can be discussed with domain experts to evaluate the training result.
Next, we train a neural policy.
We sample initial states from the training data, use a horizon of $T = 5$ steps and average over $P = 20$ samples with $\gamma = 0.9$.
We use a neural network with two layers of 20 ReLU-activated units each as our policy parametrization.
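A policy of this form could be built as follows; this is a sketch under our own assumptions, in particular the bounded \texttt{tanh} output layer used to keep actions in $[-1, 1]^2$ is not prescribed by the description above.
\begin{verbatim}
import tensorflow as tf

# Hypothetical construction of the policy network:
# two layers of 20 ReLU units, actions bounded to [-1, 1]^2.
policy = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),            # state (x, y)
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(2, activation="tanh"),
])
\end{verbatim}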
\Cref{fig:wetchicken:policy} shows an example policy with a trade-off between the unpredictability on the left and the uncontrollable speed on the right.
In Table~\ref{fig:wetchicken:table}, we show expected returns averaged over 10 experiments with standard errors.
Applying random actions yields a return of about \num{1.5}, while a return above \num{2.2} indicates that a proper trade-off has been found.
We compare our method to a standard GP as the dynamics model and to the model-free NFQ~\cite{riedmiller_neural_2005} trained for 20 full model learning and sampling iterations using a neural network with one 10-unit hidden layer with sigmoid activations.
The GP cannot model heteroscedastic noise or multi-modality.
It does not represent the dynamics well enough to derive a policy, illustrating the need for a more structured model.
Given enough data, NFQ is able to find successful policies.
However, our method requires about an order of magnitude less data, due to the high-level prior knowledge incorporated via the dynamics model.
\section{Conclusion}
\label{sec:conclusion}
In this paper, we demonstrated how expert knowledge can be incorporated in probabilistic policy search by imposing Bayesian structure on the learning problem.
We derived an efficient inference scheme and showed how our approach can solve the Wet-Chicken benchmark, yielding human-interpretable insights about the underlying dynamics and significantly increasing data efficiency.
\renewcommand*{\bibfont}{\footnotesize}
\printbibliography
\end{document}