+ 8
- 10

rebuttal/aistats_2019_author_response.md
@@ -1,21 +1,19 @@

# Author Feedback

We would like to thank the reviewers for their comments and suggestions for improvement. We will address their comments below.

We would like to thank the reviewers for their comments and suggestions. We will address them below.

## R1, R2 and R3: Modelling Choices

As described in the paper, we address the data association problem. Importantly, this problem is under-specified as any data can be explained using enough independent components. However, a large number of components naturally leads to uninformative models which do not recover the underlying generative process. This trade-off has to be addressed via assumptions on both the underlying functions and the associations, which in this Bayesian context is translated to the proper choice of priors. In the paper's introduction we discuss multiple approaches and their drawbacks, which usually stem from compromises necessary to facilitate inference. An example of such a drawback is that in [20], principled Bayesian inference of the associations is possible only if a constant Multinomial prior is used. This implies that the relative importance of different modes at prediction time is a global parameter, which is clearly limiting. Similar problems arise when using a (possibly truncated) Dirichlet process prior on the associations.

As described in the paper, the data association problem is under-specified as any data can be explained using enough modes. However, many components naturally lead to uninformative models. This trade-off has to be addressed via assumptions on both the underlying functions and the associations. In the introduction, we discuss multiple approaches and their drawbacks, which usually stem from compromises necessary to facilitate inference. An example of such a drawback can be seen in [20] where inference on the associations is only possible with a constant Multinomial prior. This makes the relative importance of different modes a global parameter, which is clearly limiting. Similar problems arise when using a Dirichlet process prior on the associations.

In our model, we seek to make broader and more informative assumptions. We assume that the relative importance of the different generating processes is dependent on the position in the input space, which is why we place GP priors on the $\alpha$ parameters. In contrast to mixtures of experts models, however, we do not assume that the associations are independent of $y$ and recognize that multiple modes can be relevant at the same position in the input space. This assumption translates to the separation of $\alpha$ and $a$ in our model. While the $a_n$ model the concrete associations of the training data (as in [20] or [5]), we simultaneously learn about the mode distributions in the input space. As a side-effect, these assumptions also avoid overfitting via the second expectation in (6). Critically, we do not marginalize the $\alpha$ GPs but infer a posterior belief which can be used at prediction time. This clear separation also allows practitioners to easily encode prior knowledge about the mode distributions via the priors over $\alpha$. For example, different mean functions for the different modes can be used to encode that one of the modes is expected to explain more of the data than the others. Note that both [20] and truncated Dirichlet processes are special cases of our model.

In our model, we seek to make assumptions from a broader class of priors, which allows us to make more informative assumptions while still providing a tractable inference scheme. By placing a spatial GP prior on the $\alpha$-parameters, we assume that the importance of the generating processes is dependent on the position in the input space. In contrast to mixtures of experts models, we do not assume that the associations are independent of $y$. This translates to the separation of $\alpha$ and $a$ in our model. While the $a_n$ model the concrete associations of the training data, we simultaneously learn about the mode distributions in the input space. At the same time, we avoid overfitting via the second expectation in (6). Critically, we do not marginalize the $\alpha$ GPs, but infer a posterior belief which can be used at prediction time. This separation allows practitioners to easily encode knowledge about the modes via the priors on $\alpha$. For example, different mean functions can be used to encode that one of the modes is expected to explain more of the data. Note that both [20] and truncated Dirichlet processes are special cases of our model.
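The role of the mean functions can be illustrated with a small sketch. It is not the implementation from the paper: the RBF kernel, the constant prior means in `prior_means`, and the softmax link from $\alpha$ to assignment probabilities are all assumptions made for illustration, since the response does not spell them out. Under these assumptions, giving one mode a larger prior mean raises its prior assignment probability across the input space.

```python
import numpy as np

def rbf_kernel(x, lengthscale=0.3, variance=1.0):
    """Squared-exponential covariance matrix for 1-D inputs."""
    sq_dist = (x[:, None] - x[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

rng = np.random.default_rng(0)
num_samples, num_modes, num_points = 200, 2, 50
x = np.linspace(0.0, 1.0, num_points)

# Cholesky factor of the prior covariance (jitter added for numerical stability).
chol = np.linalg.cholesky(rbf_kernel(x) + 1e-6 * np.eye(num_points))

# Hypothetical constant mean functions: mode 0 is expected to explain more data.
prior_means = np.array([2.0, 0.0])

# Prior samples of alpha with shape (samples, modes, points).
eps = rng.standard_normal((num_samples, num_modes, num_points))
alpha = prior_means[None, :, None] + eps @ chol.T

# Assumed softmax link: alpha values -> assignment probabilities per input.
shifted = alpha - alpha.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Average prior weight of each mode over samples and input locations.
mean_weight = probs.mean(axis=(0, 2))
```

Because the sampled $\alpha$ functions vary smoothly with `x`, the resulting assignment probabilities also depend on the input position, unlike a single global Multinomial prior.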

## R1 and R2: Significance of the Inference Algorithm

As stated above, the drawbacks in the discussed references are mostly due to the difficulties of inference in this model implied by the discrete choice of assigning data to the different modes. Both [20] and [5] avoid the discrete optimization problem by using a relaxation discussed in our paper where the data likelihood is calculated as a convex combination of all modes whose weights are the assignment probabilities. This is problematic for multiple reasons, most importantly because during inference the model can choose to explain data as a mixture of modes, which contradicts the model's assumptions. The inference algorithms must therefore somehow enforce the sparseness of the assignment probabilities. We propose a different and principled solution in which we enforce this sparseness during inference while still obtaining gradient information for continuous optimization.

## R1 and R2: The Inference Algorithm

The drawbacks in the related work are mostly due to the difficulties of inference implied by the discrete choice of assigning data to the different modes. Both [20] and [5] avoid this discrete optimization problem by using a relaxation discussed in our paper where the data likelihood is calculated as a convex combination of all modes via the assignment probabilities. This is problematic because the model can choose to explain data as a mixture of the modes, which contradicts the original assumptions. To avoid this, the sparseness of the assignments has to be achieved in an additional step. We propose a different and principled solution in which we enforce this sparseness during inference while still obtaining gradient information for continuous optimization.

We are able to enforce sparseness of solutions using Concrete random variables because our inference algorithm does not rely on closed form solutions of specific integrals such as the combination of GP function values and the data likelihood. We achieve this by adapting doubly stochastic variational inference to our model, which results in a variational bound which jointly considers the association problem and models for the different modes. Optimizing the variational bound yields a posterior belief about all parts of the model which can directly be applied to prediction. Importantly, we can use arbitrary likelihoods or kernel functions as long as we are able to propagate single samples through the model. We illustrate this by using deep GP models as generative processes which can be trained inside our model via a direct extension to the inference algorithm. Additionally, inference is efficient due to stochastic optimization which allows us to sample the gradients of the variational bound by propagating single samples through the model, avoiding the potentially high costs of computation implied by approaches such as quadrature.

Our inference scheme does not rely on a closed form solution for the marginal likelihood. We achieve this by adapting doubly stochastic variational inference (DSVI) to our model, yielding a variational bound which jointly considers the association problem and the mode models. Optimizing the bound yields separate posterior beliefs about all parts of the model. Importantly, we can use arbitrary likelihoods or kernel functions as long as we are able to propagate single samples through the model. We illustrate this by using deep GP models as generative processes which can be trained inside our model via a direct extension of the inference algorithm. Additionally, inference is efficient due to stochastic optimization which allows us to sample the gradients of the variational bound by propagating single samples through the model, avoiding the potentially high costs of computation implied by approaches such as quadrature.
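To make the idea of sparse yet continuous assignments concrete, the following is a minimal NumPy sketch of sampling from a Concrete (Gumbel-softmax) relaxation of a categorical variable. It is an illustration only and not the inference code used for the paper: the function name `sample_concrete`, the assignment probabilities in `probs`, and the two temperatures are hypothetical choices.

```python
import numpy as np

def sample_concrete(logits, temperature, rng):
    """Draw one relaxed one-hot sample per row of `logits`."""
    # Perturb the logits with standard Gumbel noise.
    gumbel = rng.gumbel(size=logits.shape)
    scaled = (logits + gumbel) / temperature
    # Softmax over the last axis, shifted for numerical stability.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    weights = np.exp(scaled)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical assignment probabilities for 1000 data points and K = 2 modes.
probs = np.tile(np.array([0.6, 0.4]), (1000, 1))
logits = np.log(probs)

# High temperature: samples stay close to a soft mixture of the modes.
warm = sample_concrete(logits, temperature=5.0, rng=rng)
# Low temperature: samples are nearly one-hot, i.e. sparse assignments.
cold = sample_concrete(logits, temperature=0.05, rng=rng)

# Fraction of points whose largest weight falls on mode 0.
frac_mode0 = (cold.argmax(axis=-1) == 0).mean()
```

At low temperature almost all of the weight of each sample sits on a single mode, while the sample remains a differentiable function of the logits. The arg-max of each sample follows the underlying categorical distribution regardless of the temperature.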

## R1, R2 and R3: Interpretation of the Experiments and Prior Knowledge

Due to the under-specification of the data association problem, results obtained from experiments are highly dependent on modelling assumptions and therefore assumed priors. In the second experiment, we want to recover signal from very noisy data (80% in the extreme case). Here, assumptions about the signal and the noise processes are critical for learning. Indeed, multiple approaches to robust regression quickly fail on this data set as shown in Table 1, the main reason being that the assumptions made by the respective models about the generative process are wrong and cannot easily be changed. We show that in our model, the assumptions about the different generative processes and the assignment process can easily be stated in terms of GP priors and likelihoods. Encoding the knowledge given by the problem statement that one of the modes is a noise process and does not contain any signal is enough to recover the underlying signal about which we make no special assumptions.

> NOTE(mrksr): Should we argue that hacks like a Student-t likelihood are not principled and should not be the model of choice?

## R1, R2 and R3: Interpretation of the Experiments

Due to the under-specification of the problem, results obtained from experiments are highly dependent on modelling assumptions. In the second experiment, we want to recover signal from very noisy data (up to 80%). Here, assumptions about the noise processes are critical for learning. Multiple approaches to robust regression quickly fail on this data set as shown in Table 1, the main reason being that the assumptions made by the respective models are wrong and cannot easily be changed. We show that in our model, the assumptions about the generative processes and the assignments can easily be stated in terms of GP priors and likelihoods. Encoding the problem statement in the form of a white kernel is enough to recover the underlying signal about which we make no special assumptions. In contrast, a standard GP with a Student-t likelihood is not a principled modelling choice in this case as it does not reflect the generative process but rather becomes an approximation in the inference scheme.
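As a rough illustration of why a white kernel encodes a pure noise process, the sketch below compares the prior covariance of a white kernel with that of a smooth kernel. The RBF kernel for the signal mode, the hyperparameter values, and the helper names are assumptions chosen for illustration and are not taken from the experiments.

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Smooth kernel: nearby inputs have strongly correlated function values."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def white_kernel(x1, x2, variance=1.0):
    """White kernel: function values at distinct inputs are uncorrelated."""
    return variance * (x1[:, None] == x2[None, :]).astype(float)

x = np.linspace(0.0, 1.0, 5)

# Prior covariance of a mode that is assumed to carry signal.
K_signal = rbf_kernel(x, x)
# Prior covariance of a mode that is assumed to be pure noise.
K_noise = white_kernel(x, x)
```

The white kernel yields a diagonal covariance matrix, so a mode with this prior cannot represent any structure across inputs, whereas the smooth kernel correlates neighbouring points and can therefore capture the signal.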

We agree that an experiment with $K > 2$ is illustrative and will replace the first experiment with a dataset with more modes in the final version of this paper.