We would like to thank the reviewers for their comments and suggestions. We will address them below.
As described in the paper, the data association problem is under-specified, since any data can be explained using enough modes. However, using many components naturally leads to uninformative models. This trade-off has to be addressed via assumptions on both the underlying functions and the associations. In the introduction, we discuss multiple approaches and their drawbacks, which usually stem from compromises necessary to facilitate inference. An example of such a drawback can be seen in  where inference on the associations is only possible with a constant Multinomial prior. This makes the relative importance of different modes a global parameter, which is clearly limiting. Similar problems arise when using a Dirichlet process prior on the associations.
In our model, we seek to make assumptions from a broader class of priors, which allows us to make more informative assumptions while still providing a tractable inference scheme. By placing a spatial GP prior on the $\alpha$-parameters, we assume that the importance of the generating processes depends on the position in the input space. In contrast to mixture-of-experts models, we do not assume that the associations are independent of $y$. This translates to the separation of $\alpha$ and $a$ in our model. While the $a_n$ model the concrete associations of the training data, we simultaneously learn about the mode distributions in the input space. At the same time, we avoid overfitting via the second expectation in (6). Critically, we do not marginalize the $\alpha$ GPs, but infer a posterior belief which can be used at prediction time. This separation allows practitioners to easily encode knowledge about the modes via the priors on $\alpha$. For example, different mean functions can be used to encode that one of the modes is expected to explain more of the data. Note that both  and truncated Dirichlet processes are special cases of our model.
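To make the role of the $\alpha$ prior concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes an RBF kernel, constant mean functions, and a softmax link from $\alpha$ to assignment probabilities; all three are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel (an assumed choice for this sketch).
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_assignment_probs(x, mean_fns, rng, jitter=1e-6):
    # Draw one GP sample of alpha per mode, then map the K values at each
    # input to probabilities with a softmax (an assumed link function).
    K_xx = rbf_kernel(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K_xx)
    alpha = np.stack(
        [m(x) + L @ rng.standard_normal(len(x)) for m in mean_fns], axis=1
    )  # shape (N, K)
    alpha -= alpha.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(alpha)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
# A larger prior mean for mode 0 encodes the belief that it explains more data.
mean_fns = [lambda s: np.full_like(s, 1.0), lambda s: np.zeros_like(s)]
probs = sample_assignment_probs(x, mean_fns, rng)
```

Each row of `probs` is a distribution over modes at one input location, so the relative importance of the modes varies smoothly with $x$ instead of being a single global parameter.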
The drawbacks in the related work are mostly due to the difficulties of inference implied by the discrete choice of assigning data to the different modes. Both  and  avoid this discrete optimization problem by using a relaxation discussed in our paper, where the data likelihood is calculated as a convex combination of all modes via the assignment probabilities. This is problematic because the model can choose to explain data as a mixture of the modes, which contradicts the original assumptions. To avoid this, the sparseness of the assignments has to be achieved in an additional step. We propose a different and principled solution in which we enforce this sparseness during inference while still obtaining gradient information for continuous optimization.
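The issue with the relaxation can be illustrated with a toy example. This is a sketch under a stated assumption: the relaxed likelihood is taken to be a Gaussian centred on the assignment-weighted combination of the mode predictions, which may differ in detail from the formulations in the related work. The function names and numbers are hypothetical.

```python
import numpy as np

def gaussian_logpdf(y, mean, noise_var):
    return -0.5 * np.log(2.0 * np.pi * noise_var) - 0.5 * (y - mean) ** 2 / noise_var

def blended_loglik(y, mode_means, weights, noise_var):
    # Assumed relaxation: the prediction is a convex combination of the modes.
    return gaussian_logpdf(y, np.dot(weights, mode_means), noise_var)

mode_means = np.array([-1.0, 1.0])  # two modes predicting -1 and +1
y = 0.0                             # an observation that neither mode generates
noise_var = 0.01

# Soft weights place the blended prediction exactly on the observation.
relaxed = blended_loglik(y, mode_means, np.array([0.5, 0.5]), noise_var)

# One-hot weights force a single mode to explain the observation.
hard = max(
    blended_loglik(y, mode_means, np.eye(2)[k], noise_var) for k in range(2)
)
# With these numbers, `hard` is exactly 50 nats below `relaxed`.
```

Under the relaxation, the point between the modes is explained well even though no single generative process produces it. With one-hot assignments the same point is penalized, which is the behaviour the original modelling assumptions call for.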
Our inference scheme does not rely on a closed-form solution for the marginal likelihood. We achieve this by adapting DSVI to our model, yielding a variational bound which jointly considers the association problem and the mode models. Optimizing the bound yields separate posterior beliefs about all parts of the model. Importantly, we can use arbitrary likelihoods or kernel functions as long as we are able to propagate single samples through the model. We illustrate this by using deep GP models as generative processes, which can be trained inside our model via a direct extension of the inference algorithm. Additionally, inference is efficient due to stochastic optimization, which allows us to sample the gradients of the variational bound by propagating single samples through the model, avoiding the potentially high computational cost of approaches such as quadrature.
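The idea of estimating an expectation in the bound from single samples can be sketched as follows. This is an illustration only, not the paper's inference code: it assumes a one-dimensional Gaussian variational posterior over a function value and a Gaussian likelihood, chosen because the exact expectation is then available for comparison. All names and values are hypothetical.

```python
import numpy as np

def gaussian_logpdf(y, mean, noise_var):
    return -0.5 * np.log(2.0 * np.pi * noise_var) - 0.5 * (y - mean) ** 2 / noise_var

def sampled_expected_loglik(y, q_mean, q_var, noise_var, rng, num_samples=1):
    # Reparameterized samples f = mean + std * eps, pushed through the likelihood.
    eps = rng.standard_normal(num_samples)
    f = q_mean + np.sqrt(q_var) * eps
    return gaussian_logpdf(y, f, noise_var).mean()

def exact_expected_loglik(y, q_mean, q_var, noise_var):
    # Closed form of E_q[log N(y | f, noise_var)] for Gaussian q(f).
    return (-0.5 * np.log(2.0 * np.pi * noise_var)
            - 0.5 * ((y - q_mean) ** 2 + q_var) / noise_var)

rng = np.random.default_rng(0)
y, q_mean, q_var, noise_var = 0.3, 0.1, 0.2, 0.5

single = sampled_expected_loglik(y, q_mean, q_var, noise_var, rng)  # noisy, unbiased
averaged = sampled_expected_loglik(y, q_mean, q_var, noise_var, rng, num_samples=200_000)
exact = exact_expected_loglik(y, q_mean, q_var, noise_var)
```

A single sample gives a noisy but unbiased estimate, and `averaged` approaches `exact` as the number of samples grows. The sampling path only requires evaluating the likelihood at a sample, so it carries over to likelihoods for which no closed form exists.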
Due to the under-specification of the problem, results obtained from experiments are highly dependent on modelling assumptions. In the second experiment, we want to recover signal from very noisy data (up to 80%). Here, assumptions about the noise processes are critical for learning. Multiple approaches to robust regression quickly fail on this data set, as shown in Table 1, mainly because the assumptions made by the respective models are wrong and cannot easily be changed. We show that in our model, the assumptions about the generative processes and the assignments can easily be stated in terms of GP priors and likelihoods. Encoding the problem statement in the form of a white kernel is enough to recover the underlying signal, about which we make no special assumptions. In contrast, a standard GP with a Student-t likelihood is not a principled modelling choice in this case, as it does not reflect the generative process but rather becomes an approximation in the inference scheme.
We agree that an experiment with $K > 2$ is illustrative and will replace the first experiment with a dataset with more modes in the final version of this paper.