In the first presentation, Nico Roos proposed a model for describing a network of distributed services for task execution, and showed that, in this model, reinforcement learning can be applied to learn to cooperate well even in a self-interested setting. Specifically, when agents can decide which agent should perform the next subtask, can assess the quality of service of other agents, and share in the reward (and responsibility) for completing the full task, they learn to hand off their tasks to the best possible next agent. Agents do not have to be fully cooperative to learn how to cooperate – which is important, as they learn to assign blame when the quality of service is low – but have a natural incentive both to be truthful and to cooperate.
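The talk summary does not spell out the learning rule, but the setting can be illustrated with a minimal sketch: assume each agent keeps a tabular value estimate over which neighbour to delegate the next subtask to, and updates it from the reward shared once the full task completes. The class and parameter names below are purely illustrative and not taken from the paper.

```python
import random
from collections import defaultdict

class HandoffAgent:
    """Learns, per subtask, which neighbouring agent to hand the next subtask to."""

    def __init__(self, name, neighbours, alpha=0.1, epsilon=0.1):
        self.name = name
        self.neighbours = neighbours          # agents this one can delegate to
        self.q = defaultdict(float)           # value of (subtask, next_agent) pairs
        self.alpha = alpha                    # learning rate
        self.epsilon = epsilon                # exploration rate

    def choose_next(self, subtask):
        """Epsilon-greedy choice of the agent that should perform the next subtask."""
        if random.random() < self.epsilon:
            return random.choice(self.neighbours)
        return max(self.neighbours, key=lambda a: self.q[(subtask, a)])

    def update(self, subtask, next_agent, shared_reward):
        """Move the estimate towards the reward shared when the full task completes."""
        key = (subtask, next_agent)
        self.q[key] += self.alpha * (shared_reward - self.q[key])
```

Because the reward only arrives when the whole task is completed (and is low when some agent delivers poor quality of service), each agent is naturally pushed towards delegating to the neighbours that actually perform well.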

Then, Jelle Visser took the floor to introduce their approach to learning-from-demonstration as a prequel to independent reinforcement learning in Donkey Kong. This approach mitigates the problem of having to explore randomly when learning from scratch, which leads to a very long initial period in which the agent does not appear to learn. By playing a hundred games (50 each by two people), they initialised a policy and a critic for an actor-critic RL approach – ACLA – which then continues learning independently of the human demonstrations. When the initial policy was already reasonably good, the authors found that ACLA did not improve much beyond it, but when the initial policy was not yet good, ACLA did help to improve it. The authors conclude that these preliminary results are inconclusive, and that more research is required to ensure that policies keep improving after initialisation by learning-from-demonstration.
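The initialisation step can be thought of as behavioural cloning: fit the policy to the recorded human state-action pairs before handing it over to the actor-critic learner. The sketch below assumes a small PyTorch policy network and a supervised pre-training loop; it is not the authors' ACLA implementation, and all names and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Simple policy over discrete actions, later used as the actor in actor-critic RL."""
    def __init__(self, n_features, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, state):
        return F.log_softmax(self.net(state), dim=-1)

def pretrain_from_demonstrations(policy, demo_states, demo_actions, epochs=10, lr=1e-3):
    """Behavioural cloning on human play: demo_states is a (N, n_features) float tensor,
    demo_actions a (N,) tensor of action indices taken by the demonstrators."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        log_probs = policy(demo_states)               # predicted log-probabilities
        loss = F.nll_loss(log_probs, demo_actions)    # cross-entropy against the demos
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

After this supervised phase, the pre-trained policy (and a similarly initialised critic) would be plugged into the RL loop, which explains why the benefit depends on how good the initial policy already is.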

Last but not least, Thomas Moerland introduced us to his approach to variational inference for learning multi-modal transition models in model-based RL. This is important for capturing the complex stochastic environment dynamics that underlie many real-world problems. The authors show that, by introducing unobserved latent variables, a variational autoencoder can learn a generative model of a multi-modal distribution over the transition dynamics. This result may well lead to many exciting new model-based RL methods based on (deep) neural networks in the future.
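To make the idea concrete, here is a minimal sketch of a conditional VAE for a transition model p(s' | s, a), where a latent variable z captures which mode of the next-state distribution is realised. This is an assumption about the general technique, not the authors' architecture; all layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionVAE(nn.Module):
    """Conditional VAE for p(s' | s, a) with an unobserved latent z."""
    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=64):
        super().__init__()
        # Inference network q(z | s, a, s')
        self.enc = nn.Sequential(nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        # Generative network p(s' | s, a, z)
        self.dec = nn.Sequential(nn.Linear(state_dim + action_dim + latent_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, state_dim))

    def forward(self, s, a, s_next):
        h = self.enc(torch.cat([s, a, s_next], dim=-1))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation trick
        s_next_hat = self.dec(torch.cat([s, a, z], dim=-1))
        recon = F.mse_loss(s_next_hat, s_next)                    # reconstruction term
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                                         # variational loss (reconstruction + KL)
```

At prediction time, sampling different z values from the prior and decoding them with the same (s, a) yields different plausible next states, which is exactly the multi-modality a deterministic transition model cannot express.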