On Friday, November 28, 2014, the Brussels AI-lab organized the Dutch-Belgian Reinforcement Learning workshop. This was the 3rd edition of the workshop, which was previously held in Delft and Maastricht. The event was hosted by the Vrije Universiteit Brussel and cosponsored by the BNVKI. In total, eight researchers gave talks about their research on various aspects of reinforcement learning. The workshop concluded with an informal social gathering of the local reinforcement learning community. Abstracts for the talks can be found below:

**Frederik Ruelens (KULeuven): Batch Reinforcement Learning Techniques for Demand Response Applications**

**Abstract:** This talk presents how state-of-the-art advances in Batch Reinforcement Learning (BRL) can help transition Demand Response (DR) into practical implementations at household level. DR is considered a key enabling technology for integrating the increasing share of renewable energy. DR allows consumers to shift their consumption of controllable loads, such as heat pumps, electric water heaters and air-conditioners, to moments when prices are low. DR has been the topic of an extensive list of research projects and scientific papers. Most of these projects define DR as a model-based control problem, requiring a model and an optimizer. A critical step in setting up a model-based controller, however, consists in estimating the model parameters and generating reasonable forecasts of, e.g., the heat and electricity demand. This step becomes even more challenging considering the heterogeneity of the end users, i.e. different end users are expected to have different model parameters and even different models. As such, a large-scale implementation of model-based controllers requires a stable and robust approach that is able to identify the appropriate model and the subsequent model parameters. Driven by this challenge, the presented work aims to give an overview of practical model-free control approaches and how they

**Tim Brys (Vrije Universiteit Brussel): Multi-Objectivization and Ensemble Techniques**

**Abstract:** Multi-objectivization is the process of taking a single-objective problem and transforming it into a multi-objective problem in order to solve the original problem faster and/or better. This is either done through decomposition of the original objective, or the addition of extra objectives, typically based on some (heuristic) domain knowledge. This process basically creates a diverse set of feedback signals for what is underneath still a single-objective problem. This naturally leads to the use of ensemble techniques to solve such problems. One can train a different (instance of an) algorithm on each of the signals, strategically combining their output for decision making. We argue for the combination of multi-objectivization and ensemble techniques as a powerful tool to boost solving performance in many domains and research areas, and describe how this strategy can be implemented in Reinforcement Learning, demonstrating with a range of experiments the potential of the approach.
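As a rough illustration of training one learner per feedback signal and combining their outputs, the sketch below runs three tabular Q-learners on a toy corridor task and selects actions by majority vote. The environment, the two heuristic extra signals, the voting rule, and every name in the code (`signals`, `ensemble_action`, `train`, and the constants) are assumptions made for this example and are not taken from the talk.

```python
import random

CORRIDOR_LENGTH = 6          # states 0..5, goal at state 5 (toy assumption)
ACTIONS = (-1, +1)           # move left / move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1


def step(state, action):
    """Deterministic corridor dynamics; returns (next_state, done)."""
    next_state = min(max(state + action, 0), CORRIDOR_LENGTH - 1)
    return next_state, next_state == CORRIDOR_LENGTH - 1


def signals(state, next_state, done):
    """One feedback signal per learner: the original objective plus two
    hypothetical heuristic signals standing in for domain knowledge."""
    base = 1.0 if done else 0.0
    progress = base + 0.1 * (next_state - state)   # favours moving right
    step_cost = base - 0.01                        # penalises wasted steps
    return (base, progress, step_cost)


def ensemble_action(q_tables, state, rng):
    """Majority vote: each learner votes for all of its greedy actions."""
    votes = [0] * len(ACTIONS)
    for q in q_tables:
        best = max(q[state])
        for a, value in enumerate(q[state]):
            if value == best:
                votes[a] += 1
    top = max(votes)
    return rng.choice([a for a, v in enumerate(votes) if v == top])


def train(episodes=200, seed=0):
    """Train one Q-table per signal on the shared experience stream."""
    rng = random.Random(seed)
    q_tables = [[[0.0] * len(ACTIONS) for _ in range(CORRIDOR_LENGTH)]
                for _ in range(3)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                       # cap the episode length
            if rng.random() < EPSILON:
                a = rng.randrange(len(ACTIONS))
            else:
                a = ensemble_action(q_tables, state, rng)
            next_state, done = step(state, ACTIONS[a])
            rewards = signals(state, next_state, done)
            for q, r in zip(q_tables, rewards):    # each learner sees its own signal
                target = r if done else r + GAMMA * max(q[next_state])
                q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
            if done:
                break
    return q_tables


if __name__ == "__main__":
    tables = train()
    rng = random.Random(1)
    print([ACTIONS[ensemble_action(tables, s, rng)]
           for s in range(CORRIDOR_LENGTH - 1)])
```

Majority voting is only one possible combination rule; summing Q-values or rank-based voting would slot into `ensemble_action` in the same way.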

**Daan Bloembergen (Maastricht University / University of Liverpool): Learning dynamics in (social) networks**

**Abstract:** Many real-world scenarios can be modelled as multi-agent systems, where multiple autonomous decision makers interact in a single environment. The complex and dynamic nature of such interactions prevents hand-crafting solutions for all possible scenarios, hence learning is crucial. Studying the dynamics of multi-agent learning is imperative in selecting and tuning the right learning algorithm for the task at hand. So far, analysis of these dynamics has been mainly limited to normal-form games, or unstructured populations. However, many multi-agent systems are highly structured, complex networks, with agents only interacting locally. Here, we study the dynamics of such networked interactions, using the well-known replicator dynamics of evolutionary game theory as a model for learning. Different learning algorithms are modelled by altering the replicator equations slightly. In particular, we investigate lenience as an enabler for cooperation. We show that lenience improves coordination and speeds up convergence in complex social networks. Moreover, we investigate the impact of structural network properties on the learning outcome, and find that more densely connected networks yield a higher level of cooperation.
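For readers unfamiliar with the replicator dynamics, the sketch below integrates the standard single-population replicator equation, dx_i/dt = x_i ((Ax)_i - x^T A x), with a simple Euler scheme. It covers only the unstructured-population baseline: the networked interactions and the lenient variant studied in the talk are not modelled here, and the Stag Hunt payoff matrix, step size, and function names are assumptions chosen for illustration.

```python
def replicator_step(x, payoff, dt=0.01):
    """One Euler step of dx_i/dt = x_i * ((A x)_i - x^T A x)."""
    fitness = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in payoff]
    average = sum(x_i * f_i for x_i, f_i in zip(x, fitness))
    return [x_i + dt * x_i * (f_i - average) for x_i, f_i in zip(x, fitness)]


def simulate(x0, payoff, steps=5000, dt=0.01):
    """Integrate the replicator equation from the initial strategy mix x0."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x


# Stag Hunt payoffs for the row player (an illustrative choice of game).
STAG_HUNT = [[4.0, 0.0],   # stag against (stag, hare)
             [3.0, 3.0]]   # hare against (stag, hare)

if __name__ == "__main__":
    # The mixed equilibrium sits at a stag share of 0.75, so starting points
    # on either side of it drift towards different pure outcomes.
    print(simulate([0.8, 0.2], STAG_HUNT))
    print(simulate([0.7, 0.3], STAG_HUNT))
```

In this game the initial share of cooperators decides which equilibrium the population reaches, which is the kind of coordination question the abstract refers to.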

**Madalina Drugan (Vrije Universiteit Brussel): Multi-objective multi-armed bandits**

**Abstract:** The multi-objective multi-armed bandits (MOMAB) paradigm extends multi-armed bandits (MAB) to reward vectors instead of scalar reward values. MOMAB differs from standard MAB in important ways, since several arms can be considered optimal according to their reward vectors. Techniques from multi-objective optimisation are used to create MOMAB algorithms with an efficient exploration/exploitation trade-off for complex and large multi-objective stochastic environments. Theoretical analysis is an important aspect of the MOMAB paradigm because MABs are considered a simplified theoretical framework of reinforcement learning with a single state. We give an overview of the MOMAB algorithms, their analysis and the corresponding experimental methodology.
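To make concrete why several arms can be optimal at once, the sketch below computes the set of arms whose mean reward vectors are not Pareto-dominated. The abstract does not say which order relation the algorithms use; Pareto dominance is assumed here as one common choice in multi-objective optimisation, and the arm values and function names are made up for the example.

```python
def dominates(u, v):
    """True if reward vector u Pareto-dominates v (all objectives maximised)."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))


def pareto_optimal_arms(mean_rewards):
    """Indices of arms whose mean reward vector no other arm dominates."""
    return [i for i, u in enumerate(mean_rewards)
            if not any(dominates(v, u)
                       for j, v in enumerate(mean_rewards) if j != i)]


# Hypothetical mean reward vectors for four arms and two objectives.
arms = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]
print(pareto_optimal_arms(arms))  # [0, 1, 2]
```

Only the last arm is dominated (by the second), so three arms remain incomparable and a learner has to explore all of them rather than converge on a single best arm.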

**Subramanya Nageshrao (TU Delft): Exploiting Submodular Value Functions for Faster Dynamic Sensor Selection**

**Abstract:** The 20th century was a golden era for linear control theory, as numerous implementable algorithms were developed to solve the linear synthesis problem. Unfortunately, this epoch is missing in the history of nonlinear control. This can be partly attributed to the available nonlinear methods. Generally, these methods are mathematically elegant and often allude to asymptotic stability. However, they are extremely difficult to apply, because more often than not the nonlinear synthesis problem involves solving a complex set of (underdetermined) equations. One upcoming methodology to address this hurdle is to use the advances made in the AI community, particularly in machine learning, to solve the feedback control problem. This brings us to self-learning feedback control approaches. The presentation showed the capabilities of reinforcement learning in solving a specific type of nonlinear control problem called passivity-based control.

**Yash Satsangi (University of Amsterdam): Exploiting Submodular Value Functions for Faster Dynamic Sensor Selection**

**Abstract:** A key challenge in the design of multi-sensor systems is the efficient allocation of scarce resources such as bandwidth, CPU cycles, and energy, leading to the dynamic sensor selection problem in which a subset of the available sensors must be selected at each timestep. While partially observable Markov decision processes (POMDPs) provide a natural decision-theoretic model for this problem, the computational cost of POMDP planning grows exponentially in the number of sensors, making it feasible only for small problems.

We propose a new POMDP planning method that uses greedy maximization to greatly improve scalability in the number of sensors. We show that, under certain conditions, the value function of a dynamic sensor selection POMDP is submodular and use this result to bound the error introduced by performing greedy maximization. Experimental results on a real-world dataset from a multi-camera tracking system in a shopping mall show it achieves similar performance to existing methods but incurs only a fraction of the computational cost, leading to much better scalability in the number of cameras.
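The greedy step can be illustrated outside the POMDP setting with a simple coverage function, which is monotone and submodular. The sketch below is not the planning method from the talk: the sensor names, the cell sets, and the `coverage` objective are assumptions made for this example. For such functions under a cardinality constraint, the classic result of Nemhauser, Wolsey and Fisher guarantees that greedy selection reaches at least a (1 - 1/e) fraction of the optimal value; the bound derived in the talk may take a different form.

```python
def coverage(selected, sensor_cells):
    """Number of distinct cells observed by the selected sensors."""
    covered = set()
    for sensor in selected:
        covered |= sensor_cells[sensor]
    return len(covered)


def greedy_select(sensor_cells, k, value=coverage):
    """Pick k sensors one at a time, always adding the best marginal gain."""
    selected = []
    remaining = set(sensor_cells)
    for _ in range(min(k, len(sensor_cells))):
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining),
                   key=lambda s: value(selected + [s], sensor_cells))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical cameras and the floor cells each one can see.
SENSOR_CELLS = {
    "cam_a": {1, 2, 3, 4},
    "cam_b": {3, 4, 5},
    "cam_c": {5, 6},
    "cam_d": {1, 2},
}
print(greedy_select(SENSOR_CELLS, 2))  # ['cam_a', 'cam_c']
```

Greedy selection evaluates on the order of n·k candidate sets instead of all subsets of size k, which is where the scalability gain in the number of sensors comes from.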

**Davide Zambrano (Centrum Wiskunde & Informatica): Toward a Continuous Time AuGMEnT**

**Abstract:** How can we solve a problem when the solution requires an internal representation of elapsed time? How do we efficiently deal with long delays in a reinforcement learning task? Many tasks require actions to be selected at the right time and in the right order. In standard serial compound representations, long delays either correspond to a great many state-action transitions (with the corresponding difficulty in learning), or the temporal resolution has to be reduced substantially. In both cases, the discrete-time representation fails to reproduce signals that match the real measured data.

The Attention Gated Reinforcement Learning (AGREL) model uses selective attention and neuromodulatory signals to determine the plasticity of sensory representations (Roelfsema & van Ooyen, 2005). An extension of this model efficiently solves non-linear problems by involving working memory neurons in the learning scheme (Rombouts, Bohte, & Roelfsema, 2012). This extension, Attention-Gated MEmory Tagging (AuGMEnT), implements the SARSA algorithm in a biologically plausible neural network architecture. However, as in most reinforcement learning formalizations, these algorithms are defined in a compound formalization in which a precise representation of time is absent.
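For reference, the SARSA rule itself can be written in a few lines. The sketch below is a plain tabular version, not the neural network implementation used in AuGMEnT; the state and action names, the learning rate, and the discount factor are assumptions for illustration. It shows the temporal difference (TD) error that drives learning and the fact that each update corresponds to exactly one discrete time step.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular SARSA update in place and return the TD error."""
    # No bootstrapping from the next state-action pair at the end of a trial.
    bootstrap = 0.0 if terminal else gamma * q[next_state][next_action]
    td_error = reward + bootstrap - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical two-state task with "wait" and "go" actions.
q = {"s0": {"wait": 0.0, "go": 0.5},
     "s1": {"wait": 0.0, "go": 1.0}}
delta = sarsa_update(q, "s0", "go", reward=0.0,
                     next_state="s1", next_action="go")
print(delta, q["s0"]["go"])
```

Because the rule only looks one transition ahead, a long delay split into many small time steps requires value information to propagate back through many updates.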

AuGMEnT was originally defined to solve discrete-time Markov Decision Problems (MDPs). A continuous-time generalization of MDPs is known as Semi-Markov Decision Problems (SMDPs), defined in (Bradtke & Duff, 1995). A continuous-space, continuous-time representation has been described in (Doya, 2000).

Experimental results show:

- Delayed reward can successfully be implemented in AuGMEnT, although a general increase in the number of trials needed to reach convergence was observed.
- Adding the reward as an effective input to the network increases the plausibility of the network because:
  - it solves the problem of the negative Temporal Difference (TD) error in the case of omitted reward;
  - it avoids an external reset of the value expectation in the terminal state.
- Shrinking the time step degrades the convergence rate, due to the lower probability that the network finds the right combination of actions during exploration, as explained in (Baird, 1993).
- However, letting each action last as long as the original time step, and thus updating the network every dt using the same Tags of the previously selected action, brings performance back to the original level.

**Quentin Gemine (Université de Liège): A Reinforcement Learning Framework for Active Network Management**

**Abstract:** In order to operate an electrical distribution network in a secure and cost-efficient way, it is necessary, due to the rise of renewable energy-based distributed generation, to develop Active Network Management (ANM) strategies. These strategies rely on short-term policies that control the power injected by generators and/or taken off by loads in order to avoid congestion or voltage problems. While simple ANM strategies would curtail the production of generators, more advanced ones would move the consumption of loads to relevant time periods to maximize the potential of renewable energy sources. However, such advanced strategies imply solving large-scale optimal sequential decision-making problems under uncertainty, something that is understandably complicated. In order to promote the development of computational techniques for active network management, we detail a generic procedure for formulating ANM decision problems as Markov decision processes. We also specialize it to a 75-bus distribution network. The resulting test instance is available at http://www.montefiore.ulg.ac.be/~anm/. It can be used as a test bed for comparing existing computational techniques, as well as for developing new ones. A solution technique that consists of an approximate multistage program is also illustrated on the test instance.