Reinforcement learning (RL) is an area of machine learning that studies how intelligent systems (agents) can learn “what to do” solely by interacting with their environment. It is an iterative process in which the agent takes actions and, as a result, receives a new observation of its surroundings plus a scalar feedback signal (reward) that tells it how well it is doing at solving a particular task. For example, if the agent is playing chess, it must propose valid moves and receives a high reward once it has won the game and a low reward if it loses.
Crucially, the agent is not told which actions to take, but instead must discover strategies that maximize the reward it receives as a result of its actions. This is what makes reinforcement learning so different from supervised learning, where the system is always provided with the optimal decision as a feedback signal. To learn complex tasks, the agent must therefore independently strike a balance between trying out new actions (exploration) and applying strategies that worked well in previous attempts (exploitation).
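To make the interaction loop and the exploration/exploitation trade-off more concrete, here is a minimal sketch in Python. It is not taken from any specific RL system: the toy environment (three actions with hidden average rewards), the function name `run_bandit` and the epsilon-greedy rule with `epsilon=0.1` are all assumptions chosen purely for illustration.

```python
import random


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Toy interaction loop: act, receive a reward, update the estimates.

    `true_means` are the hidden average rewards of each action
    (hypothetical values, unknown to the agent).
    """
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # the agent's current reward estimates
    counts = [0] * n_actions       # how often each action was tried
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploitation: pick the action that looks best so far.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment answers with a noisy scalar reward.
        reward = rng.gauss(true_means[action], 0.5)

        # Incremental average of the rewards observed for this action.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, counts, total_reward


estimates, counts, total_reward = run_bandit([0.1, 0.5, 0.9])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("times each action was chosen:", counts)
```

Nobody tells the agent that the third action is the best one. It only finds out because it occasionally explores, and it only profits from that knowledge because it exploits most of the time.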
Not smart enough for the real world yet?
Deep reinforcement learning (DRL), the combination of deep learning and reinforcement learning, has been successfully applied to many complex problems, for example, learning how to play difficult board games such as Go, chess or shogi, playing computer games such as Super Mario or Space Invaders, or even controlling a robotic hand to solve a Rubik’s Cube.
While it offers a promising path to solving tasks that currently cannot be solved by any other approach, RL is not yet commonly applied to real-world scenarios. One of the main reasons for this is the lack of safety guarantees for RL agents, especially in situations that differ substantially from their learning environment. Many state-of-the-art RL agents are trained in a so-called closed world setting. This means that it is assumed they will be deployed in the same environment in which they were trained.
While this assumption may hold true for many applications, for instance in computer games or board games, it clearly cannot be justified in applications such as autonomous driving, where encountering novel situations is inevitable. Decision-making agents must therefore be designed for open world scenarios, where environmental observations may not be entirely consistent with the data that was observed during training. The problem is, however, that such samples can cause decision-making systems to produce completely irrational outputs, since they were not trained to handle them. Worse, systems designed purely for decision-making will not even notice unknown situations but will over-confidently predict control outputs nonetheless, as if they were acting in a known situation. This was shown to be especially problematic for systems based on deep neural networks (DNNs).
Warning for unknown traffic situations
Consequently, a trustworthy decision-making agent should not only produce accurate predictions in situations it understands, but also reliably detect situations it cannot handle and reject them (or hand them over to human users for safe handling). For instance, in a safety-critical application such as autonomous driving, the decision-making agent should be able to raise a safety alert when there are unknown obstacles on the road, and not blindly continue producing control commands.
Therefore, one way to approach this problem is by slightly reformulating it, using the “everything new is dangerous” assumption. This assumption puts everything not encountered during training outside the capabilities of the agent and effectively shifts the question from “which situations can the agent handle?” to “which situations does the agent know?”. In other words, instead of deliberately judging the capabilities of the agent, we expect it to function well exclusively in the training scenarios and simply try to detect novel or unknown scenarios.
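The “everything new is dangerous” idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the method used in any particular system: observations are plain numeric tuples, novelty is measured as the distance to the nearest training observation, and the names `NoveltyGuard`, `guarded_act` and `toy_policy` as well as the `"SAFETY_ALERT"` signal are hypothetical.

```python
import math
import random


class NoveltyGuard:
    """Flags observations that lie far away from everything seen in training."""

    def __init__(self, train_obs, quantile=0.99):
        self.train_obs = [tuple(o) for o in train_obs]
        # Calibrate the threshold on the training data itself: how far is
        # each training observation from its nearest *other* observation?
        dists = sorted(
            self._nearest(o, skip=i) for i, o in enumerate(self.train_obs)
        )
        index = min(len(dists) - 1, int(quantile * len(dists)))
        self.threshold = dists[index]

    def _nearest(self, obs, skip=None):
        # Euclidean distance to the closest stored training observation.
        return min(
            math.dist(obs, t)
            for j, t in enumerate(self.train_obs)
            if j != skip
        )

    def is_novel(self, obs):
        return self._nearest(obs) > self.threshold


def guarded_act(policy, guard, obs):
    """Only trust the policy in situations the agent knows."""
    if guard.is_novel(obs):
        return "SAFETY_ALERT"  # e.g. hand over to a human
    return policy(obs)


# Hypothetical training data: 2-D observations inside the unit square.
rng = random.Random(0)
train_obs = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(200)]
guard = NoveltyGuard(train_obs)


def toy_policy(obs):
    # Stand-in for a trained agent.
    return "steer_left" if obs[0] < 0.5 else "steer_right"


known = guarded_act(toy_policy, guard, train_obs[0])
unknown = guarded_act(toy_policy, guard, (5.0, 5.0))
print(known, unknown)
```

Note that the guard never judges whether the policy could actually cope with an unfamiliar observation. It only asks whether something similar was seen during training, which is exactly the shift from “which situations can the agent handle?” to “which situations does the agent know?”.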
Novelty detection (ND) is a well-studied problem in the field of machine learning. It is concerned with understanding and detecting differences between the inputs a system was trained on and the inputs it encounters once deployed. However, existing work mostly focuses on supervised tasks such as image classification or object detection, where inputs can be strictly divided into in-distribution (ID, inside the training distribution) and out-of-distribution (OOD, outside the training distribution), see figure 1 a).
In RL, however, there are no classes or labels. Instead, we must turn to a scenario-based approach and define entire scenarios as ID and OOD. This makes the problem much more challenging. For instance, consider the scenario in figure 1 b), where during training the agent learned how to stack cubes on top of each other. Now, during testing (or deployment), one of the cubes is replaced either by a bigger cube or by a sphere. Both cases are visually different from what the agent saw during training, and therefore, a strict vision-based novelty detection approach would classify both cases as OOD, even though only the latter poses a situation the agent doesn’t know how to handle.
Towards real world RL
Detecting novel situations is therefore just a first step towards RL systems being readily applicable in real-world scenarios. The “everything new is dangerous” assumption is arguably quite limiting, since the agent might still be able to handle novel situations by generalizing its knowledge (as, for instance, with the bigger cube). Therefore, better approaches to estimating an agent’s capabilities in novel situations still need to be found. Ideally, intelligent systems should be robust to irrelevant factors, have the ability to generalize to new situations and reliably raise the alarm in situations they cannot handle. At the Fraunhofer Institute for Cognitive Systems IKS we are actively working on solutions to these problems.
Together with partners from industry and science, the Fraunhofer Institute for Cognitive Systems IKS is working on enabling autonomous intelligent systems to reliably recognize hazardous situations. Machine learning approaches such as reinforcement learning are the focus here.
This project was funded by the Bavarian State Ministry of Economic Affairs, Regional Development and Energy as part of the project Support for the Thematic Development of the Institute for Cognitive Systems.