Resilience has become a popular term in recent years. In the context of software dependability, it is often used as a fancy new synonym for old concepts. In fact, however, resilience is much more than traditional dependability and has been the subject of more than 15 years of research. It is therefore important to understand what resilience actually is, and that it is much more than just a new buzzword.
Nevertheless, it is very welcome that this valuable concept is moving a bit more into the spotlight. While it is primarily the story of safe artificial intelligence (AI) that is currently being told as the narrative on the success of cognitive systems such as driverless cars, it is the story of resilience that actually matters. Why and how resilience is becoming the essential ingredient for safe cognitive systems is the topic of this new series "The story of resilience" on the blog of Fraunhofer IKS.
The concept of resilience
While the term resilience has become increasingly popular in recent years, its roots go back as far as the 15th century. After its first use as a juridical term, it was defined by Francis Bacon as a physical property, namely that a body can return to its original state after being affected by a force.
In the 1970s, the term was adopted, among many other disciplines, in the field of biological ecosystems. If, for example, an ant colony is relocated to another habitat, the colony will not aim for stability by retaining its previous structures and behavior patterns, which would no longer work (well enough) in the new context. Instead, it adapts its structures and behaviors to fit the new context in order to survive and thrive. In general, the primary goal of a biological (eco)system is not to stick rigidly to structures or behaviors, but to survive and thrive. Any biological system therefore continuously adapts its structures and behaviors so that it can thrive under the continuously changing characteristics of its context. That is what we as human beings constantly (must) do. And that is what the ant colony does. Biological systems adapt their structure and behavior to meet their superordinate goals in changing environments – that is what makes them resilient.
Unpredictable context of systems
This idea of resilience also influenced the first definition of the term in the context of software dependability by Jean-Claude Laprie: "The persistence of dependability when facing changes." Importantly, for Laprie this also includes changes that were not foreseen or that are even unforeseeable. This means that the concept of resilience accepts the fact that we are simply not able to predict the context of our systems and that we are thus not able to build static systems that will be dependable in any given context. Instead, resilience means that systems need to continuously adapt (either in a self-adaptive way or in a DevOps-like approach) to achieve their overarching goals in changing, uncertain contexts. Considering the open-world challenge in the context of driverless cars, for example, this sounds very familiar and the importance of resilience for such systems becomes clear.
Looking at these thoughts more generally, resilient software systems are characterized by three key properties:
- They can respond to a (strong) change in their operational context,
- by adapting themselves to that context,
- in such a way that they maintain or optimize their essential properties.
Considering these characteristics also helps us to understand how resilience differs from related terms. Reliability, to start with, tends to focus on internal problems such as faulty components or programming faults. Resilience, in contrast, focuses on the external context. Even though changes in the context of a system will often be detected within the system (for example, because a machine learning component detects a high level of uncertainty), the cause of the problem lies in the context.
Thus, resilience is more closely related to robustness, i.e. “the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions”. But resilience is much more than robustness. Resilience refers to changes in the context for which it is not enough to do the same things just a little more rigorously, or to build the same system just a little harder and more robustly. It refers to changes in the context that require the system to adapt because its current structures and behavior would no longer achieve its goals under the new boundary conditions – just like the ant colony needs to adapt its behaviors in order to survive in the new habitat.
Thus, resilience means that it does not make sense to adhere strictly to a statically fixed specification of structure and behavior. Instead, the focus shifts to preserving important properties and optimizing the system’s goals. Behavior and structure become flexible and can be adapted to achieve the overarching goals while important properties are preserved. In principle, what these goals and properties look like can vary widely. In our specific field, the key property that our systems must preserve is safety. But we do not sell systems because they are safe; we sell them because they safely provide competitive functionality. Hence, we try to optimize the system’s utility while preserving safety in uncertain, unknown, and changing contexts. This leads to our definition of resilience: "Optimizing utility whilst preserving safety in uncertain contexts".
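One way to read this definition is as a constrained selection: safety acts as a hard constraint, and utility is optimized only among the options that satisfy it. The following Python sketch illustrates that reading and nothing more. The `Context` fields, the configuration names, the utility values, and the thresholds are all hypothetical and chosen purely for illustration; they are not taken from any real system or from the approach described in this series.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Context:
    """Hypothetical snapshot of the operational context."""
    visibility_m: float            # estimated visibility in metres
    perception_uncertainty: float  # 0.0 (certain) .. 1.0 (unknown)


@dataclass(frozen=True)
class Configuration:
    """A system configuration with an assumed utility and safety envelope."""
    name: str
    utility: float
    is_safe_in: Callable[[Context], bool]


def select_configuration(
    configurations: Sequence[Configuration], context: Context
) -> Optional[Configuration]:
    """Return the highest-utility configuration that is safe in `context`.

    Safety is a hard constraint; utility is compared only among the
    configurations that satisfy it. Returns None if nothing is safe.
    """
    safe = [c for c in configurations if c.is_safe_in(context)]
    return max(safe, key=lambda c: c.utility, default=None)


# Illustrative configurations (assumed values, not from the article).
CONFIGURATIONS = [
    Configuration(
        "full_speed", 1.0,
        lambda ctx: ctx.visibility_m >= 150 and ctx.perception_uncertainty <= 0.2,
    ),
    Configuration(
        "reduced_speed", 0.6,
        lambda ctx: ctx.visibility_m >= 50 and ctx.perception_uncertainty <= 0.5,
    ),
    # Fallback that is assumed to be safe in every context.
    Configuration("minimal_risk_stop", 0.0, lambda ctx: True),
]

clear_day = Context(visibility_m=300, perception_uncertainty=0.1)
dense_fog = Context(visibility_m=30, perception_uncertainty=0.7)

print(select_configuration(CONFIGURATIONS, clear_day).name)  # full_speed
print(select_configuration(CONFIGURATIONS, dense_fog).name)  # minimal_risk_stop
```

The point of the sketch is the ordering of concerns: a configuration with higher utility never wins if its safety predicate does not hold in the current context.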
The importance of resilience for cognitive systems
A cyber-physical system always forms a unit with the environment with which it interacts. It will therefore only be able to function safely and reliably if the developers predict the environment correctly, understand it correctly, specify it correctly, and develop the system correctly for it. Considering the aforementioned dynamics, however, it is simply impossible to adequately and completely predict, understand, and specify the context. Nor is it possible to develop a single static solution that fits all possible operational situations. Moreover, it is often easier to assure specific solutions for specific operational situations than to put everything into one single, highly complex general solution. Therefore, for the vision of systems such as driverless cars to become a reality, it is essential to enable the systems to adapt to their context in order to maximize their utility while ensuring safety. Thus, we have no choice but to build resilient systems.
The resilience challenge
As soon as we enable systems to become aware of their own state, goals, and context in order to be able to adapt themselves to new contexts, we pay for this with enormous complicatedness (in the next part of this series it will become clear why I don’t use the term complexity here).
In most cases, however, introducing resilience architectures and resilience engineering approaches does not create complexity, but only makes the already existing complexity visible. This is because, in today’s systems, adaptation mechanisms are already implemented implicitly as part of the functionality or "trained" into neural networks. As a result of such implicit adaptation, adaptation behavior becomes a hidden part of the system that is impossible to control. Quality problems are then inevitable, and safe solutions are impossible.
At the same time, it does not really make sense to want the system to adapt flexibly on the one hand and to cast its adaptations into hard-coded "if-then-else" rules on the other. But given the current challenges of assuring the safety of machine learning (ML), using ML instead of hard-coded rules to realize adaptation would be rather counterproductive. Furthermore, resilience requires a semantic understanding of the system, its goals, and its context – something that (at least today's) neural networks are not capable of.
That leaves us with the question of how to solve the resilience challenge. Even though software engineering is living in the shadow of AI these days, it already provides a promising toolbox for developing resilient systems, and self-adaptive systems are certainly an essential ingredient. For more than two decades, engineering methodologies and architectures for self-adaptive systems have been developed that form a promising basis for engineering resilient systems. If these approaches are combined with the concept of Dynamic Safety Management, the result is a stable foundation on which the indispensable step towards resilience can be taken and the resulting complexity can be mastered.
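To make the idea of explicit rather than implicit adaptation a little more tangible, here is a minimal Python sketch of a feedback loop in the monitor-analyze-plan-execute style (often called MAPE-K in the self-adaptive systems literature). The article itself does not name this reference model, so treating it as the structure is an assumption. The mode names, the uncertainty thresholds, and the `Knowledge` structure are likewise hypothetical, and the sketch is not an implementation of Dynamic Safety Management.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Knowledge:
    """Explicit, inspectable adaptation model (hypothetical)."""
    # Maximum context uncertainty for which each mode is assumed to be assured.
    assured_up_to: Dict[str, float]
    # Modes ordered from highest to lowest utility; the last one is the fallback.
    preference: List[str]
    active_mode: str
    # Every adaptation is recorded as (uncertainty, old_mode, new_mode).
    log: List[Tuple[float, str, str]] = field(default_factory=list)


def monitor(raw_reading: float) -> float:
    """Monitor: clamp a raw uncertainty reading into [0, 1]."""
    return min(1.0, max(0.0, raw_reading))


def analyze(k: Knowledge, uncertainty: float) -> List[str]:
    """Analyze: which modes are assured for the observed uncertainty?"""
    return [m for m in k.preference if uncertainty <= k.assured_up_to[m]]


def plan(k: Knowledge, assured: List[str]) -> str:
    """Plan: pick the most preferred assured mode, else the fallback mode."""
    return assured[0] if assured else k.preference[-1]


def execute(k: Knowledge, uncertainty: float, target: str) -> None:
    """Execute: switch modes if needed and record the adaptation."""
    if target != k.active_mode:
        k.log.append((uncertainty, k.active_mode, target))
        k.active_mode = target


def adaptation_step(k: Knowledge, raw_reading: float) -> str:
    """Run one pass of the loop and return the mode that is active afterwards."""
    uncertainty = monitor(raw_reading)
    execute(k, uncertainty, plan(k, analyze(k, uncertainty)))
    return k.active_mode


knowledge = Knowledge(
    assured_up_to={"autonomous": 0.2, "degraded": 0.5, "safe_stop": 1.0},
    preference=["autonomous", "degraded", "safe_stop"],
    active_mode="autonomous",
)
modes = [adaptation_step(knowledge, r) for r in (0.1, 0.4, 0.9, 0.15)]
print(modes)          # ['autonomous', 'degraded', 'safe_stop', 'autonomous']
print(knowledge.log)  # three recorded mode switches
```

Because the adaptation model lives in `Knowledge` as plain data and every switch is logged, the adaptation behavior can be reviewed and reasoned about separately from the functionality, which is the contrast to adaptation that is hidden inside the functional code or a trained network.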
Resilience thus offers the potential, or in my view is even essential, for developing systems that must operate in highly complex, dynamic, uncertain environments. Accordingly, it is not surprising that many software companies already incorporate the foundation of resilience, namely adaptivity, as an implicit part of the functionality of their applications. However, decades of experience from other areas have shown that this approach is doomed to failure: apart from the insufficient flexibility of such an implicit approach, quality problems are inevitable and safety assurance is hardly possible.
On the other hand, using the concepts that we know from self-adaptive systems and the like, we can master the complexity of resilient systems and hence turn the vision of cognitive systems into reality. To this end, we need to think beyond safe AI alone. For building resilient systems, we must holistically consider AI, engineering methodologies, and appropriate architectures. We call this “Safe Intelligence”.
References
S. Gößling-Reisemann, H. D. Hellige, and P. Thier, The Resilience Concept: from its historical roots to theoretical framework for critical infrastructure design, vol. 217. Bremen: Universität Bremen, Forschungszentrum Nachhaltigkeit (artec), 2018.
C. S. Holling, "Resilience and stability of ecological systems", Annu. Rev. Ecol. Syst., vol. 4, no. 1, pp. 1–23, 1973.
J.-C. Laprie, "From dependability to resilience", in 38th IEEE/IFIP Int. Conf. on Dependable Systems and Networks, 2008, pp. G8–G9.
International Organization for Standardization, International Electrotechnical Commission, and Institute of Electrical and Electronics Engineers, ISO/IEC/IEEE 24765:2017(E): ISO/IEC/IEEE International Standard - Systems and software engineering - Vocabulary. IEEE.
D. Weyns, An introduction to self-adaptive systems: A contemporary software engineering perspective. Wiley, 2020. [Online]. Available: https://books.google.de/books?...
M. Trapp, D. Schneider, and G. Weiss, "Towards safety-awareness and dynamic safety management", in 2018 14th European Dependable Computing Conference (EDCC), 2018, pp. 107–111.
M. Trapp and G. Weiss, "Towards Dynamic Safety Management for Autonomous Systems", in Safety-Critical Systems Symposium: Engineering Safe Autonomy, 2019, vol. 27, pp. 193–205.
Read the second part here: The Story of Resilience, Part 2 – The Complexity Challenge