Cognitive systems naturally demand a lot of computational power, especially during training, and benefit from a flexible allocation of resources. Even classic automation systems are starting to adopt cloud technologies in their newer generations, because these technologies promote modular concepts, can help reduce complex dependencies, and enable new features. However, several pitfalls can occur when cloud technologies come into play.
In no particular order and without any claim to completeness, this article lists typical pitfalls based on our experience:
- Vendor lock-in: Cloud vendors provide feature-rich APIs, and building on them can quickly make a later change of cloud platform very costly. A single cloud platform (even with multiple availability zones) can also be a single point of failure. Using platform-independent tools such as Terraform early on can reduce the migration effort, although it can make development slightly more complex.
- Vendor level of service: Selecting the right cloud partner is crucial – do you want to keep the operational support in-house or offload this responsibility? Finding the right balance between managed services (with a high dependency on the selected cloud provider) and merely renting infrastructure services involves much more than comparing costs.
- Runaway costs: Spending on cloud services can quickly get out of hand without properly implemented controls (depending on the cost model of the cloud provider).
- Static system design without context: With cloud technologies, a system can easily be adapted to different situations by changing the deployment on the fly. This works best if the system is designed to adapt to its operational domain. Single services can work independently of the operational domain, i.e., the context the system is currently running in. However, the overall system needs a defined strategy to recognize and react to context changes. Important context dimensions should be identified explicitly. Adaptation can range from a couple of pre-defined modes of operation to a fully dynamic adaptation based on KPIs. For example, the expected variations in the environment parameters of autonomous vehicles are described by a so-called Operational Design Domain (ODD), which may have partitions for different modes of operation.
- ‘Singleton’ services: Legacy software was often written with an (implicit) assumption that only a single instance will ever run at a time. However, requiring a single instance by design rules out any use of scalability or redundancy.
- Hidden monoliths: Tightly weaving services into a holistic control loop effectively creates a monolith that is merely spread across the services. This often hinders scaling an individual service. (Micro)Services should have a clearly defined function (including inputs and outputs) that can be accomplished within an independent (ideally re-entrant) control loop.
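The idea of reacting to context changes with a couple of pre-defined modes of operation, as described in the pitfall on static system design above, could look roughly like the following. This is only a minimal sketch: the KPI names (`latency_ms`, `cpu_load`), the thresholds, and the mode names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Hypothetical pre-defined modes of operation."""
    NORMAL = "normal"
    DEGRADED = "degraded"      # e.g. switch off optional features
    SCALED_OUT = "scaled_out"  # e.g. request additional service replicas


@dataclass(frozen=True)
class Kpis:
    """Explicitly identified context dimensions (assumed for this sketch)."""
    latency_ms: float  # observed response latency
    cpu_load: float    # average CPU utilisation, 0.0 to 1.0


def select_mode(kpis: Kpis,
                max_latency_ms: float = 200.0,
                max_cpu: float = 0.8) -> Mode:
    """Map the observed context to one of the pre-defined modes."""
    if kpis.latency_ms > max_latency_ms:
        # The latency bound is already violated: reduce functionality.
        return Mode.DEGRADED
    if kpis.cpu_load > max_cpu:
        # Latency is still acceptable but load is high: scale out early.
        return Mode.SCALED_OUT
    return Mode.NORMAL


if __name__ == "__main__":
    print(select_mode(Kpis(latency_ms=120.0, cpu_load=0.9)))
```

Keeping the decision in one small, pure function makes the adaptation strategy explicit and testable; a fully dynamic adaptation would replace the fixed thresholds with values derived from monitoring data.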
In the white paper »Flexilient End-to-End Architectures«, Christian Drabek and Anna Kosmalska of Fraunhofer IKS discuss the main challenges of dependable cloud-based cyber-physical systems.
- Neglecting real-time behavior: The absence of hardware dependencies in cloud services can cause physical properties like time to be overlooked. But cloud services are often part of cyber-physical systems, which usually have physical limits on how long a response can be delayed safely or reasonably. For example, connected autonomous systems might benefit from reduced battery drain if they offload computationally intensive tasks to a backend service. Nevertheless, they should not have to stop to wait for a response. When designing a cloud-based system, such time requirements must be considered – both when tailoring the services and when deciding where to deploy them.
- Version hell: Typical symptoms are undocumented dependencies of services on specific versions of other services; the inability to identify the version of a running service or the compatibility of its interfaces; and unplanned upgrades or downgrades of single services for debugging or fixing. Undetected partial incompatibilities can lead to sporadic errors that are hard to identify. In general, a compatibility concept is harder to integrate later.
- Inadequate monitoring: Especially with the use of cloud technologies, several layers of abstraction and fault-tolerance can make noticing errors and load peaks within the system difficult. It is almost impossible to detect or debug them without automated monitoring. On the other hand, too much monitoring can slow down the system unnecessarily or lead to analysis paralysis. Adequate monitoring with automated data aggregation and analysis is required. A lot of data can be collected, but the data is only valuable if it can be leveraged to take actions, e.g., by changing the deployment to adapt to the new KPIs. Of course, the collected data can also be used to evaluate and improve the performance of cognitive systems.
- DevOps cycle: Cloud technologies permit a continuous rollout of updates, through which superior-quality software can be developed quickly and reliably. However, this opportunity also introduces new potential points of failure and requires new approaches to deploying updates and monitoring the system’s behavior – to uncover potential flaws early on and fix them on the fly. A DevOps cycle defines the necessary steps.
- Undocumented design decisions: The architecture of a system comprises not only the components of the system, but also the reasons for choosing and shaping them. If these reasons are not documented, knowledge about them will deteriorate, and it becomes hard to check whether new changes align with existing design decisions. As a result, inconsistencies might only be found later, when they are much harder to fix. Architecture reviews of classic system architectures already found that the documentation of design decisions is frequently neglected. The risk is even greater when a system can evolve rapidly using cloud technologies.
- Security and privacy: Cloud technologies expose services (at least partially) to the public. Typical issues include unsecured endpoints, (fixed) credentials embedded in code, unencrypted communication, missing security patches, (accidental) misconfiguration, hard-to-detect breaches, and non-compliance with privacy regulations.
- Holy grail: Cloud computing and the use of cloud technologies will not solve all challenges faced by modern systems and can introduce their own disadvantages if not used properly. For example, a disaster recovery plan is still necessary, and systems might become more complicated as components are spread across multiple platforms. Ultimately, cloud computing merely provides the option of paying someone else to run hardware somewhere else and to guarantee what is promised in a Service Level Agreement (SLA).
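The real-time pitfall above states that a connected system may offload heavy tasks to a backend but should not have to stop and wait for the response. One way to express this is a call with a deadline and a local fallback. The following is a minimal sketch using Python's standard `concurrent.futures` module; the functions `slow_remote_plan` and `local_plan` are hypothetical stand-ins for a backend call and a cheaper on-board computation, and the timing values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared worker pool for offloaded calls (pool size is an arbitrary choice).
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(remote: Callable[[str], str],
                       local_fallback: Callable[[str], str],
                       arg: str,
                       deadline_s: float) -> str:
    """Use the remote result if it arrives in time, else the local fallback."""
    future = _executor.submit(remote, arg)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return local_fallback(arg)


def slow_remote_plan(task: str) -> str:
    """Hypothetical backend call that takes too long."""
    time.sleep(0.5)
    return f"remote:{task}"


def local_plan(task: str) -> str:
    """Hypothetical cheaper computation that runs on the device itself."""
    return f"local:{task}"


if __name__ == "__main__":
    # The deadline (50 ms) is shorter than the simulated backend delay.
    print(call_with_deadline(slow_remote_plan, local_plan, "route", 0.05))
```

The sketch only handles a missed deadline; a real system would also have to decide how to treat backend errors and what to do with a late response.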