Approaching overload: diagnosis and response to anomalies in complex and automated systems
Web production software systems operate at an unprecedented scale today, requiring extensive automation to develop and maintain services. These systems are designed to adapt regularly to dynamic load so that portions of the network are not overloaded. As their scale and complexity grow, it becomes more difficult to observe, model, and track how they function and malfunction. Anomalies inevitably arise, challenging incident responders or Site Reliability Engineers (SREs) to recognize and understand unusual behaviors as they plan and execute interventions to mitigate or resolve the threat of service outages.
A study of four real cases reveals the interplay between human and machine agents when problems disrupt the system. The analysis of the incidents directly links the cascade of disturbances below the line of representation (e.g., computer interfaces, monitoring tools) with the cognitive work of Site Reliability Engineers. In the tradition of Cognitive Systems Engineering and Resilience Engineering, the Above the Line / Below the Line (ABL) Framework reframes the post-mortem review of the cases. The case study demonstrates specific and general patterns of complication in incident management for complex web operation systems, as well as directions for designing better tooling to support future, resilient work.
Copyright (c) 2019 Marisa Grayson
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.