Controlling the cognitive costs of coordination during incident response in critical digital services

  • Laura Maguire Department of Integrated Systems Engineering, Ohio State University

Abstract

My research examines coordination during anomaly response (Watts-Perotti & Woods, 2007; Woods & Hollnagel, 2006) within distributed work groups responsible for site reliability of critical digital infrastructure. Resilient incident response in this domain requires a mix of synchronous and asynchronous activity for diagnosing and resolving threats. Anomaly response involves coordination across a joint cognitive system of multiple roles with different experience, responsibilities, expertise and models of how the system functions. Coordination incurs a cognitive cost for practitioners, and prior research (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006) has identified these costs as an important aspect of coordination.
However, little empirical data on these costs is available. Over the last two summers, I have been embedded in the CIO organization within IBM. I had full access to the practitioners and worked side by side with the incident response teams tasked with keeping key services functional. During this time, I observed several examples of how this organization maintains adaptive capacity. I collected data through observations, interviews, case analysis and critical incident debriefing to identify key characteristics of cognitive work in the domain, the activities surrounding anomaly response and the organizational mechanisms that support or constrain practitioner performance.
IT provides an excellent natural laboratory for studying cognitive work (Allspaw, 2015; Grayson, 2018): the systems are highly abstracted and continually changing, and they operate at a scale and speed that generate time-pressured, highly ambiguous issues. In this domain, Woods’ theorem rings particularly true: “As the complexity of the system increases, the accuracy of any single agent’s own model of that system decreases” (Woods, 2017). Therefore, rapidly coordinating agents, who are often not co-located, into ad hoc groups is crucial for bringing the right resources to bear on problems that span inter- and intra-organizational boundaries. Incident coordination takes place predominantly through online chat, so transcripts can aid process tracing and provide insight into the mental models of participants. Interestingly, chat channels can host several hundred participants listening and looking in on responders, which allows others to anticipate when their involvement is needed. However, this benefit can also become a distraction and add coordination costs to response efforts, forcing the core team into private channels. At this point I have generated a partial corpus of cases and initial results to guide further inquiry. The timing of Young Talents (YT) allows me to get advice to plan and carry out my dissertation study in late summer or early fall.

Published
2019-06-17
Section
Young talents in resilience engineering program 2019