March 17, 2020
A resilient system continues to operate successfully in the presence of failures. There are many possible failure modes, and each exercises a different aspect of resilience. The system needs to maintain a safety margin that is capable of absorbing failure via defense in depth, and failure modes need to be prioritized to take care of the most likely and highest impact risks. In addition to the common financial calculation of risk as the product of probability and severity, engineering risk includes detectability. Failing silently represents a much bigger risk than when the same failure is clearly and promptly reported as an incident. Hence, one way to reduce risk is to make systems more observable. Another problem is that a design control, intended to mitigate a failure mode, may not work as intended. Infrequent failures exercise poorly tested capabilities that tend to amplify problems in unexpected ways rather than mitigate them, so it’s important to carefully exercise the system to ensure that design controls are well tested and operating correctly. Staff should be familiar with recovery processes and the behavior of the system when it’s working hard to mitigate failures. A learning organization, disaster recovery testing, game days, and chaos engineering tools are all important components of a resilient system.
It’s also important to consider capacity overload, where more work arrives than the system can handle, and security vulnerabilities, where a system is attacked and compromised.
The opening paragraph above is the same as my previous discussion focused on hardware, software and operational failure modes, but we are now in the middle of a pandemic, so I’m going to adapt the discussion of hazards and failure modes to our current situation.
There are many possible failure modes, and since they aren't all independent, there can be a combinatorial explosion of permutations, as well as large-scale epidemic failures to consider. While it's not possible to build a perfect system, here are two good techniques that can focus attention on the biggest risks and minimize impact on successful operations.
The first technique is the most generally useful: concentrate on rapid detection and response. In the end, when you've done everything you can to manage the failures you can think of, this is all you have left when a virus that no one has ever seen before shows up. Figure out how much delay is built into your observability system. Try to measure your mean time to respond (MTTR) for incidents. If your system is mitigating a small initial problem, but it's getting worse, and your team responds and prevents a larger incident from happening, then you can record a negative MTTR, based on your estimate of how much longer it would have taken for the problem to consume all the mitigation margin. It's important to find a way to record "meltdown prevented" incidents, and learn from them; otherwise you will eventually drift into failure [Book: Sidney Dekker — Drift into Failure]. Systems that have an identifiable capacity trend have a "time to live" (TTL) that can be calculated. Sorting by TTL identifies the systems that need attention first and can help focus work during a rapid response to a problem.
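The TTL idea is simple enough to sketch. Here's a minimal illustration, assuming a linear capacity trend; the system names and numbers are hypothetical, chosen only to show the sort-by-TTL triage:

```python
def time_to_live(capacity, usage, growth_per_day):
    """Days until usage exhausts capacity, assuming a linear trend.
    Returns None if usage is flat or shrinking (no exhaustion predicted)."""
    if growth_per_day <= 0:
        return None
    return (capacity - usage) / growth_per_day

# Hypothetical systems: (name, capacity, current usage, daily growth)
systems = [
    ("icu-beds", 100, 60, 8.0),
    ("db-disk", 2000, 1500, 10.0),
    ("queue-depth", 10000, 2000, 50.0),
]

# Sort ascending by TTL: the system that will run out first needs attention first.
ranked = sorted(systems, key=lambda s: time_to_live(s[1], s[2], s[3]))
for name, cap, use, growth in ranked:
    print(f"{name}: {time_to_live(cap, use, growth):.1f} days left")
```

In practice the trend would come from a monitoring system rather than hard-coded numbers, and a non-linear (e.g. exponential) growth model changes the math but not the triage principle.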
The second technique starts with the system constraints that need to be satisfied to maintain safe and successful operation, and works in a top-down manner using System Theoretic Process Analysis (STPA), which is built on the underlying System-Theoretic Accident Model and Processes (STAMP) [Book: Engineering a Safer World by Nancy G. Leveson]. STPA is based on a functional control diagram of the system, and the safety constraints and requirements for each component in the design. A common control pattern is divided into three layers: the business function itself, the control system that manages that business function, and the human operators who watch over the control system. The focus is on understanding the connections between components and how they are affected by failures. In essence, in a "boxes and wires" diagram most people focus on specifying the boxes and their failure modes, and are less precise about the information flowing between the boxes. With STPA there is more focus on the wires: what control information flows across them, and what happens if those flows are affected. There are two main steps. First, identify the potential for inadequate control of the system that could lead to a hazardous state, resulting from inadequate control or enforcement of the safety constraints. These could occur if a control action required for safety is not provided or followed; an unsafe control action is provided; a potentially safe control action is provided too early, too late, or in the wrong sequence; or a control action required for safety is stopped too soon or applied for too long. For the second step, each potentially hazardous control action is examined to see how it could occur. Evaluate controls and mitigation mechanisms, looking for conflicts and coordination problems. Consider how controls could degrade over time, including change management, performance audits, and how incident reviews could surface anomalies and problems with the system design.
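Step one is essentially a systematic cross-product: every control action in the diagram is checked against the four ways it can be hazardous. A small sketch of that worksheet generation, with hypothetical pandemic control actions as the example:

```python
# The four ways a control action can be hazardous, per STPA step one.
HAZARD_TYPES = [
    "not provided when required for safety",
    "provided when it is unsafe",
    "provided too early, too late, or out of sequence",
    "stopped too soon or applied too long",
]

def hazardous_control_actions(control_actions):
    """Cross each control action with the four STPA hazard categories,
    producing the worksheet rows that step two then examines for causes."""
    return [(action, hazard)
            for action in control_actions
            for hazard in HAZARD_TYPES]

# Hypothetical control actions for a pandemic control structure.
actions = ["order mass testing", "restrict travel", "fund hospital capacity"]
worksheet = hazardous_control_actions(actions)
print(len(worksheet))  # 3 actions x 4 hazard types = 12 rows to examine
```

The value isn't in the trivial code; it's that the enumeration forces you to consider hazard categories (too late, too long, out of sequence) that ad hoc failure brainstorming tends to miss.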
The criticality and potential cost of each failure mode is context dependent, and drives the available time and budget for prioritized mitigation plans. The entire resiliency plan needs to be dynamic, and to incorporate learnings from each incident, whether or not the failure has noticeable customer impact.
Applying this concept to a pandemic, the system we are controlling is the spread of infection in the human population, and the capacity of the healthcare system to triage the people who get sick. The control system attempts to detect the spread, applies rules to societies to limit human interactions that communicate disease, and allocates resources to provide capacity to the healthcare system. The human operator layer is the government and politicians who allocate controls and resources.
STPA Model for COVID-19
One of the models shown in the book is our starting point, showing the controlled process (data plane), the automated controller (control plane), and the human controller (who is looking at metrics to decide if the system is working or needs intervention).
If we change this model for the pandemic situation, the government is the human controller, their rules, laws and funding priorities are the automated controller, and the spread of the virus through the population and its treatment are the controlled process.
The hazards in this situation are that the government could do something that makes it worse instead of better. They could do nothing, because they hope the problem will go away on its own. They could freak out at the first sign of a virus and take drastic actions before they are needed. They could take actions too late, after the virus has been spreading for a while and is harder to control. They could do things in the wrong order, like developing a custom test for the virus rather than using the one that's already available. They could take a minimal action, not enough to stop the spread of the virus, and assume it's fixed. They could spend too long deciding what to do. They could get into internal arguments about what to do, or multiple authorities could make different or incompatible changes at once. The run-book of what to do is likely to be out of date (see studies of the Spanish Flu) and contain incorrect information about how to respond to the problem in the current environment.
Each of the information flows in the control system should be examined to see what hazards could occur. In the monitoring flows, the typical hazards are a little different to the control flows. In this case, the sensor that reports infection counts could stop reporting, and get stuck on the last value seen (like the CDC report every weekend). It could report zero infections, even though people are still being infected. The data could be corrupted by political interference and report an arbitrary value. Readings could be delayed by different amounts so they are seen out of order. The update rate could be set too high, so that people can't keep up with the latest news. Updates could be delayed so that the monitoring system is showing out-of-date status, and the effects of control actions aren't seen soon enough. This often leads to over-correction and oscillation in the system, which is one example of a coordination problem. Sensor readings may degrade over time, especially between pandemics, when there is little attention being paid to the problem.
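Several of these monitoring-flow hazards are mechanically checkable. A minimal sketch, with hypothetical thresholds, of a monitor that flags a stuck counter, a stale feed, and a suspicious zero:

```python
def check_sensor(readings, now, max_age_s=86400, stuck_window=3):
    """Flag monitoring-flow hazards on a stream of (timestamp, value)
    readings, oldest first. Thresholds are illustrative assumptions."""
    issues = []
    if not readings:
        return ["no data"]
    last_ts, last_val = readings[-1]
    if now - last_ts > max_age_s:
        issues.append("stale: last update too old")
    recent = [v for _, v in readings[-stuck_window:]]
    if len(recent) == stuck_window and len(set(recent)) == 1:
        issues.append("stuck: value unchanged")  # e.g. a frozen weekend count
    if last_val == 0 and any(v > 0 for _, v in readings[:-1]):
        issues.append("suspicious zero after non-zero history")
    return issues

now = 1_000_000
# Three identical recent readings trip the "stuck" check.
print(check_sensor([(now - 300, 7), (now - 200, 7), (now - 100, 7)], now))
```

Checks like these don't catch deliberate corruption, but they turn the silent failure modes (stuck, stale, zeroed) into reported incidents, which is exactly the detectability argument made at the start of this post.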
The STPA three level control structure provides a good framework for asking questions about the system. Is the model of the controlled process looking at the right metrics and behaving safely? What is the time constant and damping factor for the control algorithm, will it oscillate, ring or take too long to respond to inputs? How is the government expected to develop their own models of the controlled process and the automation, and understand what to expect when they make control inputs? How is the user experience designed so that the government is notified quickly and accurately with enough information to respond correctly, but without too much data to wade through or too many false alarms?
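The oscillation question can be made concrete with a toy simulation. This sketch (a simple proportional controller with an artificial reporting delay; the gain and delay values are illustrative assumptions) shows why acting on out-of-date readings causes over-correction:

```python
def simulate(delay, gain=0.8, target=0.0, steps=40):
    """Proportional control of a level toward a target, where the
    controller only sees readings that are `delay` steps old.
    Returns the trajectory of the controlled level."""
    level = 10.0
    history = [level] * (delay + 1)
    trajectory = []
    for _ in range(steps):
        observed = history[-(delay + 1)]     # controller sees a stale reading
        level += gain * (target - observed)  # correction based on old data
        history.append(level)
        trajectory.append(level)
    return trajectory

no_delay = simulate(delay=0)
delayed = simulate(delay=4)
# With no delay the level converges smoothly to the target; with a
# four-step delay the same gain overshoots past zero and oscillates.
print(no_delay[-1], min(delayed), max(delayed))
```

The same gain that converges cleanly with fresh data becomes unstable with stale data, which is the control-theory version of a government tightening and loosening restrictions based on week-old infection counts.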
What Happens Next?
I think this discussion provides a high level model for understanding what is happening now. The essential component of a control system is a low latency and dependable way to measure the thing we are trying to control. This corresponds to the WHO guidelines, and the relatively successful policy of mass testing shown in South Korea in particular. It also shows why the UK and USA response of limited testing means that the pandemic is literally “out of control” in those countries. The UK’s short-lived policy of Herd Immunity was based on a bad model of the controlled process, where they hadn’t taken into account the expected death rate in the short term, and the lack of capacity in the healthcare system. Until the South Korean approach of mass testing is implemented globally, we won’t be able to control COVID-19.
We can expect pandemics to recur every few years, and this one is bad enough to set up some long-term changes in the system that should provide resilience to new viruses and to recurrence of existing ones. One way to operate a global economy in the presence of viral pandemics is to have testing be a continuous part of everyone's life and a gate on movement of people, even when there isn't a pandemic. So in order to get on a scheduled airline flight, or possibly even to attend a large public event, you would have to take a test to show you aren't carrying any of the known viruses that are bad enough to kill people. That could include the flu, but might not include the common cold. The cost per test at huge volumes can be driven down to a very low level over time.
The blanket application of shelter-in-place rules, such as those in Santa Clara County (where I'm writing this), is currently affecting millions of people, with only a few hundred confirmed cases. If everyone were tested regularly, then we could find the actual people who should be isolated, and the rest of us could be confident that we aren't spreading the virus and get on with our lives. The social and financial costs of the shutdown are big enough that the ongoing blanket testing alternative may end up looking like a good deal.