Failure Modes and Continuous Resilience

Failure Modes and Effects Analysis

The FMEA spreadsheet captures and prioritizes risks based on Severity, Probability and Detectability, each rated on a 1 to 10 scale. A standard model for each follows. The exact values chosen are somewhat arbitrary, and some forms of FMEA use a 1 to 5 scale, but the goal is only a rough mechanism for prioritization, and in practice this is good enough for the purpose.

FMEA Severity
FMEA Probability
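To make the prioritization concrete, here is a minimal sketch of how the three ratings are commonly combined. It assumes the conventional FMEA Risk Priority Number (the product of the three scores) and the usual convention that a higher Detectability score means the failure is harder to detect. The class name, function name, and sample rows are hypothetical and are not taken from the spreadsheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA worksheet (illustrative structure, not the actual spreadsheet)."""
    name: str
    severity: int       # 1 (negligible) to 10 (catastrophic)
    probability: int    # 1 (very unlikely) to 10 (almost certain)
    detectability: int  # assumed convention: 1 (always detected) to 10 (undetectable)

    def __post_init__(self):
        # Reject ratings outside the 1 to 10 scale.
        for field in ("severity", "probability", "detectability"):
            value = getattr(self, field)
            if not 1 <= value <= 10:
                raise ValueError(f"{field} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: the product of the three ratings."""
        return self.severity * self.probability * self.detectability


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical example rows with made-up ratings.
modes = [
    FailureMode("Authentication service unavailable", 8, 3, 2),
    FailureMode("Slow response from dependency", 5, 6, 4),
    FailureMode("Silent data corruption", 9, 2, 9),
]

ranked = prioritize(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

Because the scales are rough, the resulting ranking is best treated as a starting point for discussion rather than a precise measurement.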

Application Layer FMEA

This first example FMEA models the application layer, assuming it implements a web page or network-accessed API. Each step in the access protocol is modelled as a possible failure mode, starting with authentication, then the access itself. This is followed by some common code-related failure modes. Each application team should discuss and prioritize these, and add failure modes of its own. Judgement and discussion are needed to finish filling in all the levels and actions, but some common failure modes have been completed.

Software Stack FMEA

The software stack failure modes start along the same lines, with authentication and a request-response sequence analysis that needs to be repeated for each of the projects, packages and service dependencies. However, the more specific failure modes relate to the control planes for services hosted in cloud regions. In general, a good way to avoid customer-visible issues caused by control plane failure modes is to pre-allocate identity, network, compute and storage/database structures wherever possible. The cost of failure should be weighed against the cost of mitigation.

Infrastructure FMEA

It’s not generally useful to talk about “what to do if an AWS zone or region has an outage” because it depends a lot on what kind of outage and what subset of services might be impacted. Service specific control plane outages are part of the software stack FMEA. If a datacenter building is destroyed by fire or flood, we have a very different kind of failure than a temporary power outage or cooling system failure, and that’s very different to losing connectivity to a building where all the systems are still running, but isolated. In practice, we can expect individual machines to fail randomly with very low probability, groups of similar machines to fail in a correlated way due to bad batches of components and firmware bugs, and extremely rare availability zone scoped events caused by power and cooling failures, bad weather, earthquake, fire and flood.

Operations and Observability

Misleading and confusing monitoring systems cause a lot of failures to be magnified rather than mitigated. While some of the failure modes can be prioritized with an FMEA, these higher-level failures are better modelled using Systems Theoretic Process Analysis (STPA), which also captures the business-level criticality of the application. The service interactions that make up the monitoring system can be examined starting with the same patterns used for the application FMEA, but it’s more interesting to look at the interactions with human operators and derive hazards from the information flows.

Simplified STPA Model

There is a lot more to STPA, but a simplified approach shows how it can be applied to the problem of finding failure modes in high availability systems. One of the models shown in the book is our starting point, showing the controlled process (data plane), the automated controller (control plane), and the human controller (who is looking at dashboards).
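As a rough illustration of how this three-part model can be turned into a hazard checklist, the sketch below pairs each control action with the four generic ways STPA considers a control action to be unsafe. The control action names are hypothetical placeholders chosen for this example, and the guide phrases are paraphrased. It is a starting list for discussion, not a complete STPA analysis.

```python
from itertools import product

# Control actions flowing down the simplified model.
# Each entry is (controller, controlled element, action); the action names
# are hypothetical examples, not taken from the model in the book.
CONTROL_ACTIONS = [
    ("human controller", "automated controller", "update control plane settings"),
    ("automated controller", "controlled process", "route traffic"),
]

# The four generic ways a control action can be unsafe in STPA (paraphrased).
UCA_TYPES = [
    "not provided when needed",
    "provided when it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]


def candidate_ucas(actions):
    """Enumerate candidate unsafe control actions for a team to review."""
    return [
        f"{source} -> {target}: '{action}' {kind}"
        for (source, target, action), kind in product(actions, UCA_TYPES)
    ]


checklist = candidate_ucas(CONTROL_ACTIONS)
for item in checklist:
    print(item)
```

Each generated line is only a prompt: the team still has to decide whether that combination can actually lead to a hazard in their system, and what feedback the human controller would need in order to notice it.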

adrian cockcroft

Work: Technology strategy advisor, Partner at OrionX.net (ex Amazon Sustainability, AWS, Battery Ventures, Netflix, eBay, Sun Microsystems, CCL)