My last metastable blog post discussed the interactions between systems and components and how they can lead to metastable failures. Specifically, I looked at interactions between systems/components and how signals can be misinterpreted by different systems due to ambiguity — a timeout may mean a transient fault that can be fixed by retrying, but it may also mean a catastrophic overload where retrying only makes things worse. This leads to an important realization: some common action systems take in response to signals can be good under some assumptions and bad under others. Last month, Todd Porter from Meta, Dave Maier from Portland State, and I gave a talk at SREConn Americas on Metastability in Recovery. This talk discusses how cross-system interactions and ambiguous assumptions can make recovery of system of systems impossible due to cascading amplifications and feedback loops. It is a common fact that failures can spread. A lot of engineering effort is often put into…
No comments yet. Log in to reply on the Fediverse. Comments will appear here.