You don’t know your system is resilient until you’ve broken it on purpose. I believed our payment processing service was fault tolerant. We ran multi-AZ. We had health checks. We had auto scaling. We had all the boxes ticked on the Well-Architected review. Then us-east-1b had a networking event on a Tuesday afternoon, and we watched a service that was supposed to gracefully fail over instead fall flat on its face. The load balancer kept routing to unhealthy targets for nearly four minutes because our health check intervals were too generous. The database failover triggered but the application’s connection pool held stale connections for another two minutes after that. Six minutes of degraded service for a payment processor. That’s the kind of thing that gets you a phone call from someone whose title starts with “Chief.”
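The second half of that outage, a pool handing out connections to a primary that no longer exists, is easy to reproduce in miniature. The following is an added example, not code from the author's service or from any real database driver: a constructed toy database whose "failover" invalidates every existing connection, a naive pool that trusts whatever is idle, and a pool that validates on checkout, retires connections past a maximum lifetime, and drains itself after a failed query. The small `health_check_detection_window` helper at the end is an approximation of how long a load balancer can keep routing to a dead target given its probe settings; the interval-times-threshold model is an assumption, not a description of any specific product.

```python
"""Toy simulation of a connection pool across a database failover.
Constructed example: FakeDatabase and the pools are models, not a real driver."""
import time
from collections import deque
from dataclasses import dataclass


class StaleConnectionError(Exception):
    """Raised when a query is sent down a connection to the old primary."""


class FakeDatabase:
    """A failover bumps `generation`; connections opened against an older
    generation keep looking open but fail on first use (the stale symptom)."""
    def __init__(self):
        self.generation = 0

    def connect(self):
        return FakeConnection(self, self.generation, time.monotonic())

    def failover(self):
        self.generation += 1


@dataclass
class FakeConnection:
    db: "FakeDatabase"
    generation: int
    opened_at: float

    def ping(self) -> bool:
        # Cheap liveness probe, the equivalent of SELECT 1 against the server.
        return self.generation == self.db.generation

    def execute(self, sql: str) -> str:
        if not self.ping():
            raise StaleConnectionError("connection points at the old primary")
        return f"ok:{sql}"


class NaivePool:
    """The failure mode: idle connections are reused without any check."""
    def __init__(self, db: FakeDatabase, size: int = 4):
        self.db = db
        self.idle = deque(db.connect() for _ in range(size))

    def acquire(self) -> FakeConnection:
        return self.idle.popleft() if self.idle else self.db.connect()

    def release(self, conn: FakeConnection) -> None:
        self.idle.append(conn)

    def invalidate_all(self) -> None:
        pass  # a naive pool keeps its stale connections after an error


class ValidatingPool(NaivePool):
    """Validate on checkout, cap connection age, drain the pool on failure."""
    def __init__(self, db: FakeDatabase, size: int = 4, max_lifetime_s: float = 300.0):
        super().__init__(db, size)
        self.max_lifetime_s = max_lifetime_s

    def acquire(self) -> FakeConnection:
        while self.idle:
            conn = self.idle.popleft()
            too_old = time.monotonic() - conn.opened_at > self.max_lifetime_s
            if conn.ping() and not too_old:
                return conn
            # stale or aged out: drop it and try the next one
        return self.db.connect()

    def release(self, conn: FakeConnection) -> None:
        if conn.ping():  # never return a broken connection to the pool
            self.idle.append(conn)

    def invalidate_all(self) -> None:
        self.idle.clear()


def queries_after_failover(pool_cls, n: int = 8):
    """Warm a pool, fail the database over, run n queries.
    Returns (successes, failures)."""
    db = FakeDatabase()
    pool = pool_cls(db)
    for _ in range(4):  # warm the pool so every idle slot predates the failover
        c = pool.acquire()
        c.execute("select 1")
        pool.release(c)
    db.failover()
    ok = failed = 0
    for i in range(n):
        conn = pool.acquire()
        try:
            conn.execute(f"insert payment {i}")
        except StaleConnectionError:
            failed += 1
            pool.invalidate_all()  # assume the whole pool is suspect
            continue  # the broken connection is not released
        ok += 1
        pool.release(conn)
    return ok, failed


def health_check_detection_window(interval_s: float, timeout_s: float,
                                  unhealthy_threshold: int) -> float:
    """Approximate worst-case seconds before a target is marked unhealthy:
    `unhealthy_threshold` consecutive probes spaced `interval_s` apart, plus one
    probe timeout. Assumed model, not a vendor formula."""
    return interval_s * unhealthy_threshold + timeout_s


if __name__ == "__main__":
    print("naive pool      (ok, failed):", queries_after_failover(NaivePool))
    print("validating pool (ok, failed):", queries_after_failover(ValidatingPool))
    print("detection window (30s/5s/x3):", health_check_detection_window(30, 5, 3))
```

The naive pool fails one request per pooled connection before it has shed them all; the validating pool fails none, because the probe on checkout discards every pre-failover connection. Real pools expose the same knobs: SQLAlchemy's `pool_pre_ping=True` and `pool_recycle`, HikariCP's `maxLifetime`, and most drivers' keepalive or validation query settings. The health-check arithmetic is the check to run before trusting a target group: a thirty-second interval with a threshold of three is already over a minute and a half of traffic sent to a dead target.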