Explainer · May 25, 2026 · 5 min read

Why Moving to the Cloud Didn't Eliminate Downtime — It Just Changed What Breaks

Cloud infrastructure promised reliability, but shifted failure modes from hardware to orchestration, dependencies, and configuration cascades.

Everyone sold cloud as the solution to downtime. Buy a reliable provider, distribute your workload, add redundancy, and watch uptime approach 100%. Yet companies with millions in cloud bills still experience outages. The difference is they're no longer failing because a hard drive died in a data center they owned. They're failing because a configuration change cascaded across availability zones, or because their DNS provider had a bad deploy, or because a third-party API they depended on went down and they had no fallback. The infrastructure got more reliable. The failure surface got much, much larger.

You Didn't Buy Reliability—You Bought Complexity

Cloud providers can offer impressive uptime SLAs because they control the hardware and the facilities it runs in. What they don't control, and what you now largely own, is the orchestration layer: Kubernetes, service meshes, auto-scaling policies, load balancer configurations, secrets management, container registries. Each of these components is a potential failure point. A misconfigured health check can cause a cascading restart loop. A bad deployment to your mesh control plane can black-hole traffic across your entire fleet. These failures are invisible to the cloud provider's monitoring. Their infrastructure is fine. Your software stack is in pieces. The irony: on-premises infrastructure forced you to think about these things upfront. Cloud made them optional until they weren't.

The Dependency Chain Is Your Real Problem

A single AWS outage affects thousands of companies simultaneously. But most of the outages you'll experience aren't AWS going down. They're your dependencies failing in ways you didn't anticipate. Your CDN gets DDoS'd. Your payment processor has a bad deployment. Your observability vendor, the tool you use to detect outages, becomes unavailable. This is the non-obvious fact: cloud services have made it economically viable for companies to offer specialized infrastructure you now depend on. That's progress. But it means your uptime is no longer set by the reliability of your cloud provider. If your system needs every critical dependency to be up, its availability is at best that of the weakest one, and when failures are independent it is closer to the product of all of them, which is lower still. One weak link breaks the chain, and you have almost no visibility into most of these links until they fail.
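To make the arithmetic concrete, here is a minimal sketch of how availability compounds across a serial dependency chain. The dependency names and percentages are made up for illustration and are not taken from any real SLA, and the calculation assumes failures are independent, which real outages often are not.

```python
from math import prod

# Hypothetical availability figures, for illustration only.
dependencies = {
    "cloud_provider": 0.9999,
    "dns": 0.9999,
    "cdn": 0.999,
    "payment_processor": 0.9995,
    "auth_provider": 0.999,
}

HOURS_PER_YEAR = 365 * 24


def composite_availability(availabilities):
    """Availability of a system that needs every dependency up.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


weakest = min(dependencies.values())
combined = composite_availability(dependencies.values())

print(f"weakest link: {weakest:.2%}, "
      f"{downtime_hours_per_year(weakest):.1f} h/year")
print(f"whole chain:  {combined:.2%}, "
      f"{downtime_hours_per_year(combined):.1f} h/year")
```

With these made-up numbers, no single dependency is worse than 99.9% (under 9 hours of downtime a year), yet the chain as a whole lands at roughly 99.73%, or about 23.6 hours a year. Every critical dependency you add pulls that number down further.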

Configuration Drift Is the Silent Killer

Infrastructure-as-code promised deterministic deployments. In practice, most teams have configuration living in multiple places: Terraform state files, Helm values, environment variables in CI/CD systems, secrets in vaults, runtime overrides in dashboards. A change to one isn't propagated to the others. A developer hotfixes something in production without updating the repo. A security team applies a policy that breaks a legacy service. A platform team upgrades a shared component and doesn't notify dependent teams. The system works until someone deploys. The deploy reapplies whatever the repo says, the undocumented hotfix or manual override disappears, and the carefully balanced configuration collapses. Cloud infrastructure is so flexible that drift becomes the default state. You're not running what you think you're running.
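Detecting drift comes down to comparing what version control declares with what is actually running. The sketch below shows that comparison on two plain dictionaries. How you load them (parsing Helm values, reading environment variables, querying a cluster) is left out, and the keys and values are hypothetical.

```python
MISSING = "<missing>"  # sentinel for a key present on only one side


def find_drift(declared, observed):
    """Return {key: (declared_value, observed_value)} for every mismatch.

    `declared` is what version control says should be running;
    `observed` is what the live system reports.
    """
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, MISSING)
        have = observed.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: a production hotfix changed the image tag and
# someone enabled a debug flag from a dashboard.
declared = {"replicas": 3, "image_tag": "1.4.2", "timeout_seconds": 30}
observed = {
    "replicas": 3,
    "image_tag": "1.4.2-hotfix",
    "timeout_seconds": 30,
    "debug": True,
}

for key, (want, have) in find_drift(declared, observed).items():
    print(f"{key}: repo says {want!r}, production has {have!r}")
```

Run on a schedule, a check like this turns drift from something you discover during a deploy into something you see the day it happens.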

How to Actually Reduce Your Downtime Risk

Stop assuming your cloud provider is the most likely failure point. Audit your dependencies: list every external service your system needs to function. For each one, document its SLA and what you'll do if it fails. Build actual fallbacks, not theoretical ones. If your DNS provider fails, can you still route traffic? If your observability platform is down, can you detect that you're down? Treat configuration as immutable: if it's not in version control and automatically deployed from there, it doesn't exist. Wrap calls to external dependencies in circuit breakers, so that a failing service is cut off quickly instead of tying up your own threads and connections. Run chaos engineering exercises focused on dependency failures, not infrastructure failures. Cloud gave you reliability at the infrastructure layer. The rest is your responsibility now.
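For readers who haven't built one, a circuit breaker can be quite small. What follows is a minimal single-threaded sketch, not a production implementation: the class name, thresholds, and `call_with_fallback` helper are all illustrative, and a real service would more likely use an established library and add locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_fallback(breaker, func, fallback_value):
    """Return func() via the breaker, or a fallback if it fails or is open."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback_value
```

The fallback is the part that matters. A breaker that only raises a faster error still leaves you down; pairing it with a cached value, a degraded response, or a secondary provider is what turns a dependency outage into a partial one.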
