When Your "Multi-Cloud" Strategy Has a Single Cloud at Its Heart
Railway was offline for roughly eight hours on May 19-20, 2026 because Google Cloud incorrectly suspended their production GCP account. That part of the story is simple. The interesting part, and the one every infrastructure team should sit with for an afternoon, is that Railway also runs on AWS and on their own Railway Metal hardware, and those workloads went down too. Not because anything broke at AWS. Because the network control plane lived on GCP.
If you took anything away from the Vercel April 2026 incident, this is the next chapter of the same lesson: a vendor's bad day becomes your bad day in proportion to where your hard dependencies actually live, not where your workloads happen to run.
The 8-Hour Timeline
Times from Railway's official postmortem, all UTC:
- 22:10 May 19: Automated monitoring detects API health-check failures. On-call paged.
- 22:11 May 19: Dashboard returning 503. Users cannot log in.
- 22:19 May 19: Root cause identified - GCP has suspended Railway's production account.
- 22:22 May 19: P0 ticket filed with Google Cloud, account manager engaged.
- 22:29 May 19: Incident formally declared. GCP account access restored, but compute is stopped and disks are inaccessible.
- 23:09 May 19: First persistent disk back online.
- 01:38 May 20: Edge traffic flowing again, networking restored.
- 02:55 May 20: Dashboard accessible.
- 04:00 May 20: API, dashboard, and OAuth endpoints confirmed operational.
- 06:14 May 20: Moved to monitoring.
- 07:58 May 20: Resolved.
Detection to confirmed restoration: roughly six hours. Full resolution: just under ten. Real user-visible outage: eight.
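The first two figures can be checked directly from the timestamps above. A minimal sketch, using only the times quoted from the postmortem; the helper names are ours:

```python
from datetime import datetime, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp for a given day in May 2026."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


def hours(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


detected = utc(19, "22:10")               # health-check failures detected
confirmed_operational = utc(20, "04:00")  # API, dashboard, OAuth confirmed
resolved = utc(20, "07:58")               # incident marked resolved

detection_to_restoration = hours(detected, confirmed_operational)
detection_to_resolution = hours(detected, resolved)

print(f"{detection_to_restoration:.2f}")  # 5.83
print(f"{detection_to_resolution:.2f}")   # 9.80
```

The eight-hour user-visible figure is Railway's own headline number and cannot be derived from these timestamps alone.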
The Suspension Was a Misfire, at Scale
Per Railway: "Google Cloud placed Railway's production account into a suspended status incorrectly, as part of an automated action." The same automated action hit many GCP accounts across the platform. No proactive notice preceded any of it. The specific policy or signal that triggered the misfire is not stated in the postmortem.
This pattern, cloud-side automated enforcement misfiring on a paying customer with no warning, is not new. It is also not theoretical. It happens often enough that any operator running on a single cloud account should treat "account suspended without warning" as a tier-one risk, not a black-swan edge case.
How a GCP Suspension Took Down AWS and Metal Workloads
This is the engineering meat of the postmortem and the part worth circulating internally.
Railway's edge proxies route traffic to workloads using a network control plane. That control plane API was hosted on machines running in Google Cloud. The edge maintained a cached routing table populated from that control plane.
When GCP suspended the account, the GCP-hosted control plane went dark immediately. The cached routing table kept the mesh forwarding traffic to non-GCP workloads for about an hour. Then the cache expired. The mesh tried to repopulate routes from the control plane. There was no control plane to query.
Railway puts it directly: "Despite the mesh continuing to operate for an hour, when the route cache expired, the mesh failed to re-populate the routing tables." At peak impact, "all Railway workloads across all regions were rendered unreachable." That included workloads on Railway Metal and AWS, even though those environments were operating normally. GitHub then rate-limited Railway's OAuth and webhook integrations on top of that, blocking logins and builds.
So the headline failure was a GCP account suspension. The actual blast radius was determined by the architecture of route discovery. A single hard dependency on a control plane sitting in one cloud account turned a single-vendor incident into a platform-wide outage.
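The failure mode is easier to see in a toy model. The postmortem does not describe how Railway's route cache is implemented, so everything below is a hypothetical sketch: `RouteCache`, `fetch_routes`, the one-hour TTL, and the `serve_stale` switch are our own names and assumptions. Serving the last known good routes is one common mitigation, not the fix Railway announced.

```python
import time
from typing import Callable, Dict, Optional

Routes = Dict[str, str]  # workload name -> backend address


class RouteCache:
    """Hypothetical edge-side cache refreshed from a control plane."""

    def __init__(self, fetch: Callable[[], Routes], ttl_seconds: float,
                 serve_stale: bool,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale
        self._clock = clock
        self._routes: Optional[Routes] = None
        self._fetched_at = 0.0

    def lookup(self, workload: str) -> Optional[str]:
        now = self._clock()
        expired = self._routes is None or now - self._fetched_at >= self._ttl
        if expired:
            try:
                self._routes = self._fetch()
                self._fetched_at = now
            except ConnectionError:
                if not self._serve_stale:
                    # Hard expiry: with no control plane, routes vanish.
                    self._routes = None
                # Otherwise keep the last known good routes.
        if self._routes is None:
            return None
        return self._routes.get(workload)


# Simulated control plane and clock (assumptions for the demo).
state = {"control_plane_up": True, "now": 0.0}


def fetch_routes() -> Routes:
    if not state["control_plane_up"]:
        raise ConnectionError("control plane unreachable")
    return {"api-on-aws": "10.0.0.7"}


def fake_clock() -> float:
    return state["now"]


hard = RouteCache(fetch_routes, 3600, serve_stale=False, clock=fake_clock)
soft = RouteCache(fetch_routes, 3600, serve_stale=True, clock=fake_clock)
hard.lookup("api-on-aws")  # populate both caches at t=0
soft.lookup("api-on-aws")

state["control_plane_up"] = False  # the hosting account is suspended
state["now"] = 1800.0              # 30 minutes in: cache still valid
hard_before_expiry = hard.lookup("api-on-aws")

state["now"] = 3600.0              # TTL reached, refresh fails
hard_after_expiry = hard.lookup("api-on-aws")
soft_after_expiry = soft.lookup("api-on-aws")

print(hard_before_expiry)  # 10.0.0.7
print(hard_after_expiry)   # None
print(soft_after_expiry)   # 10.0.0.7
```

The trade-off in the second variant is that stale routes can point at workloads that have since moved, and this sketch retries the control plane on every lookup once the TTL has passed. A real design would bound staleness and back off.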
The Architecture Lesson Hidden in the Postmortem
Multi-cloud workload placement does not equal multi-cloud resilience. The question that determines real resilience is not "do we run workloads in more than one cloud" but "if the cloud account holding our control plane vanishes for eight hours, what stays up?"
For Railway, the answer turned out to be: very little. The mesh, the dashboard, the API, and OAuth were all anchored on GCP. AWS workloads were technically fine; nothing could route to them.
This is the same class of architectural mistake as putting an HA database with three nodes inside one AWS account, or running a "redundant" pair of services that both reach a single Redis instance, or designing a DR site that authenticates through the primary site's SSO. The redundancy is real at one layer and theatrical at the layer that actually matters under failure.
What Railway Is Changing
The postmortem lists four architectural commitments:
- Remove the hard dependency on the GCP-hosted control plane for mesh operation, described as "making this a true mesh."
- Extend high-availability database shards across AWS and Metal so "should all instances in a particular cloud disappear instantly, database quorum will keep everything running."
- Remove Google Cloud services from the data plane's hot path, keeping them only for secondary or failover roles.
- Build new architectures for both the data plane (connectivity to hosts) and the control plane (dashboard access and management).
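The quorum commitment in that list can be expressed as a simple check. This is a rough sketch that assumes majority quorum (more than half the nodes), as used by Raft-style systems; the postmortem says only "database quorum" and does not name the database or its replication protocol. The node names and placements are invented for illustration.

```python
from collections import Counter
from typing import Dict


def survives_provider_loss(placement: Dict[str, str]) -> Dict[str, bool]:
    """For each provider, report whether a majority quorum remains
    if every node in that provider disappears at once."""
    total = len(placement)
    quorum = total // 2 + 1
    nodes_per_provider = Counter(placement.values())
    return {
        provider: (total - count) >= quorum
        for provider, count in nodes_per_provider.items()
    }


# Hypothetical placements: node name -> provider.
single_cloud = {"db-1": "gcp", "db-2": "gcp", "db-3": "gcp"}
spread = {"db-1": "gcp", "db-2": "aws", "db-3": "metal"}
lopsided = {"db-1": "gcp", "db-2": "gcp", "db-3": "gcp",
            "db-4": "aws", "db-5": "metal"}

print(survives_provider_loss(single_cloud))
# {'gcp': False}
print(survives_provider_loss(spread))
# {'gcp': True, 'aws': True, 'metal': True}
print(survives_provider_loss(lopsided))
# {'gcp': False, 'aws': True, 'metal': True}
```

The third case is the one to watch for: nodes exist in three providers, yet a majority still sits in one of them, so losing that provider loses quorum.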
Railway also took ownership directly: "Railway owns our vendor choices, and we ultimately own this one. Your customers don't care whether the failure was Google or Railway; they see your product. Your uptime is our responsibility, and we'll keep delivering on it."
The postmortem does not state anything about data loss, SLA credits, or customer compensation.
Three Questions To Ask Your Own Architecture Today
Borrow Railway's failure mode and stress-test your own:
- Where does your control plane live? Service discovery, routing tables, secret stores, identity providers, deploy systems, configuration databases. List the cloud account each one runs in. If one of those accounts is locked out for eight hours, walk the failure tree minute by minute. Most teams find at least one nasty surprise.
- Are your HA quorums spread across cloud accounts, not just cloud regions? A three-node etcd cluster in one AWS account does not survive account suspension. Three regions inside one account is still one account.
- Do you have a runbook for "primary cloud account suspended"? It should cover who you call, what credentials are stored outside that cloud, and what the recovery target time looks like. Most teams have runbooks for region outages and zero for account-level failures.
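The first question can be turned into a small audit script. The sketch below is an assumption-laden starting point, not a tool: the inventory is hypothetical, loosely shaped like the failure described above rather than Railway's real topology, and the `Component` and `unavailable_if_lost` names are ours. It lists each component's hosting account and hard dependencies, then computes what goes down when one account disappears.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Component:
    account: str                                         # hosting cloud account
    hard_deps: List[str] = field(default_factory=list)   # must be up to function


def unavailable_if_lost(inventory: Dict[str, Component], account: str) -> Set[str]:
    """Components that stop working if `account` is locked out,
    following hard dependencies transitively."""
    down = {name for name, c in inventory.items() if c.account == account}
    changed = True
    while changed:
        changed = False
        for name, c in inventory.items():
            if name not in down and any(dep in down for dep in c.hard_deps):
                down.add(name)
                changed = True
    return down


# Hypothetical inventory for illustration only.
inventory = {
    "routing-control-plane": Component("gcp-prod"),
    "dashboard": Component("gcp-prod"),
    "edge-proxy": Component("metal", ["routing-control-plane"]),
    "workloads-aws": Component("aws-prod", ["edge-proxy"]),
    "workloads-metal": Component("metal", ["edge-proxy"]),
    "status-page": Component("third-party"),
}

print(sorted(unavailable_if_lost(inventory, "gcp-prod")))
# ['dashboard', 'edge-proxy', 'routing-control-plane',
#  'workloads-aws', 'workloads-metal']
print(sorted(unavailable_if_lost(inventory, "aws-prod")))
# ['workloads-aws']
```

Run against a real inventory, the useful output is the asymmetry: losing one account takes out a single tier, while losing another takes out nearly everything.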
Bottom Line
A cloud will eventually do something automated and wrong to your account. The architectural question is not whether that will happen, but how much of your platform stays online when it does. Railway's eight-hour outage is one of the cleanest worked examples we have seen of why "multi-cloud" needs to mean "multi-cloud control plane," not "multi-cloud spreadsheet."
If you want a focused review of where your own control-plane dependencies cluster, including cross-cloud quorum, secrets, and identity, our architecture planning and disaster recovery and backup teams run exactly that exercise with infrastructure teams. Eight hours of downtime is a steep price to pay for a lesson someone else has already paid for.