At a glance
- System: An e-commerce API handling tens of thousands of orders daily.
- Symptom: Auto-scaled servers joined the production fleet running broken, half-deployed code.
- Root cause: A Lambda function with a 3-second timeout, about one second short of what it needed.
- Blast radius: Every automatic scale-out event had a chance to corrupt the new server.
- Fix: One configuration value. Timeout 3s to 30s.
- The twist: Verifying the fix surfaced a second, unrelated bug: a deploy check that could never pass.
Part 1 - The symptom
A deployment that succeeds but leaves nothing behind
The platform runs its API on a fleet of servers behind a load balancer. When traffic rises, an Auto Scaling Group launches additional servers automatically. Each new server is expected to pull the latest application code, wire itself up, and start serving requests. One week, that stopped being reliable.
New servers were joining the fleet, but some of them were empty. The directory that should hold the application release was bare. The current symlink, which is supposed to point at the active release, pointed at nothing. The deployment logs, meanwhile, said everything had succeeded.
Expected state
current -> releases/20260504-1/
releases/20260504-1/
application code ✓
Actual state on broken servers
current -> (dangling)
releases/ - empty
(nothing here)
A deployment that reports success but leaves nothing behind is the worst kind of failure: nothing alarms, nothing pages, and the gap is found only when a customer request lands on the hollow server and returns an error.
Part 2 - The setup
Scaling is a relay race between five services
To understand the bug, you need the shape of the system. Scaling is not a single action. It is a relay race between five services, each handing off to the next:
- ASG: decides capacity is needed, launches a server
- EventBridge: detects the launch event, invokes the Lambda
- Lambda: fetches a CI token, triggers a GitHub Actions workflow
- GitHub Action: the triggered workflow runs the deployment
- New server: deployed, then registered with the load balancer
The Lambda is the smallest component in this chain. Its entire job: receive an ID, fetch a token, make one HTTPS call. It is the kind of function you write once and never look at again.
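A function of that shape can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the repository, workflow file, ref, and the `instance_id` workflow input are hypothetical, and the token lookup is injected as a parameter because the article does not say where the token is stored. The URL follows GitHub's workflow dispatch REST endpoint.

```python
import json
import urllib.request

# Hypothetical values: the article does not name the repository, workflow, or ref.
GITHUB_REPO = "example-org/example-api"
WORKFLOW_FILE = "deploy.yml"
GIT_REF = "main"


def build_dispatch_request(instance_id, token):
    """Build the single HTTPS call that asks GitHub Actions to run the deploy workflow."""
    url = (
        f"https://api.github.com/repos/{GITHUB_REPO}"
        f"/actions/workflows/{WORKFLOW_FILE}/dispatches"
    )
    body = json.dumps({"ref": GIT_REF, "inputs": {"instance_id": instance_id}})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


def handler(event, context, fetch_token, send=urllib.request.urlopen):
    """Receive an instance ID, fetch a token, make one HTTPS call."""
    # Auto Scaling launch events carry the instance ID under detail.EC2InstanceId.
    instance_id = event["detail"]["EC2InstanceId"]
    token = fetch_token()  # injected so the sketch does not assume a secrets store
    with send(build_dispatch_request(instance_id, token), timeout=10) as response:
        return {"status": response.status, "instance_id": instance_id}
```

Note the order of operations: the side effect (the dispatch call) happens before the function returns. Everything that follows in this case turns on that gap.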
Part 3 - The investigation
The same request ID appears twice
The first instinct was to blame the deployment script. It was the wrong lead. The breakthrough came from the Lambda's logs. Every invocation writes a REPORT line when it finishes. Scanning a few hours of history surfaced something that should be impossible:
11:16:45 START RequestId: 9829b926-...
11:16:48 REPORT RequestId: 9829b926-... Duration: 3000 ms Status: timeout
11:17:55 START RequestId: 9829b926-...
11:17:58 REPORT RequestId: 9829b926-... Duration: 3000 ms
The same request ID appears twice. The same EventBridge event ID appears twice. The same new-server ID appears in both.
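The scan itself is easy to automate. The sketch below assumes the log has already been exported as plain text lines in the format shown above; it counts START lines per request ID and reports any ID that started more than once.

```python
import re
from collections import Counter

START_LINE = re.compile(r"START RequestId: (\S+)")


def repeated_request_ids(log_lines):
    """Return request IDs that have more than one START line, with their counts."""
    counts = Counter(
        match.group(1)
        for line in log_lines
        if (match := START_LINE.search(line))
    )
    return {request_id: n for request_id, n in counts.items() if n > 1}
```

An empty result means every invocation ran once. Any entry is a retried invocation worth inspecting.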
One scale-out event, processed twice
When a Lambda invoked asynchronously fails - and a timeout is a failure - the Lambda service automatically retries the same event, reusing the same request ID. By default the first retry comes about a minute later, which matches the gap in the log. But the first invocation was not a clean failure. It was killed mid-flight, and in those 3 seconds it had already done the one thing that mattered: it had told the CI system to start a deployment. It just had not lived long enough to return a success response. So the retry did not redo a failed deployment. It started a second one.
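The mechanism can be reproduced in miniature. The following is a toy simulation, not AWS code: `KilledByTimeout`, `make_handler`, and `invoke_async` are illustrative names, and the retry count of two is an assumption that mirrors the default for asynchronous invocations.

```python
class KilledByTimeout(Exception):
    """Stands in for the runtime killing the function at its timeout."""


def make_handler(deployments, slow_invocations):
    """Build a handler that triggers a deployment, then may be killed before returning."""
    remaining_slow = [slow_invocations]

    def handler(event):
        deployments.append(event["instance_id"])  # side effect: CI is told to deploy
        if remaining_slow[0] > 0:
            remaining_slow[0] -= 1
            raise KilledByTimeout()  # killed before it can report success
        return "ok"

    return handler


def invoke_async(handler, event, max_retries=2):
    """Retry on failure, the way an asynchronous invoker would."""
    for _ in range(1 + max_retries):
        try:
            return handler(event)
        except KilledByTimeout:
            continue  # the invoker cannot tell that the side effect already happened
    return None
```

With one slow first invocation, `invoke_async` returns "ok" and the `deployments` list holds the same instance ID twice: one event, two deployments.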
Part 4 - The root cause
Two deployments racing on one server
Two deployments, both real, both aimed at the same new server, running roughly a minute apart. The deployment script was never designed for that. Like most deploy scripts, it assumes it is the only one running. Two copies, racing on the same machine, interleave their steps.
Figure: Deploy A (first) and Deploy B (retry) running side by side on one server.
Deploy B's cleanup deletes old releases - including the directory Deploy A is still living in. The files vanish under a running process.
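The article does not list the deploy script's steps, so the following is only one hypothetical interleaving that would end in the state observed on the broken servers. It assumes a three-step deploy (create the release, delete all other releases, repoint `current`) and uses made-up release names, run against a temporary directory.

```python
import os
import shutil
import tempfile


def start_release(root, name):
    """Step 1 of a deploy: create the release directory and put code in it."""
    path = os.path.join(root, "releases", name)
    os.makedirs(path)
    with open(os.path.join(path, "app.txt"), "w") as f:
        f.write("application code\n")
    return path


def cleanup_other_releases(root, keep):
    """Step 2: delete every release except our own (assumes no other deploy is running)."""
    releases = os.path.join(root, "releases")
    for name in os.listdir(releases):
        if name != keep:
            shutil.rmtree(os.path.join(releases, name))


def point_current_at(root, name):
    """Step 3: repoint the `current` symlink at our release."""
    link = os.path.join(root, "current")
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(os.path.join("releases", name), link)


def simulate_race():
    """Interleave two deploys step by step and return the resulting directory."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "releases"))
    start_release(root, "deploy-a")            # A: step 1
    start_release(root, "deploy-b")            # B: step 1
    cleanup_other_releases(root, "deploy-b")   # B: step 2 deletes A's directory
    cleanup_other_releases(root, "deploy-a")   # A: step 2 deletes B's directory
    point_current_at(root, "deploy-b")         # B: step 3, target already gone
    point_current_at(root, "deploy-a")         # A: step 3, target already gone
    return root
```

Each step is correct in isolation, and each deploy would report success. After `simulate_race()`, `releases/` is empty and `current` is a symlink to a path that no longer exists.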
Now, why did the first invocation time out at all? The Lambda's timeout was 3 seconds. Whoever set it almost certainly measured a warm invocation: a cached token, one fast HTTPS call, done well under a second. But a Lambda that has not run recently starts cold - it must initialize its runtime first.
Warm invocation - fits: work ~0.8s, well under the 3s limit
Cold invocation - doomed: init + work ~4s, killed at the 3s mark
Cold starts are common precisely when scaling happens - scale-out events arrive in bursts after quiet periods. The timeout was not an edge case. It was the normal case for this function's actual job.
Part 5 - The fix
One value: 3 seconds to 30 seconds
The fix was one value. The Lambda's timeout was raised from 3 seconds to 30. That is not a tuned number - the real worst case is about 4 seconds; 30 is a deliberate, generous ceiling. You are billed for actual execution time, not the limit, so a generous timeout costs nothing extra when the function finishes normally. A too-tight one cost a corrupted server. The asymmetry says: leave room.
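The article does not say how the value was changed (console, infrastructure-as-code, or API). As one possible route, the sketch below uses the `update_function_configuration` call exposed by a boto3 Lambda client. The client is passed in rather than created here, and any function name supplied by the caller is hypothetical.

```python
def raise_timeout(lambda_client, function_name, timeout_seconds=30):
    """Set the function's timeout and return the value the service reports back."""
    response = lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
    )
    return response["Timeout"]
```

A call might look like `raise_timeout(boto3.client("lambda"), "scale-out-deploy-trigger")`, where the function name is a placeholder. If the function is managed by an infrastructure-as-code tool, the change belongs in that tool's configuration instead, so the next apply does not revert it.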
Verified empirically on the next scale-out
- Two Lambda invocations fired - two servers launched
- Each had a distinct request ID - no retry, no duplication
- Each ran about 4 seconds and completed cleanly, no timeout marker
- Each deployment targeted its own server
One real event, one deployment, one healthy server. The relay ran clean.
Part 6 - The twist
A deploy guard that could never pass
While watching deployments to confirm the timeout fix, a second failure appeared, unrelated and in some ways more interesting. Deployments were now failing at a verification step. After the deploy script rebuilt the application's configuration cache, a guard checked that the cache was valid. The guard was written like this:
php artisan tinker --execute='exit(empty(config("app.key")) ? 1 : 0);' \
|| { echo "ERROR: config build failed"; exit 1; }
The intent is sound: run a line of code, exit 0 if the key is present, exit 1 if not. The problem: the tool running that line, an interactive REPL in --execute mode, does not treat exit() as a normal exit. It intercepts it as an internal control exception and terminates with a non-zero exit code regardless of the value passed. So exit(0) and exit(1) produced the same failing code. The guard could never pass.
How correct information was destroyed
This bug had the opposite shape of the first. The timeout was a real failure that reported success. The guard was a real success that reported failure. Both are dangerous: the false negative quietly corrupts, the false positive erodes trust until people start ignoring the alarm. The fix was to stop using the REPL for a yes/no check and read the built configuration directly, with an interpreter call whose exit code is honored.
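The replacement guard is not shown in the article. The sketch below is one way it could look, wrapped in Python so it can be tested. It assumes a Laravel-style application whose cached configuration is a PHP file at `bootstrap/cache/config.php` that returns an array; that path and the helper names are assumptions. It also includes a small self-test that would have caught the original bug: a guard is only usable if it can report both outcomes.

```python
import subprocess
import sys

# Assumption: a Laravel-style app whose cached config is a PHP file returning an array.
CONFIG_CACHE = "bootstrap/cache/config.php"


def config_guard_command(cache_path=CONFIG_CACHE):
    """Build a plain `php -r` call; its exit() value becomes the process exit code."""
    php = f'$c = require "{cache_path}"; exit(empty($c["app"]["key"]) ? 1 : 0);'
    return ["php", "-r", php]


def guard_passes(command):
    """A guard passes only when the command exits 0."""
    return subprocess.run(command).returncode == 0


def guard_can_pass_and_fail(passing_command, failing_command):
    """Self-test: a usable guard must be able to report both outcomes."""
    return guard_passes(passing_command) and not guard_passes(failing_command)
```

In a deploy pipeline, `guard_passes(config_guard_command())` would be the check itself. Running `guard_can_pass_and_fail` once against a known-good and a known-bad configuration confirms that the exit code actually reaches the caller.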
What this case is really about
Three lessons, each larger than its bug
Retries turn near-misses into duplicates
Automatic retry assumes failures are clean - that a failed attempt did nothing. When an operation is killed mid-flight, after a side effect but before its acknowledgement, the retry does not redo the failure. It doubles the side effect. Any operation that can be retried must be safe to run twice.
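One common way to make a trigger safe to run twice is to deduplicate on the event ID. The article's fix was the timeout alone, so the sketch below is an illustration of the lesson rather than something the platform is known to have deployed. The `seen` set stands in for a shared store with an atomic "insert if absent" operation, which this in-memory version does not provide.

```python
def trigger_once(event_id, seen, trigger):
    """Run `trigger` at most once per event ID.

    `seen` is a plain set here; a real deployment would need a shared store
    with an atomic insert-if-absent, which this sketch does not implement.
    """
    if event_id in seen:
        return "skipped"
    seen.add(event_id)  # claim the event before performing the side effect
    trigger()
    return "triggered"
```

Claiming the event before the side effect gives at-most-once behavior: a retry after a mid-flight kill is skipped, but a kill between the claim and the trigger would mean no deployment at all. Claiming after the side effect flips the trade-off back toward duplicates, so the ordering is a deliberate choice.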
Measure the worst case, not the convenient one
The 3-second timeout was set against a warm, fast invocation - a true measurement of an unrepresentative case. The function's real job, running cold in bursts right when scaling happens, was never measured. A limit is a claim about the worst case. Set it against the worst case.
A check is only as good as the mechanism that carries its result
The deploy guard asked the right question and computed the right answer, then handed it to a tool that threw it away. A guard that cannot fail is noise; a guard that cannot pass is worse - it trains people to ignore it.
None of these required deep expertise to fix. The timeout was one number; the guard was one line. What they required was noticing. The expensive part of these bugs was never the fix - it was the silence before anyone looked.
Services applied