Skip to main content

Case Study - Incident Analysis

The 3-second timeout that silently corrupted production deployments

A single AWS Lambda configuration value, about one second short of what it needed, let auto-scaled servers join production running broken, half-deployed code.

Lambda timeout vs real runtime

Real cold run~4s
Old limit3s ✗ killed
New limit30s ✓ headroom

One value: 3s -> 30s

Sector

E-commerce API, tens of thousands of orders daily

Footprint

Auto-scaled server fleet behind a load balancer

Cloud

AWS Lambda, EventBridge, Auto Scaling

Engagement

Incident investigation & root-cause fix

At a glance

System
An e-commerce API handling tens of thousands of orders daily.
Symptom
Auto-scaled servers joined the production fleet running broken, half-deployed code.
Root cause
A Lambda function with a 3-second timeout, about one second short of what it needed.
Blast radius
Every automatic scale-out event had a chance to corrupt the new server.
Fix
One configuration value. Timeout 3s to 30s.
The twist
Verifying the fix surfaced a second, unrelated bug: a deploy check that could never pass.

Part 1 - The symptom

A deployment that succeeds but leaves nothing behind

The platform runs its API on a fleet of servers behind a load balancer. When traffic rises, an Auto Scaling Group launches additional servers automatically. Each new server is expected to pull the latest application code, wire itself up, and start serving requests. One week, that stopped being reliable.

New servers were joining the fleet, but some of them were empty. The directory that should hold the application release was bare. The current symlink, which is supposed to point at the active release, pointed at nothing. The deployment logs, meanwhile, said everything had succeeded.

Expected state

current -> releases/20260504-1/

releases/20260504-1/

  application code ✓

Actual state on broken servers

current - - > dangling

releases/ - empty

  (nothing here)

A deployment that reports success but leaves nothing behind is the worst kind of failure: nothing alarms, nothing pages, and the gap is found only when a customer request lands on the hollow server and returns an error.

Part 2 - The setup

Scaling is a relay race between five services

To understand the bug, you need the shape of the system. Scaling is not a single action. It is a relay race between five services, each handing off to the next:

ASG

decides capacity is needed, launches a server

EventBridge

detects the launch event, invokes the Lambda

Lambda

fetches a CI token, triggers a GitHub Actions workflow

GitHub Action

the triggered workflow runs the deployment

New server

deployed, then registered with the load balancer

The Lambda is the smallest component in this chain. Its entire job: receive an ID, fetch a token, make one HTTPS call. It is the kind of function you write once and never look at again.

Part 3 - The investigation

The same request ID appears twice

The first instinct was to blame the deployment script. It was the wrong lead. The breakthrough came from the Lambda's logs. Every invocation writes a REPORT line when it finishes. Scanning a few hours of history surfaced something that should be impossible:

11:16:45  START   RequestId: 9829b926-...
11:16:48  REPORT  RequestId: 9829b926-...  Duration: 3000 ms  Status: timeout
11:17:55  START   RequestId: 9829b926-...
11:17:58  REPORT  RequestId: 9829b926-...  Duration: 3000 ms

The same request ID appears twice. The same EventBridge event ID appears twice. The same new-server ID appears in both.

One scale-out event, processed twice

11:16:45First invocation starts (cold)
11:16:48Killed at exactly 3000 ms - timeout
11:17:55EventBridge retries - same event
11:17:58Second invocation succeeds

When a Lambda invoked asynchronously fails - and a timeout is a failure - the event source automatically retries it. But the first invocation was not a clean failure. It was killed mid-flight, and in those 3 seconds it had already done the one thing that mattered: it had told the CI system to start a deployment. It just had not lived long enough to return a success response. So the retry did not redo a failed deployment. It started a second one.

Part 4 - The root cause

Two deployments racing on one server

Two deployments, both real, both aimed at the same new server, running roughly a minute apart. The deployment script was never designed for that. Like most deploy scripts, it assumes it is the only one running. Two copies, racing on the same machine, interleave their steps.

Two deploys racing on one server

Deploy A (first)

upload release files
cleanup

Deploy B (retry)

upload release files
cleanup

Deploy B's cleanup deletes old releases - including the directory Deploy A is still living in. The files vanish under a running process.

Now, why did the first invocation time out at all? The Lambda's timeout was 3 seconds. Whoever set it almost certainly measured a warm invocation: a cached token, one fast HTTPS call, done well under a second. But a Lambda that has not run recently starts cold - it must initialize its runtime first.

Warm invocation - fits

work ~0.8s, well under the 3s limit

Cold invocation - doomed

init + work ~4s, killed at the 3s mark

Cold starts are common precisely when scaling happens - scale-out events arrive in bursts after quiet periods. The timeout was not an edge case. It was the normalcase for this function's actual job.

Part 5 - The fix

One value: 3 seconds to 30 seconds

The fix was one value. The Lambda's timeout was raised from 3 seconds to 30. That is not a tuned number - the real worst case is about 4 seconds; 30 is a deliberate, generous ceiling. You are billed for actual execution time, not the limit, so an over-generous timeout costs nothing. A too-tight one cost a corrupted server. The asymmetry says: leave room.

Verified empirically on the next scale-out

  • Two Lambda invocations fired - two servers launched
  • Each had a distinct request ID - no retry, no duplication
  • Each ran about 4 seconds and completed cleanly, no timeout marker
  • Each deployment targeted its own server

One real event, one deployment, one healthy server. The relay ran clean.

Part 6 - The twist

A deploy guard that could never pass

While watching deployments to confirm the timeout fix, a second failure appeared, unrelated and in some ways more interesting. Deployments were now failing at a verification step. After the deploy script rebuilt the application's configuration cache, a guard checked that the cache was valid. The guard was written like this:

php artisan tinker --execute='exit(empty(config("app.key")) ? 1 : 0);' \
  || { echo "ERROR: config build failed"; exit 1; }

The intent is sound: run a line of code, exit 0 if the key is present, exit 1 if not. The problem: the tool running that line, an interactive REPL in --execute mode, does not treat exit() as a normal exit. It intercepts it as an internal control exception and terminates with a non-zero exit code regardless of the value passed. So exit(0) and exit(1) produced the same failing code. The guard could never pass.

How correct information was destroyed

config cache built correctly
guard: is app.key present?
run the check via REPL --execute
REPL catches exit() as an exception, discards the code
exits non-zero - always
guard reads non-zero, reports FAILED
deploy aborted

This bug had the opposite shape of the first. The timeout was a real failure that reported success. The guard was a real success that reported failure. Both are dangerous: the false negative quietly corrupts, the false positive erodes trust until people start ignoring the alarm. The fix was to stop using the REPL for a yes/no check and read the built configuration directly, with an interpreter call whose exit code is honored.

What this case is really about

Three lessons, each larger than its bug

Retries turn near-misses into duplicates

Automatic retry assumes failures are clean - that a failed attempt did nothing. When an operation is killed mid-flight, after a side effect but before its acknowledgement, the retry does not redo the failure. It doubles the side effect. Any operation that can be retried must be safe to run twice.

Measure the worst case, not the convenient one

The 3-second timeout was set against a warm, fast invocation - a true measurement of an unrepresentative case. The function's real job, running cold in bursts right when scaling happens, was never measured. A limit is a claim about the worst case. Set it against the worst case.

A check is only as good as the mechanism that carries its result

The deploy guard asked the right question and computed the right answer, then handed it to a tool that threw it away. A guard that cannot fail is noise; a guard that cannot pass is worse - it trains people to ignore it.

None of these required deep expertise to fix. The timeout was one number; the guard was one line. What they required was noticing. The expensive part of these bugs was never the fix - it was the silence before anyone looked.

Services applied

What the engagement covered

Production incident investigation
AWS Lambda & EventBridge analysis
Deployment pipeline review
Root-cause diagnosis
CI/CD verification fix

Related service

Monitoring & Observability

More case studies

Facing something similar?

A short conversation is usually enough to tell whether it is worth a deeper look. No commitment to start.