
Three Things We Got Wrong Before Getting LLM Failover Right

A wrong hypothesis, a broken fix, and a 30-second canary that would have saved 14 minutes.

Steve Harlow · April 25, 2026 · 9 min read

ReportBridge's SQL conversion pipeline routes AI-Fix calls through Anthropic with an automatic Bedrock failover. The wrapper looks straightforward: try Anthropic; if it throws, log a warning and dispatch the same prompt to Bedrock. A pilot last week recorded zero failover events despite three confirmed Anthropic 503 storms during the run. The wrapper was deployed and the IAM was correct. The failover just was not firing.
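In sketch form, the wrapper's shape is roughly this. The function and client names here are stand-ins, not ReportBridge's actual code; the point is the catch-and-redispatch structure:

```javascript
// Minimal sketch of a try-Anthropic, catch, try-Bedrock wrapper.
// callAnthropic and callBedrock are hypothetical stand-in clients.
async function completeWithFailover(prompt, { callAnthropic, callBedrock, log = console.warn }) {
  try {
    return await callAnthropic(prompt);
  } catch (err) {
    // This catch only fires if the Anthropic call actually throws.
    // A hung socket never reaches this line, which is the bug this post chases.
    log(`[bedrock-failover] Anthropic failed (${err.message}); trying Bedrock`);
    return await callBedrock(prompt);
  }
}
```

Everything that follows is about why that catch block never executed.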

Three iterations were attempted before the path actually closed. The first iteration was based on a wrong hypothesis. The second iteration broke production. The third one worked, but only partway — and surfaced a fourth gap that the first two iterations had been masking. This is what happened, in the order it happened, so you can avoid the same loop.

For the conversion-side context: a separate post on triaging the last 18 SQL conversion failures covers what the AI-Fix retries that this failover protects are actually doing — and the methodology for deciding which residuals are worth another retry round at all.

Iteration 1: the wrong hypothesis

The first theory was natural and wrong. The Lambda had a 5-attempt retry ladder on Anthropic 503s with exponential backoff totaling about 107 seconds, plus an outer wrapper that could compound the delays. Theory: that ladder blew past the API Gateway 30-second integration timeout, so the gateway returned 503 to the client while the Lambda was still in retry-backoff. By the time the Lambda gave up and the failover wrapper's catch fired, the client had been served a 503 already.

The fix matched the theory: shrink the retry ladder. Two attempts, backoffs of 2 and 5 seconds, plus a flag that prevented the outer wrapper from compounding the delays. The new total was about 17 seconds — leaving 13 seconds inside the gateway window for Bedrock to dispatch.
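A sketch of the shrunk ladder, assuming a simple delays-between-attempts shape (the post gives the 2- and 5-second figures; the exact structure of the production ladder is an assumption):

```javascript
// Retry with fixed backoff delays between attempts. With backoffsMs of
// [2000, 5000] the pure backoff is 7s; budgeting per-attempt call time
// on top is what yields the roughly 17-second total described above.
async function retryWithBackoff(fn, backoffsMs = [2000, 5000]) {
  let lastErr;
  for (let attempt = 0; attempt <= backoffsMs.length; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < backoffsMs.length) {
        await new Promise((res) => setTimeout(res, backoffsMs[attempt]));
      }
    }
  }
  throw lastErr;
}
```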

Verification pilot: zero [bedrock-failover] events. Same pilot output, same datasets lost to the 503s, same 35-90 second invocations with no internal log lines visible in CloudWatch. The retry-ladder change was clean and shipped, but the hypothesis was wrong: the slow path was not the retry ladder.

Iteration 2: the right diagnosis, the wrong tool

Adding a one-line entry-log before the Lambda's action router produced the missing signal. CloudWatch now showed every invocation paired with its action name and its REPORT duration. The slow class was anthropic-proxy at 35-90 seconds, with only START / END / REPORT lines and nothing in between.
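The entry-log itself is one line before the switch. A sketch against a generic action router, with handler and field names that are illustrative rather than ReportBridge's actual code:

```javascript
// Stub handlers standing in for the real conversion and proxy paths.
async function handleAnthropicProxy(event) { return { ok: true, action: 'anthropic-proxy' }; }
async function handleConvertSql(event) { return { ok: true, action: 'ws-convert-sql' }; }

const handler = async (event) => {
  // One line before the router: every invocation is now correlatable to a
  // handler name in CloudWatch, even if it later hangs without logging.
  console.log(`[router] action=${event.action} requestId=${event.requestId || '-'}`);
  switch (event.action) {
    case 'anthropic-proxy': return handleAnthropicProxy(event);
    case 'ws-convert-sql': return handleConvertSql(event);
    default: throw new Error(`unknown action: ${event.action}`);
  }
};
```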

That "nothing in between" was the diagnostic. The failover wrapper's catch logs a warning the moment Anthropic throws. If the warning was missing, the throw was missing too — meaning the call to Anthropic was not throwing. It was hanging.

Root cause located: the underlying https.request helper had no per-call timeout. When Anthropic accepted the connection but never responded, Node.js kept the socket open and the await never resolved. The Lambda ran for 35-90 seconds before its outer 5-minute timeout would eventually have ended it; the gateway returned 503 at second 30 and the Lambda kept running, in silence.

The fix matched the diagnosis: add a timeoutMs option to the helper, default to no timeout (every other caller unaffected), have the Anthropic call pass 12000 ms. Math: 12-second timeout on a single attempt, no retry, fails fast, wrapper catch fires by second 12, Bedrock has 18 seconds inside the 30-second gateway window. Clean.

Pilot: pass rate fell from 76.5% to 5.9%. Sixteen of seventeen datasets dropped to "AI conversion pending" — every ws-convert-sql call timed out at 12 seconds. The 16K-token initial conversion legitimately takes 15-25 seconds. The 12-second timeout was correct math for the failover scenario and wrong math for normal traffic. Production sat at 5.9% for about 15 minutes between the bad deploy and the rollback.

Iteration 3: the safety net, with a footnote

The right timeout for the safety-net role is 90 seconds. Long enough that any normal AI completion finishes well inside, short enough that infinite hangs are caught well before the 5-minute Lambda runtime cutoff. Pilot on the 90-second build restored the 76.5% baseline and produced the first ever [bedrock-failover] event in CloudWatch:

[bedrock-failover] Anthropic failed (request timeout after 90000ms
to api.anthropic.com/v1/messages); trying Bedrock
[bedrock-failover] Bedrock also failed (The provided model identifier
is invalid.)

Two findings out of one log excerpt. First: the failover path now works end-to-end — the timeout fires, the wrapper catches, the Bedrock attempt is dispatched and logged. The whole chain is observable. Second: Bedrock got further than it ever had, and then surfaced a new problem one layer deeper. The model ID map in the Bedrock client did not match what was actually enabled in the AWS account. Iteration 3 fixed the path and exposed the next bug.

That kind of layered hiding is normal in LLM resilience work. Each fix uncovers what the previous bug was masking. The honest statement of where iteration 3 ends is "safety net works, failover path observable end-to-end, model ID map is the next queued fix, and synchronous failover within the 30-second gateway window is an architectural problem that the current code cannot solve without an async-poll pattern."

The 30-second canary lesson

The cheapest part of this whole story is the lesson about iteration 2. A single fast call against a known-good payload — one small RDL through ws-convert-sql — would have caught the 12-second-timeout regression in seconds. Instead the regression went straight to a 14-minute pilot, which dutifully ran 17 datasets at 5.9% effective and reported the disaster at the end.

The new rule, written down so the next iteration honors it: when changing Lambda timeout or retry defaults, run a single canary call exercising the changed code path before launching any multi-minute pilot. A canary costs about 30 seconds; a failed pilot costs 14 minutes plus a rollback deploy plus a verification re-pilot. The math is unambiguous.
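The rule can even be mechanized as a gate in front of the pilot script. A sketch, where `runCanary` is whatever issues the one known-good call (one small RDL through ws-convert-sql) and the threshold is illustrative:

```javascript
// Run one fast known-good call; only launch the multi-minute pilot if it
// passes. runCanary and maxMs are assumptions, not ReportBridge's tooling.
async function canaryGate(runCanary, { maxMs = 30000 } = {}) {
  const start = Date.now();
  try {
    await runCanary();
  } catch (err) {
    return { pass: false, reason: `canary failed: ${err.message}` };
  }
  const elapsed = Date.now() - start;
  if (elapsed > maxMs) return { pass: false, reason: `canary too slow: ${elapsed}ms` };
  return { pass: true, reason: `canary ok in ${elapsed}ms` };
}

// if (!(await canaryGate(runCanary)).pass) abort before the 14-minute pilot
```

A gate like this would have turned the 12-second-timeout regression into a 30-second failure instead of a 14-minute one.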

What you can take to your own LLM resilience work

  • If your failover wrapper is logging zero events, the call is hanging, not throwing. Look for missing timeouts in your HTTP helper before you tune the retry ladder. The retry ladder is upstream of the actual bug.
  • Per-call timeouts beat global timeouts. A single timeout value applied to all HTTP calls in a Lambda forces you to choose between "short enough for failover" and "long enough for normal traffic." Per-call lets the AI dispatch path use a different value than the auth token refresh path.
  • Synchronous LLM-backed APIs behind API Gateway are an anti-pattern past 30 seconds. If your prompts regularly produce responses in that range, the answer is async-poll, not a more clever timeout.
  • Always entry-log the action router. One log line before the switch statement turns a black-box Lambda into one where every invocation is correlatable to a handler in CloudWatch. This is the cheapest infrastructure observability tool that exists.
  • Canary before pilot. One fast call against a known-good payload, every time. Production is a bad place to find out the timeout was wrong.

FAQ

Why does API Gateway have a 30-second integration timeout?

AWS API Gateway HTTP APIs enforce a hard 30-second integration timeout that is not configurable. The REST API variant defaults to 29 seconds; AWS has more recently allowed raising that limit for Regional REST APIs via a service quota request, but the design assumption is the same either way. API Gateway is built to be a synchronous front door for many backends; letting individual slow backends hold connections for minutes would tie up gateway capacity. For long-running synchronous calls (like a 16K-token LLM completion), the architectural guidance is to convert the route to an async-poll pattern: the client posts a job, gets a job ID, and polls a separate endpoint for status. The synchronous pattern is incompatible with calls that legitimately exceed the gateway window.

Why didn't a per-call timeout immediately fix it?

Two reasons. First, the timeout has to be tuned to the workload, not to the failover scenario. A timeout short enough to fit Bedrock failover inside the 30-second gateway window (about 12 seconds) breaks every call that legitimately takes longer — and AI calls on 16K-token responses routinely take 15-25 seconds. Second, even with a correct timeout, the gateway window itself is the binding constraint. The 12-second timeout in our case dropped pass rate from 76.5% to 5.9% in one pilot. The right safety-net timeout (90 seconds) catches infinite hangs but is too long for transparent client-side failover. Solving both problems requires moving to async poll, which is a bigger change than a one-line timeout.

What is a canary call and why is it cheaper than a full pilot?

A canary is a single, short-running call against a known-good payload that exercises the changed code path. For an AI conversion Lambda, the canary is one small RDL conversion that should complete in 5-15 seconds. Total cost: about 30 seconds. A full pilot runs 9 RDLs through three rounds of conversion, AI-Fix, and EXPLAIN against the workspace database — about 14 minutes. The canary catches catastrophic regressions (like the 12-second-timeout-breaks-everything one) immediately. The full pilot is necessary for measuring effect on real workloads. The mistake was running the full pilot without a canary first; adding the canary changes the cost of failed iterations from 14 minutes to 30 seconds.

What's left after this post — what's the right architecture for synchronous LLM-backed APIs behind API Gateway?

Three approaches, ordered from least to most invasive. First, the async-poll pattern: split each LLM-backed route into a job-create endpoint that returns a job ID immediately and a status endpoint that returns the result when ready. This decouples the client wait from the gateway window. Second, EventBridge or SQS-based dispatch: push the job onto a queue and let a worker handle it — clients receive a job ID via WebSocket or polling. Third, direct invocation of Lambda from the client (skipping API Gateway entirely): reasonable for internal tools where the gateway's auth and rate-limit features can be replaced. The right choice depends on existing infrastructure and how much of the client surface needs to change. For a migration tool with a small known client surface, async-poll is the smallest delta.
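For a sense of how small the async-poll delta is, here is a minimal sketch with an in-memory map standing in for a real job store (DynamoDB or similar). Endpoint shapes and field names are assumptions, not ReportBridge's API:

```javascript
// In-memory job store; a real deployment would persist this.
const jobs = new Map();
let nextId = 0;

// POST /convert: returns a job ID immediately, well inside the 30s window.
// The slow LLM call runs outside the request/response cycle.
function createJob(payload, runLlmCall) {
  const jobId = `job-${++nextId}`;
  jobs.set(jobId, { status: 'pending' });
  runLlmCall(payload)
    .then((result) => jobs.set(jobId, { status: 'done', result }))
    .catch((err) => jobs.set(jobId, { status: 'failed', error: err.message }));
  return { jobId };
}

// GET /convert/{jobId}: cheap status poll, never blocked by the LLM.
function getJob(jobId) {
  return jobs.get(jobId) || { status: 'unknown' };
}
```

The client wait is now decoupled from the gateway window: a 90-second completion and a 12-second completion look identical to the gateway, and the 90-second safety-net timeout stays purely a hang detector.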

ReportBridge runs an AI-augmented SQL conversion pipeline with explicit failover, timeouts, and observability. The pieces described in this post are running in production today, with the architectural follow-ups noted in the open queue.

Try ReportBridge for your migration