
Anatomy of a 94.5% Pass Rate

How to triage the last 18 SQL conversion failures — what to fix, what to defer, and how to know the difference.

Steve Harlow · April 25, 2026 · 8 min read

A full-corpus baseline pilot through ReportBridge's T-SQL to PostgreSQL conversion pipeline landed at 94.5% effective pass — 312 of 330 scorable datasets across nearly 200 reports from a multi-jurisdictional regulatory program. The 18 datasets that failed are the long tail. This is what the triage looked like.

The point of the triage is not to push the number to 100%. It is to decide, for each remaining failure shape, whether the cost of fixing it earns a real return — and to write down the reasoning either way so the next person who looks at the residuals does not have to re-derive it.

Related reading: the recursive-CTE bug post covers three of the deterministic scrubs that earlier moved this same pipeline from 84.0% to 89.6%, and the LLM-failover post-mortem covers the resilience layer that catches the calls these residuals would otherwise stall on.

The five classes

The eighteen failures sorted into five classes once grouped by error shape:

  • Hallucinated columns (12). column X does not exist — the AI referenced a column that was not in the schema block fed to the prompt. Examples: ec.keydatetime, g.type, mu.dmvmake.
  • UNION type mismatch (7). UNION types integer and character varying cannot be matched — PostgreSQL is strict about UNION arms agreeing on type; T-SQL is more permissive, so the source query never had to be explicit.
  • Generic syntax (5). syntax error at or near "(" and similar — token-budget or identifier-preservation issues that don't cluster into a learnable pattern.
  • Operator mismatch (3). operator does not exist: text = date — CAST guidance in the prompt is correct but not always followed.
  • Empty-string-to-int (1). A single dataset where an AI-fixed expression cast an empty string to integer.

The class counts sum to 28, not 18, because each dataset is tallied under every error class its output matches, so datasets with multiple error shapes appear in more than one class. Eighteen unique datasets remain after AI-Fix.
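The double-counting is easy to sketch. Assuming residual records that carry every error shape observed per dataset (the data below is illustrative, not the pilot's), per-class tallies exceed the unique-dataset count exactly as described:

```typescript
// Hypothetical residual records: each failed dataset carries every
// error shape observed in its output, not just one.
type Residual = { dataset: string; shapes: string[] };

const residuals: Residual[] = [
  { dataset: "ds_101", shapes: ["hallucinated_column"] },
  { dataset: "ds_102", shapes: ["hallucinated_column", "union_type_mismatch"] },
  { dataset: "ds_103", shapes: ["union_type_mismatch", "generic_syntax"] },
];

// Per-class counts tally every dataset matching that shape, so the
// totals can exceed the unique-dataset count.
const classCounts = new Map<string, number>();
for (const r of residuals) {
  for (const s of r.shapes) {
    classCounts.set(s, (classCounts.get(s) ?? 0) + 1);
  }
}

const uniqueDatasets = new Set(residuals.map((r) => r.dataset)).size;

console.log([...classCounts.entries()]); // counts sum to 5
console.log(uniqueDatasets);             // 3 unique datasets
```

With three datasets and five shape matches, the same 28-versus-18 gap appears in miniature.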

Three tiers

Tier 1 — fix it (the only ROI no-brainer)

Hallucinated columns. Twelve of the eighteen failures. Closing 6-8 of them lifts the pass rate by 1.8 to 2.4 percentage points. The fix is a single change inside AI-Fix r1: after the model returns SQL, parse it for <alias>.<column> references, cross-check against the schema block that was fed to the prompt, and if any references are not in the schema, build an r2 prompt with explicit allowed-columns and retry once.

A pre-step quotes PG-reserved-word identifiers like type, user, and order, which catches a subset of the residuals where the AI's SQL was correct except for missing quotes.

Risk is low: prompt-only behavior change, with fail-open semantics if the parser cannot read the SQL. Effort is 3-4 hours including Jest tests. This is the next item in the queue.
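A minimal sketch of the r1-to-r2 check, assuming a regex-level parse and a flat set of schema column names. The function names, schema, and SQL below are hypothetical, not ReportBridge's actual code:

```typescript
// Hypothetical subset of PG reserved words relevant to this pipeline.
const PG_RESERVED = new Set(["type", "user", "order"]);

// Pre-step: quote bare reserved-word identifiers so e.g. g.type
// becomes g."type" before the schema check runs.
function quoteReserved(sql: string): string {
  return sql.replace(/\.(\w+)/g, (match, col: string) =>
    PG_RESERVED.has(col.toLowerCase()) ? `."${col}"` : match
  );
}

// Collect <alias>.<column> references and return those absent from
// the schema block that was fed to the prompt.
function unknownColumns(sql: string, schemaColumns: Set<string>): string[] {
  const refs = [...sql.matchAll(/\b(\w+)\.(\w+)\b/g)].map((m) => m[2]);
  return [...new Set(refs)].filter((c) => !schemaColumns.has(c.toLowerCase()));
}

// Illustrative schema and model output echoing the hallucination examples.
const schema = new Set(["keydate", "make", "status"]);
const sql =
  "SELECT ec.keydatetime, mu.dmvmake FROM events ec JOIN makes mu ON ec.status = mu.status";

// A non-empty result would trigger the single r2 retry with an explicit
// allowed-columns list; an empty result means keep the r1 output.
console.log(unknownColumns(sql, schema)); // ["keydatetime", "dmvmake"]
```

If the regex parse itself fails, the real implementation would fall open and pass the r1 SQL through unchanged, per the risk note above.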

Tier 2 — fix it carefully (prompt first, scrub only if needed)

UNION type mismatch. Seven of the eighteen. Closing 4-7 lifts the pass rate by 1.2 to 2.1 percentage points. The earlier prompt block already addresses the recursive-CTE subset of this class, but seven non-recursive cases survived. The fix is a directional CAST rule: when the error is UNION types integer and character varying cannot be matched, cast the integer side to text — never the text side to integer, because that fails on non-numeric strings like 'N/A'. The rule generalizes: text is the universal sink; every other type can be losslessly cast to text, but not vice versa.

The careful part is the order of operations. Pilot the prompt-only change first against the seven specific datasets. Only if r2 is unreliable on these specific cases do you build a scrub-side cast injection — and then with conservative direction, because a wrong-direction cast can produce silently wrong data.
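The directional rule can be sketched as a small function, assuming the pipeline already knows each UNION arm's inferred PostgreSQL type. Names and types below are illustrative:

```typescript
// Hypothetical representation of one SELECT arm of a UNION.
type UnionArm = { expr: string; pgType: "integer" | "text" | "date" };

// Always cast the non-text arm to text; text is the universal sink.
// Casting 'N/A' to integer would fail at runtime, and a wrong-direction
// cast can silently corrupt data, so the direction is never negotiable.
function reconcileUnionArms(a: UnionArm, b: UnionArm): [string, string] {
  if (a.pgType === b.pgType) return [a.expr, b.expr];
  const toText = (arm: UnionArm) =>
    arm.pgType === "text" ? arm.expr : `CAST(${arm.expr} AS text)`;
  return [toText(a), toText(b)];
}

const [left, right] = reconcileUnionArms(
  { expr: "t.case_id", pgType: "integer" },
  { expr: "u.case_ref", pgType: "text" }
);
console.log(left, right); // CAST(t.case_id AS text) u.case_ref
```

A scrub-side version would inject these casts into the SQL text itself; the prompt-only version states the same rule to the model and is the one to pilot first.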

Tier 3 — leave it (and write down why)

Generic syntax errors. Five residuals, heterogeneous patterns, no clear cluster. Earlier CONV-N items already address this class at the prompt layer; the survivors are stragglers. Implementation cost is unbounded because there is no shared shape; expected lift is bounded by the per-failure rarity. Don't build.

Operator mismatch (text = date and friends). Three residuals. The prompt's CAST guidance is correct; it's just not always followed. Already deferred in an earlier audit; honor that deferral until the class becomes customer-blocking.

Empty-string-to-integer. Single dataset, niche failure mode in the AI-Fix output rather than the source SQL. If it falls out of the Tier 1 schema-strict change, great; if not, leave it as a known acceptable residual.

The math after Tier 1 and Tier 2 ship

Today the baseline is 312 of 330 scorable datasets, or 94.5% effective. If Tier 1 closes 7 and Tier 2 closes 6, the projection is:

  • After Tier 1: 319 / 330 ≈ 96.7%
  • After Tier 2: 325 / 330 ≈ 98.5%

The remaining five (3 operator mismatch + 1 empty-string-int + 1 stray syntax) are below the ROI threshold. They're what the triage calls AI-capability-ceiling cases. Pushing past that line without a customer-driven reason is how migration projects get stuck in the last-mile loop forever.

Why the deferral discipline matters more than the lift estimate

The hardest part of triaging the long tail is not finding the fixes. It's building the discipline to leave a residual alone once you've decided to. A migration codebase accretes CONV-N-style proposals faster than it ships them. Without a written deferral rationale, every quarterly review becomes a re-litigation of the same backlog.

Each deferred item earns a one-paragraph rationale that the next reader has to overrule:

  • "Already addressed by AI-Fix r2 in two consecutive pilots"
  • "Below ROI threshold — single-file niche"
  • "Customer policy: don't build until requested"

All of these are valid. "We didn't get to it" is not. The list of reasons becomes the institutional memory of why the codebase looks the way it does, and that's worth more than another two percentage points of pass rate.

What this looks like for your own pipeline

If you're running an AI-augmented SQL conversion in your own project, here's the punch list:

  1. Capture the post-fix residuals from a real pilot, not synthetic cases. Group them by error shape, not by report.
  2. Sort the shapes by count. The largest 1-2 classes will usually account for more than half the failures.
  3. For each top class: write the proposed fix in one paragraph. Estimate effort honestly (Jest tests + pilot included). Estimate expected lift in percentage points based on a realistic close-rate.
  4. Tier the proposals using the three questions in the FAQ below. Implement Tier 1; pilot Tier 2 prompt-only first; defer Tier 3 with written rationale.
  5. After each Tier 1/2 ship, re-pilot against the same baseline with --reset so the cache doesn't mask the actual lift.
  6. Stop building before you hit the AI-capability ceiling. Document the residuals as known acceptable.

FAQ

Why not just keep tuning until 100%?

The last 5% of a conversion pipeline is where ROI inverts. Each remaining failure usually represents a different shape of problem — they don't cluster the way the first 50% do, so the cost of fixing one rarely amortizes across others. The right move past ~95% is to admit the long tail is heterogeneous and only fix what produces a measurable lift on a real workload, then leave the rest documented as known-acceptable residuals with rationale. The triage exists to draw that line on purpose instead of by accident.

What is a CONV-N candidate, and why number them?

Each candidate is a specific code-or-prompt change targeting a specific failure shape — for example, CONV-18 targets hallucinated columns, CONV-19 targets non-recursive UNION type mismatches. Numbering matters because the same shape can recur in future workloads, so when a new pilot surfaces a failure, the first question becomes 'is this a known shape?' If yes, point at the existing CONV-N closure for context. If no, it earns a new number. The numbering doubles as version history for the pipeline.

How do you decide tier 1 vs tier 2 vs tier 3?

Three questions. (1) How much of the residual count does this shape close? Tier 1 closes 6+, tier 2 closes 4-7, tier 3 closes ≤3. (2) What's the implementation effort? Tier 1 is ≤4 hours of focused work; tier 2 is ≤6 hours plus a follow-up pilot; tier 3 is anything that needs more than that. (3) What's the risk? Tier 1 is prompt-only or fail-open code paths. Tier 2 may include a scrub layer with limited blast radius. Tier 3 includes anything that could produce silently-wrong data. The combined answer fits one of three tiers cleanly.
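The three questions read naturally as a decision function. A sketch, with thresholds taken from the answer above and everything else illustrative:

```typescript
// Hypothetical proposal record; field names are illustrative.
type Proposal = {
  closes: number;      // residuals this fix is expected to close
  effortHours: number; // honest estimate including tests and pilot
  risky: boolean;      // could it produce silently wrong data?
};

function tierOf(p: Proposal): 1 | 2 | 3 {
  // Silent-corruption risk is an automatic tier 3, regardless of lift.
  if (p.risky) return 3;
  if (p.closes >= 6 && p.effortHours <= 4) return 1;
  if (p.closes >= 4 && p.effortHours <= 6) return 2;
  return 3;
}

console.log(tierOf({ closes: 7, effortHours: 4, risky: false })); // 1
console.log(tierOf({ closes: 5, effortHours: 6, risky: false })); // 2
console.log(tierOf({ closes: 3, effortHours: 2, risky: false })); // 3
```

The function is deliberately blunt: a proposal that straddles two tiers gets the more conservative one, which is the same bias the written triage applies.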

What's the deferral discipline?

Each deferred item gets a written rationale that the next reader has to overrule. 'Already addressed by AI-Fix r2 in two consecutive pilots,' 'below ROI threshold,' or 'wait for a customer to ask' all count as valid rationale. 'We didn't get to it' does not. The deferral list is treated like a one-way door — quietly promoting an item back to active without re-reading the why is a code smell. The same rule applies to memory-stored decisions: the rationale is part of the decision.

ReportBridge handles SSRS to PostgreSQL migration with AI-powered T-SQL conversion, deterministic scrub passes, and a UI that makes the triage above visible per dataset. Three-layer validation catches silent failures before they reach production.

Try ReportBridge for your migration