Jun 13, 2026

Text-to-SQL Is a Product Problem, Not a Prompting Problem

The first time we put a Text-to-SQL system in front of real users, the answers were bad. Not “needs a better prompt” bad — bad in a way that makes people stop trusting it after the second wrong number.

The instinct in that moment is to reach for the model: a sharper prompt, a bigger context window, a fine-tune. That’s the wrong lesson. The demo was never the hard part. Getting a business user to trust an answer they didn’t write the SQL for — that’s the hard part, and it’s a product problem, not a prompting problem.

Trash in, trash out

The single biggest source of wrong answers wasn’t the model misreading the question. It was the data underneath: tables nobody had described, columns whose names lie, business terms that mean three different things to three different teams. If the system doesn’t know which of four “customer” tables is the real one, or that two teams mean different things by “active user,” no prompt will save it.

This is unglamorous and it is the whole game. Most of the work that makes Text-to-SQL trustworthy happens before the model is ever called: cleaning up descriptions, pinning down a glossary, mapping relationships, labeling what’s sensitive. Trash in, trash out. A model on top of undescribed data is a confident liar.
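To make that pre-model work concrete, here is a minimal sketch of a readiness check over the data context. Everything in it is hypothetical and for illustration only: the class names, the fields, the `readiness_issues` function, and the sample table and team names are assumptions, not a description of any real system. Relationship mapping is left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnInfo:
    name: str
    description: str = ""    # empty means nobody has described it yet
    sensitive: bool = False  # label for data that must not be exposed


@dataclass
class TableInfo:
    name: str
    description: str = ""
    columns: list[ColumnInfo] = field(default_factory=list)


@dataclass
class GlossaryTerm:
    term: str
    definitions: dict[str, str]  # team name -> what that team means by the term


def readiness_issues(tables: list[TableInfo], glossary: list[GlossaryTerm]) -> list[str]:
    """List the gaps to fix before any model is called."""
    issues = []
    for table in tables:
        if not table.description.strip():
            issues.append(f"table {table.name}: no description")
        for col in table.columns:
            if not col.description.strip():
                issues.append(f"column {table.name}.{col.name}: no description")
    for entry in glossary:
        # A term is ambiguous if teams have written down different meanings.
        distinct = {d.strip().lower() for d in entry.definitions.values()}
        if len(distinct) > 1:
            issues.append(f"term '{entry.term}': {len(distinct)} conflicting definitions")
    return issues


# Hypothetical sample data.
tables = [
    TableInfo("customers_v2", "", [ColumnInfo("id", "Primary key"), ColumnInfo("flag2")]),
]
glossary = [
    GlossaryTerm("active user", {"growth": "logged in within 30 days", "finance": "has a paid plan"}),
]
for issue in readiness_issues(tables, glossary):
    print(issue)
```

A check like this is deliberately dumb. Its job is to turn "the data is messy" into a list someone can work through before anyone touches a prompt.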

It also means the first conversation with a customer is about expectations, not technology. You are not selling magic AI. You are selling a disciplined system that is only as good as the data and the methodology you feed it — and saying that out loud early is what keeps the project from dying at the first wrong answer.

Trust is a set of product decisions

Once the data foundation is there, trust comes from a handful of deliberate product choices — none of them about the model itself:

Let it refuse. The most important behavior is the system knowing when it can’t answer. If the data can’t support a reliable answer, it should say so, not invent a plausible one. A system that abstains is worth more than one that’s confidently wrong.
Show the plan, ask for confirmation. Before running anything, show the user what the system understood and what it’s about to do, and let them confirm. Old, boring UX, and it works.
Learn from real use. Every question, correction, and abandoned query is signal. Continuously tuning on real interactions beats any one-time prompt-engineering pass.
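The three choices above can be sketched as one small gate that sits between the model's draft and the database. This is an illustration under stated assumptions, not a real implementation: `Draft`, `handle`, and the `confirm`, `run`, and `log` parameters are hypothetical names, and how a draft gets marked as unresolved is left to whatever produces it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    question: str
    sql: Optional[str]     # None when no query could be grounded in the data
    plan: str              # plain-language summary of what the query will do
    unresolved: list[str]  # terms or tables the system could not pin down


def handle(draft: Draft,
           confirm: Callable[[str], bool],
           run: Callable[[str], object],
           log: list) -> dict:
    if draft.sql is None or draft.unresolved:
        # Let it refuse: no reliable answer, so say so instead of guessing.
        outcome = {"status": "refused", "reason": draft.unresolved or ["no supported query"]}
    elif not confirm(draft.plan):
        # Show the plan, ask for confirmation: nothing runs without a yes.
        outcome = {"status": "cancelled"}
    else:
        outcome = {"status": "answered", "result": run(draft.sql)}
    # Learn from real use: every outcome is recorded, including refusals.
    log.append({"question": draft.question, **outcome})
    return outcome
```

The ordering is the design choice: refusal is checked before the user is ever asked to confirm, so nobody is invited to approve a plan the system itself doesn't believe in.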

What “quality” actually means here

The trap is to measure whether the generated SQL is syntactically correct, or whether it runs. That’s table stakes and tells you almost nothing. The real question is: would a domain expert accept this answer?

So that’s what to measure: not “did it produce SQL,” but the share of answers a human reviewer would sign off on, measured against a curated set of reference questions where the right answer is known. Two numbers matter more than raw accuracy:

Acceptance rate — of the answers it gives, how many would an expert actually trust and use.
Refusal quality — when the data can’t support an answer, does it correctly decline instead of guessing?
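As a rough sketch, both numbers fall out of a reviewed reference set with three flags per question. The `Reviewed` record and its field names are assumptions made for this example, and the sample cases are invented purely to show the arithmetic. Refusal quality is computed here as the share of unanswerable questions the system declined, which is one reasonable reading of the definition above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reviewed:
    answerable: bool  # per the reference set: can the data support an answer?
    answered: bool    # did the system answer (True) or decline (False)?
    accepted: bool    # did the expert reviewer sign off on the answer given?


def acceptance_rate(cases: list[Reviewed]) -> Optional[float]:
    """Of the answers given, the fraction an expert accepted."""
    answered = [c for c in cases if c.answered]
    if not answered:
        return None
    return sum(c.accepted for c in answered) / len(answered)


def refusal_quality(cases: list[Reviewed]) -> Optional[float]:
    """Of the questions the data cannot support, the fraction declined."""
    unanswerable = [c for c in cases if not c.answerable]
    if not unanswerable:
        return None
    return sum(not c.answered for c in unanswerable) / len(unanswerable)


# Invented sample: three good answers, one fabricated answer, one correct refusal.
cases = [
    Reviewed(answerable=True, answered=True, accepted=True),
    Reviewed(answerable=True, answered=True, accepted=True),
    Reviewed(answerable=True, answered=True, accepted=True),
    Reviewed(answerable=False, answered=True, accepted=False),
    Reviewed(answerable=False, answered=False, accepted=False),
]
print(acceptance_rate(cases))  # 0.75
print(refusal_quality(cases))  # 0.5
```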

A system that answers 95% of questions but fabricates 10% of them is worse, in an enterprise, than one that answers 70% and reliably refuses the rest. Someone’s reputation is on the line when they repeat that number in a meeting. Confident hallucination isn’t a small bug — it’s the thing that kills adoption.

The counterintuitive part

I came in assuming the hard, defensible part would be the model — getting it to generate correct SQL from messy natural language. It’s the opposite. Generation is commoditizing fastest; soon every data catalog will auto-write column descriptions and passable queries. That’s not where the product lives.

The durable, hard part is everything around the model: the data context that makes answers correct, and the machinery that makes them provably trustworthy — confidence, human acceptance, the discipline to refuse. The product isn’t “AI that answers questions about your data.” It’s “AI that knows when it can’t.” That distinction turned out to be the whole value.

If you take one thing

Before you reach for Text-to-SQL — or any AI on top of your data — fix the data quality and the methodology first. The model is the easy part. The trust is the product.