Why most AI integrations die between the demo and production
Three patterns we see kill AI projects, and the engineering choices that prevent each one.
The first AI demo always works. Hand-curated inputs, the right model, ten minutes of prompt-tweaking, a board meeting, excitement, budget. Six months later the team is still trying to ship it.
This isn’t because AI doesn’t work. It’s because the demo and production solve different problems. The demo proves a concept. Production has to handle every input a real customer will throw at it without falling over, hallucinating, or running up a $40k OpenAI bill on a Tuesday.
After enough engagements, three patterns repeat.
1. No evals
Most AI projects we audit have zero automated regression tests. Someone writes the prompt, eyeballs a few outputs, ships it. Two weeks later someone tweaks the prompt to fix a different edge case. The original outputs now silently regress. Nobody finds out until a customer support ticket lands.
The fix isn’t fancy. Write evals before features: a small set (20–50) of input/expected-output pairs that runs on every prompt change. When you swap models, change vendors, or refactor retrieval, the eval catches the regression in CI, not in production a month later.
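A minimal sketch of what that harness can look like, run as a pytest suite so CI picks it up. The `cases.jsonl` file and the `myapp.llm.generate` wrapper are placeholder names you'd swap for your own; exact-match assertions suit structured outputs, and free-form tasks need a looser check.

```python
# evals/test_prompt_regressions.py  (file and module names are placeholders)
import json
from pathlib import Path

import pytest

from myapp.llm import generate  # your wrapper around the model call

# 20-50 cases, one JSON object per line: {"id": ..., "input": ..., "expected": ...}
CASES = [
    json.loads(line)
    for line in Path(__file__).with_name("cases.jsonl").read_text().splitlines()
    if line.strip()
]


@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_prompt_does_not_regress(case):
    output = generate(case["input"])
    # Exact match works for structured outputs; swap in a contains/similarity
    # check for free-form text.
    assert output == case["expected"], f"regressed on {case['id']}: got {output!r}"
```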
If you can’t write evals for it, you don’t understand the problem well enough to ship it.
2. Treating AI like deterministic code
LLMs are stochastic. Same input, different outputs. APIs that “almost always” work, until they don’t.
Code that calls an LLM and assumes the response will parse, the JSON will be valid, the tool call will happen, the user will get a useful answer — that code breaks in interesting ways at scale.
We write LLM calls assuming they’ll fail. Structured outputs (function calling, schemas), retries with exponential backoff, fallback chains (if Claude fails, try GPT; if GPT fails, return a deterministic answer). Latency budgets that cap how long a user waits before we serve a degraded response.
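Roughly, the shape of that call path, as a sketch rather than a drop-in implementation. The schema check, backoff numbers, and latency budget are illustrative, and `call_claude` / `call_gpt` stand in for thin wrappers around the vendor SDKs.

```python
import json
import time


class LLMCallError(Exception):
    pass


def parse_answer(raw: str) -> dict:
    """Structured-output check: reject anything that isn't valid JSON with the fields we need."""
    data = json.loads(raw)  # raises ValueError on invalid JSON
    if "answer" not in data:
        raise LLMCallError("missing 'answer' field")
    return data


def with_retries(call, prompt: str, attempts: int = 3, deadline_s: float = 5.0) -> dict:
    """Retry with exponential backoff, but never past the latency budget."""
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return parse_answer(call(prompt))
        except (LLMCallError, ValueError, TimeoutError):
            elapsed = time.monotonic() - start
            if attempt == attempts - 1 or elapsed >= deadline_s:
                break
            time.sleep(min(0.2 * 2 ** attempt, deadline_s - elapsed))  # 0.2s, 0.4s, 0.8s...
    raise LLMCallError("exhausted retries or latency budget")


def answer(prompt: str, model_calls) -> dict:
    """Fallback chain: try each model in order; if all fail, serve a degraded response."""
    for call in model_calls:
        try:
            return with_retries(call, prompt)
        except LLMCallError:
            continue
    return {"answer": None, "degraded": True}  # deterministic last resort


# Usage: answer(prompt, [call_claude, call_gpt]), where each callable is a thin
# SDK wrapper that returns the raw model text.
```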
The boring engineering that turns AI from “interesting” into “doesn’t break the rest of the product.”
3. Wrong UX
The default AI UX in 2026 is a chatbot. Type a question, get an answer. This is almost always the wrong shape.
Most AI work that ships well is invisible. It’s a better search index. It’s a one-click “categorize this” button. It’s a draft email already written when the user opens the compose window. It’s an agent that runs in the background and surfaces actionable items.
Chatbots punish users with a blank input box. They have to know what to ask, how to phrase it, what the bot can do. Invisible AI makes a decision the user would have made anyway, and saves them the typing.
When we audit a use-case, the first question is: can we remove the prompt entirely? The answer is yes more often than the team expects.
What we actually do
Before a line of LLM code, a use-case audit. Score each candidate by user value × technical feasibility × cost-per-call. Cut anything where the answer to “could we do this with a SQL query and a CASE statement” is yes.
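A back-of-the-envelope version of that scoring, with illustrative names and numbers. Each axis gets a 1–5 rating, and the cost axis is rated so that cheaper calls score higher, so the product rewards high-value, feasible, cheap use-cases.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    user_value: int   # 1-5: how much does this help the user?
    feasibility: int  # 1-5: can today's models hit the quality and latency bar?
    cost_score: int   # 1-5: 5 = pennies per call, 1 = dollars per call

    @property
    def score(self) -> int:
        return self.user_value * self.feasibility * self.cost_score


candidates = [
    Candidate("auto-categorize inbound tickets", 4, 5, 4),
    Candidate("open-ended chatbot over all docs", 3, 2, 2),
]

for c in sorted(candidates, key=lambda c: c.score, reverse=True):
    print(f"{c.score:3d}  {c.name}")
```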
Then a spike: end-to-end, in your stack, with real data, behind a feature flag. Eval harness from day one. Streaming, retries, fallbacks before the public flag flip.
Then we measure. Not just whether it works, but whether the latency budget holds, whether the cost-per-call is sustainable, whether quality stays above the threshold across the long tail of inputs the demo never tested.
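One rough way to keep those numbers honest after the flag flips, sketched with illustrative budgets and a print standing in for real alerting. The quality signal here assumes a sampled share of production calls is run back through eval-style checks.

```python
from collections import deque

LATENCY_BUDGET_S = 2.0    # illustrative latency target per call
COST_BUDGET_USD = 0.02    # illustrative sustainable cost per call
QUALITY_THRESHOLD = 0.9   # share of sampled calls that should pass eval checks

recent_latency = deque(maxlen=500)
recent_cost = deque(maxlen=500)
recent_quality = deque(maxlen=500)


def record_call(latency_s: float, cost_usd: float, passed_eval: bool) -> None:
    """Track rolling averages and flag budget drift; wire the prints to real alerting."""
    recent_latency.append(latency_s)
    recent_cost.append(cost_usd)
    recent_quality.append(1.0 if passed_eval else 0.0)
    if sum(recent_latency) / len(recent_latency) > LATENCY_BUDGET_S:
        print("ALERT: rolling latency over budget")
    if sum(recent_cost) / len(recent_cost) > COST_BUDGET_USD:
        print("ALERT: rolling cost-per-call over budget")
    if sum(recent_quality) / len(recent_quality) < QUALITY_THRESHOLD:
        print("ALERT: quality below threshold on the long tail")
```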
AI is plumbing
The interesting part of an AI project is usually 10% of the work. The other 90% is the same disciplined engineering that makes any feature ship: failure modes, observability, evals, cost guards, caching.
The teams that get AI to production are the ones that take the plumbing seriously. The teams whose AI projects die between demo and launch are the ones who thought the demo was the hard part.
If you’re scoping an AI integration and want a second opinion on whether it’ll survive contact with production, we do those. Discovery calls are free and end with a yes or no, not a sales pitch.