A working demo and a working system aren’t the same thing, and with AI it’s surprisingly easy to mistake one for the other. A prototype can look finished, with fluent output and a clean run in front of the room, while still being a long way from doing that same job in production, day after day, on messy real data, inside systems that were never built with it in mind. Getting from the first to the second is most of the actual work, and it’s usually the part that gets under-budgeted.

Part of what makes it so easy to underestimate is that the impressive part is the bit you can see. When a model is already producing sharp, fluent answers, it feels like most of the job is done, when in fact that fluent output is the cheap part. The reliability, the integration, and the handling of everything that doesn’t go to plan are where the real cost sits, and almost none of that is visible while you’re watching a demo. The gap is wide enough that one widely cited MIT study last year found around 95% of corporate AI pilots delivered no measurable return. I’d treat that figure with caution, but it broadly matches what I’ve seen.

So where does that cost actually go? There’s no single answer, and I’d be wary of anyone who claims otherwise. Plenty of projects stall on a vague goal, an absent owner, a budget that runs out, or a team that doesn’t really want to work differently. But a few problems come up again and again, and they tend to have less to do with the model than people expect.

The first is data. A demo usually runs on a tidy sample that someone pulled together by hand, whereas the real data is spread across several systems, partly out of date, full of fields that don’t quite line up, and shifting all the time. The model hasn’t got any worse; the ground underneath it has just moved, and that’s often enough to throw the whole thing off.
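
To make that concrete, here is a minimal sketch of the kind of audit worth running on real records before any model sees them. The field names, the 90-day freshness window, and the record shape (dictionaries with a timezone-aware `updated_at`) are all assumptions for illustration, not a prescription.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema and freshness window; swap in whatever your systems hold.
REQUIRED_FIELDS = ("customer_id", "email", "updated_at")
MAX_AGE = timedelta(days=90)


def audit(records, now=None):
    """Count records that are incomplete or out of date."""
    now = now or datetime.now(timezone.utc)
    report = {"total": 0, "missing_fields": 0, "stale": 0}
    for rec in records:
        report["total"] += 1
        # A field counts as missing if it is absent, None, or an empty string.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            report["missing_fields"] += 1
            continue
        # Assumes updated_at is a timezone-aware datetime.
        if now - rec["updated_at"] > MAX_AGE:
            report["stale"] += 1
    return report
```

Running something like this against the production source, rather than the hand-picked sample, tends to show quickly how far apart the two are.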

The second is integration. Sitting a model next to a fifteen-year-old core system has very little to do with prompting and a lot to do with the unglamorous work around it: latency budgets, sensible retries when a call times out, doing something safe when an answer comes back malformed, and generally making all of it behave inside systems that expect fast, predictable, synchronous responses.
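
As a rough illustration of that surrounding work, the sketch below wraps a model call with a timeout, a few retries with backoff, and a safe fallback when the reply is malformed. The `call_model` function, the JSON reply shape, and the `needs_review` fallback are hypothetical stand-ins; a real client will have its own signature and error types.

```python
import json
import time
from typing import Callable, Optional

# Hypothetical safe default: hand the item to a person instead of guessing.
FALLBACK = {"status": "needs_review", "answer": None}


def parse_answer(raw: str) -> Optional[dict]:
    """Return the parsed reply, or None if it is not the shape we expect."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return None
    return data


def ask_with_guardrails(
    call_model: Callable[[str, float], str],  # assumed: (prompt, timeout_s) -> raw text
    prompt: str,
    timeout_s: float = 2.0,
    max_attempts: int = 3,
    backoff_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Try the model a few times; never pass a malformed reply downstream."""
    for attempt in range(max_attempts):
        try:
            raw = call_model(prompt, timeout_s)
        except TimeoutError:
            raw = None  # treat a timeout like any other failed attempt
        if raw is not None:
            parsed = parse_answer(raw)
            if parsed is not None:
                return {"status": "ok", "answer": parsed["answer"]}
        if attempt < max_attempts - 1:
            # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults above.
            sleep(backoff_s * 2 ** attempt)
    return dict(FALLBACK)
```

The point of the design is that the caller always gets back one of two predictable shapes, which is what an older synchronous system needs far more than it needs a clever answer.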

The third is the one that’s often skipped, and the one I’d worry about most: agreeing on what “good” even means before anything gets built. In a demo you can eyeball ten outputs and form a view, but in production there are thousands of them, nobody is reading each one, and when the model is wrong it’s rarely wrong in an obvious way; it’s wrong while sounding completely sure of itself. Unless you’ve decided in advance what a good result looks like and built some way to measure it at scale, you can’t really know whether the thing is working, so it ends up half-trusted and slowly falling out of use.
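
One way to pin that down is to write the definition of “good” as code and score the model against a labelled set. The sketch below is deliberately simple and rests on assumptions: `Case`, the exact-match-after-normalising check in `is_good`, and the 0.9 threshold are placeholders for whatever your team actually agrees on.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical bar, agreed before anything is built.
AGREED_THRESHOLD = 0.9


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def is_good(output: str, expected: str) -> bool:
    # Placeholder definition of "good": exact match after normalising.
    return normalise(output) == normalise(expected)


def pass_rate(model: Callable[[str], str], cases: Iterable[Case]) -> float:
    """Fraction of labelled cases where the model's output counts as good."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one labelled case")
    passed = sum(is_good(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def meets_bar(model, cases, threshold: float = AGREED_THRESHOLD) -> bool:
    return pass_rate(model, cases) >= threshold
```

Exact match is rarely the right check in practice, but the shape holds: once the rule is written down, it can run across thousands of outputs without anyone reading them one by one.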

None of this is the AI’s fault. Once the model becomes the easy part, the difficulty just moves somewhere else: into your data, your systems, and the decisions you make long before anyone writes a line of code.

The teams I’ve seen handle this well tend not to start with the model at all. They start with duller questions: where the data actually lives and whether it can be trusted, what the thing has to connect to and what that will really take, what a good outcome looks like and how anyone will know when they’ve got one. It’s a slower and far less impressive way to begin, but they’re usually the ones still running something a year later.

I don’t want to make this sound tidy, because it isn’t. The gap between a demo and production is only one piece of a much bigger picture, and plenty of projects come apart for reasons that sit well outside it. It’s just the piece I see underestimated most often, and one of the cheaper ones to get right, provided you look at it before a good demo sets expectations you’ll then struggle to meet.

If you’re about to start something, or you’ve got a prototype that won’t quite make the jump, it’s worth working out where the real difficulty is going to be before you commit to it, because it’s usually not where the demo made it look.


Trying to get something past the demo and into production? I’d be interested to hear where it’s getting stuck. Reach me on LinkedIn.