Field Notes
Seven Failure Patterns in AI Implementations
Field observations from thirty engagements across nine industries.
The pattern library
After enough engagements, you stop being surprised. The AI press tends to cover exotic failure modes — hallucination, prompt injection, model drift. In our engagements, these are real but rare. What kills implementations is older, dumber, and almost always organizational.
The seven patterns below account for the majority of the failed or stalled AI implementations we have been asked to investigate. They overlap. Most failed projects exhibit three or four of them simultaneously. A project with only one is usually recoverable. A project with five is almost always a write-off.
1. The champion with the wrong incentive
The pilot is sponsored by a VP whose variable compensation is tied to cost reduction. The vendor is chosen to fit a discretionary budget that does not require CFO approval. Eight months later, the cost target is hit and the capability is exactly as small as the budget envelope that shaped it. A larger, more valuable version is never built, because expanding it would invite scrutiny. The fix: tie the champion to a growth metric from the outset — revenue enabled, capacity reallocated, customer-experience gains measured. Cost incentives produce small AI. Growth incentives produce durable AI.
2. The shadow pipeline
The AI system runs on a pipeline that lives alongside the production pipeline. A 200-line Python script on a data engineer's laptop stitches together the exports that feed the agent. It survives six to nine months. Then the engineer leaves, the production schema changes, the exports silently break, and the AI starts producing outputs that look fine but are based on stale data. Lag between breakage and detection is typically four to eleven weeks. The fix is architectural: AI systems must consume from the same schema-governed pipelines as production reporting.
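One cheap mitigation while the architectural fix is underway: validate every export against an explicit schema contract so breakage is loud instead of silent. A minimal sketch, assuming a tabular export; the column names and types below are illustrative, not from any real pipeline.

```python
# Fail fast when an upstream export no longer matches the schema the AI
# pipeline expects. Columns and types here are placeholders.
EXPECTED_SCHEMA = {
    "account_id": str,
    "mrr_usd": float,
    "last_activity": str,  # ISO-8601 date string
}

def validate_rows(rows):
    """Raise instead of silently feeding misshapen or stale data downstream."""
    for i, row in enumerate(rows):
        missing = EXPECTED_SCHEMA.keys() - row.keys()
        if missing:
            raise ValueError(f"row {i}: missing columns {sorted(missing)}")
        for col, typ in EXPECTED_SCHEMA.items():
            if not isinstance(row[col], typ):
                raise TypeError(
                    f"row {i}: {col} is {type(row[col]).__name__}, expected {typ.__name__}"
                )
    return rows
```

A check like this does not replace schema-governed pipelines, but it collapses the four-to-eleven-week detection lag to the first run after breakage.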
3. The evaluation that never runs in production
The team builds a thorough eval suite. It passes beautifully on the curated holdout. Then the system ships and the eval is never run against production outputs again. Six months later, quality has drifted — nobody notices until a customer complaint, a wrong answer to an executive, a regulatory surprise. The fix: run a sampled, automated eval on production outputs daily, publish the trend, alert on degradation. Five to ten days of work. Prevents three to twelve months of drift.
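The daily production eval can be very small. The sketch below assumes the team already has a scoring function from its offline suite (exact match, rubric, LLM judge — whatever it is); the sample size and alert threshold are illustrative defaults, not recommendations.

```python
import random

def daily_eval(production_outputs, grade, sample_size=100, alert_threshold=0.90):
    """Grade a random sample of recent production outputs.

    `grade` is the same scoring function the offline eval suite uses,
    returning True/False (or 0..1) per output. Returns (score, alert).
    """
    sample = random.sample(production_outputs, min(sample_size, len(production_outputs)))
    score = sum(grade(o) for o in sample) / len(sample)
    return score, score < alert_threshold
```

Wire the pair into whatever scheduler and dashboard the team already runs reporting on; the point is the daily trend line, not the harness.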
4. The integration that is actually a contract
The pilot calls a vendor API from a script. It works. It ships. Nobody negotiated rate limits, error handling, or fallback. Then the vendor changes pricing, deprecates an endpoint, or has an outage. In one engagement, a client was rate-limited to 10% of normal throughput for three weeks because they had crossed an undocumented usage threshold. The fix is procurement-adjacent: load-bearing API dependencies need the same contract discipline as any other mission-critical vendor — SLAs, penalty clauses, change-notice windows, fallback requirements.
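The engineering counterpart to the contract discipline is an explicit fallback path, so a vendor outage degrades the system visibly rather than crashing it. A minimal sketch with exponential backoff; both callables are placeholders for the real primary API and whatever degraded mode (cache, secondary vendor, queued retry) the team has agreed on.

```python
import time

def call_with_fallback(primary, fallback, retries=3, backoff_s=1.0):
    """Try the primary vendor call with exponential backoff; on repeated
    failure, switch to the agreed fallback instead of failing silently."""
    for attempt in range(retries):
        try:
            return primary()
        except Exception:
            time.sleep(backoff_s * 2 ** attempt)
    return fallback()
```

The design choice worth noting: the fallback is a named, tested code path, which forces the "what happens when the vendor is down" conversation before the outage rather than during it.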
5. The "human in the loop" that is actually a rubber stamp
The system has human review on every output. In practice, one reviewer sees 200 items a day, the UI is built for speed, and approval rates are above 97%. What you have built is not oversight. It is legal cover. In regulated industries this is worse than no human at all, because it creates a defensible-looking process that is actually ungoverned. The fix borrows from aviation's Crew Resource Management literature: reviewers must be empowered to disagree, the system must make disagreement cheap and frequent, and the disagreement signal must be captured as training data.
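Rubber-stamping is detectable from review telemetry the system likely already logs. A minimal sketch over (approved, seconds-spent) pairs; the thresholds are illustrative, not calibrated to any industry.

```python
def review_health(decisions, max_approval=0.95, min_seconds=5.0):
    """decisions: list of (approved: bool, seconds_spent: float) per item.

    Flags the combination that signals rubber-stamping: near-universal
    approval plus review times too short for genuine scrutiny.
    """
    n = len(decisions)
    approval_rate = sum(1 for approved, _ in decisions if approved) / n
    mean_seconds = sum(s for _, s in decisions) / n
    return {
        "approval_rate": approval_rate,
        "mean_seconds": mean_seconds,
        "rubber_stamp": approval_rate > max_approval and mean_seconds < min_seconds,
    }
```

Publishing this metric alongside the eval trend makes the disagreement rate a managed number rather than an invisible one.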
6. The roadmap written in output, not outcomes
"Add document understanding," "expand to second language," "integrate with CRM." These are outputs. Six months in, everything has shipped, and the executive committee asks what has actually changed. The surface area grew; the business did not notice. The fix is a two-column roadmap: outputs (what engineering ships) and outcomes (what business metric moves, by how much, by when, measured how). Every output connects to an outcome. Outputs without outcomes get cut.
7. The cost curve that was never modeled
The team launched on pilot pricing. Nobody modeled unit economics at production volume. Six months in, inference costs are 11x the pilot budget, observability storage is a surprise, and finance is asking questions the team cannot answer. The fix connects back to our work on unit economics: decompose costs into the six layers, project each layer's scaling behavior independently, stress-test against three volume scenarios. A day of work per system. It pays for itself the first time finance asks.
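The day of modeling fits in a few lines once each layer has a unit cost and a scaling assumption. The sketch below is a shape, not the method: the actual six layers are defined in the unit-economics piece referenced above, and the layer names, unit costs, and scaling exponents here are placeholders.

```python
# Placeholder layers -- the real six-layer decomposition is defined
# elsewhere; these names, costs, and exponents are illustrative only.
LAYERS = {
    "inference":             {"unit_cost": 0.002,  "exponent": 1.0},  # linear in volume
    "observability_storage": {"unit_cost": 0.0004, "exponent": 1.0},
    "eval_sampling":         {"unit_cost": 0.0001, "exponent": 0.5},  # sampled, sublinear
}

def monthly_cost(volume):
    """Project total monthly cost, letting each layer scale independently."""
    return sum(l["unit_cost"] * volume ** l["exponent"] for l in LAYERS.values())

# Stress-test against three volume scenarios (requests/month, illustrative).
scenarios = {name: monthly_cost(v) for name, v in
             {"pilot": 10_000, "expected": 250_000, "stress": 1_000_000}.items()}
```

The per-layer exponents are the whole point: a layer that looks negligible at pilot volume but scales linearly is exactly the kind of surprise that produces the 11x bill.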
What the patterns share
These seven failure modes share a structural feature: none are about the AI itself. They are about the organizational container — incentives, pipelines, contracts, governance, roadmaps, economics. The practical consequence is that most AI "implementation help" focuses on the wrong layer. Picking a better model, writing a better prompt, or choosing a better vendor addresses less than 20% of what actually kills implementations. The other 80% is organizational, procedural, and financial. The AI is usually the easiest part. The hard part is the company.
If you recognize two or more of these patterns in an implementation you own, the signal is not that you are failing. It is that you are at a fairly normal point in the AI maturity curve, and that addressing them early is much cheaper than addressing them late. [Let us help](/contact).